Hi, Spark generally runs slower in local mode than on a cluster. Cluster machines usually have higher-spec hardware, and tasks are distributed across workers, which produces results faster. So you will always see a difference in speed between running locally and running on a cluster. Try running the same app on a cluster and evaluate the speed there.
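For example, once the app is packaged as a jar, it could be submitted to a standalone cluster with spark-submit instead of using local[8]. This is only a sketch — the master URL, class name, and jar name below are placeholders for your own setup:

```shell
# Submit the streaming app to a standalone Spark cluster.
# spark://master-host:7077, Tester, and tester.jar are placeholders.
spark-submit \
  --class Tester \
  --master spark://master-host:7077 \
  --executor-memory 2G \
  tester.jar
```

Note that a --master passed on the command line is overridden by a hard-coded setMaster(...) in the code, so you would also remove the setMaster("local[8]") call from the SparkConf before submitting.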
Thanks

On Thu, Nov 20, 2014 at 6:52 PM, Blackeye [via Apache Spark User List] <ml-node+s1001560n1937...@n3.nabble.com> wrote:

> I am using Spark Streaming 1.1.0 locally (not in a cluster). I created a
> simple app that parses the data (about 10,000 entries), stores it in a
> stream and then makes some transformations on it. Here is the code:
>
>     def main(args: Array[String]) {
>       val master = "local[8]"
>       val conf = new SparkConf().setAppName("Tester").setMaster(master)
>       val sc = new StreamingContext(conf, Milliseconds(110000))
>       val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
>       val parsedStream = parse(stream)
>       parsedStream.foreachRDD(rdd =>
>         println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))
>       val result1 = parsedStream
>         .filter(entry => entry.symbol.contains("walking")
>           && entry.symbol.contains("true") && entry.symbol.contains("id0"))
>         .map(_.time)
>       val result2 = parsedStream
>         .filter(entry => entry.symbol == "disappear" && entry.symbol.contains("id0"))
>         .map(_.time)
>       val result3 = result1
>         .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))
>       result3.foreachRDD(rdd =>
>         println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))
>       sc.start()
>       sc.awaitTermination()
>     }
>
>     def parse(stream: DStream[String]) = {
>       stream.flatMap { line =>
>         val entries = line.split("assert").filter(entry => !entry.isEmpty)
>         entries.map { tuple =>
>           val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
>           tuple match {
>             case pattern(symbol, time) => new Data(symbol, time.toInt)
>           }
>         }
>       }
>     }
>
>     case class Data(symbol: String, time: Int)
>
> I have a batch duration of 110,000 milliseconds in order to receive all
> the data in one batch. I believed that, even locally, Spark is very
> fast. In this case, it takes about 3.5 sec to execute the rule (between
> "RULE STARTS" and "RULE ENDS").
> Am I doing something wrong, or is this the expected time? Any advice?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Slow-performance-in-spark-streaming-tp19371p19476.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.