Hi, Spark generally runs slower in local mode than on a cluster. Cluster machines usually have higher-spec hardware, and tasks are distributed across workers, which produces results faster. So you will always see a difference in speed between running locally and running on a cluster. Try running the same app on a cluster and evaluate the speed there.
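For example, once the app is packaged as a jar, it could be submitted to a standalone cluster with spark-submit instead of using local[8]. This is only a sketch — the master URL, class name, and jar name below are placeholders for your own setup:

```shell
# Submit the streaming app to a standalone Spark cluster.
# spark://master-host:7077, Tester, and tester.jar are placeholders.
spark-submit \
  --class Tester \
  --master spark://master-host:7077 \
  --executor-memory 2G \
  tester.jar
```

Note that a --master passed on the command line is overridden by a hard-coded setMaster(...) in the code, so you would also remove the setMaster("local[8]") call from the SparkConf before submitting.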
Thanks

On Thu, Nov 20, 2014 at 6:52 PM, Blackeye [via Apache Spark User List] <ml-node+s1001560n1937...@n3.nabble.com> wrote:

> I am using Spark Streaming 1.1.0 locally (not in a cluster). I created a
> simple app that parses the data (about 10,000 entries), stores it in a
> stream and then makes some transformations on it. Here is the code:
>
>     def main(args: Array[String]) {
>       val master = "local[8]"
>       val conf = new SparkConf().setAppName("Tester").setMaster(master)
>       val sc = new StreamingContext(conf, Milliseconds(110000))
>       val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
>       val parsedStream = parse(stream)
>       parsedStream.foreachRDD(rdd =>
>         println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))
>       val result1 = parsedStream
>         .filter(entry => entry.symbol.contains("walking")
>           && entry.symbol.contains("true") && entry.symbol.contains("id0"))
>         .map(_.time)
>       val result2 = parsedStream
>         .filter(entry => entry.symbol == "disappear" && entry.symbol.contains("id0"))
>         .map(_.time)
>       val result3 = result1
>         .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))
>       result3.foreachRDD(rdd =>
>         println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))
>       sc.start()
>       sc.awaitTermination()
>     }
>
>     def parse(stream: DStream[String]) = {
>       stream.flatMap { line =>
>         val entries = line.split("assert").filter(entry => !entry.isEmpty)
>         entries.map { tuple =>
>           val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
>           tuple match {
>             case pattern(symbol, time) => new Data(symbol, time.toInt)
>           }
>         }
>       }
>     }
>
>     case class Data(symbol: String, time: Int)
>
> I have a batch duration of 110,000 milliseconds in order to receive all
> the data in one batch. I believed that, even locally, Spark is very
> fast. In this case, it takes about 3.5 sec to execute the rule (between
> "RULE STARTS" and "RULE ENDS").
> Am I doing something wrong, or is this the expected time? Any advice?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Slow-performance-in-spark-streaming-tp19371p19476.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.