RE: Creating topology in spark streaming
What is called a Bolt in Storm is essentially the combination of a Transformation/Action and a DStream RDD in Spark. So, to achieve higher parallelism for a specific Transformation/Action on a specific DStream RDD, simply repartition that DStream to the required number of partitions, which directly corresponds to the number of parallel tasks (threads) that will process it.
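Evo's point can be sketched in a few lines of the Java API. This is illustrative only: it assumes an already-created JavaStreamingContext `jssc`, and `expensiveSentiment` is a hypothetical user function, not a Spark API.

```java
// Illustrative fragment -- assumes `jssc` (JavaStreamingContext) exists
// and that expensiveSentiment(String) is the user's own heavy function.
JavaDStream<String> tweets = jssc.socketTextStream("localhost", 9999);

// Repartitioning each micro-batch RDD to 8 partitions means up to 8
// tasks (threads across the cluster) can run the subsequent map in
// parallel, much like raising a bolt's parallelism hint in Storm.
JavaDStream<String> analyzed = tweets
        .repartition(8)
        .map(t -> expensiveSentiment(t));
```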
Re: Creating topology in spark streaming
Hi,

I agree with Evo: Spark works at a different abstraction level than Storm, and there is no direct translation from Storm topologies to Spark Streaming jobs. Something remotely close is the notion of lineage of DStreams or RDDs, which is similar to the logical plan of an engine like Apache Pig. Here https://github.com/JerryLead/SparkInternals/blob/master/pdf/2-JobLogicalPlan.pdf is a third-party diagram of a Spark logical plan. I would suggest reading the book Learning Spark https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/foreword01.html for more on this. In general, though, I think Storm's abstraction level is closer to MapReduce's, while Spark's is closer to Pig's, so the correspondence between Storm and Spark notions cannot be perfect.

Greetings,

Juan
Re: Creating topology in spark streaming
Hi,

You can use the method repartition from DStream (for the Scala API) or JavaDStream (for the Java API):

def repartition(numPartitions: Int): DStream[T]
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html
Return a new DStream with an increased or decreased level of parallelism. Each RDD in the returned DStream has exactly numPartitions partitions.

I think the post http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/ on Spark Streaming integration gives a very interesting review of the subject, although its Kafka integration is not up to date with the improvements described in https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Hope that helps.

Greetings,

Juan
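As a rough sketch of how the repartition call combines with the receiver-based Kafka API that the linked tutorial covers (the ZooKeeper address, consumer group, and topic name below are placeholders, and `jssc` is assumed to exist):

```java
// Illustrative fragment -- assumes `jssc` (JavaStreamingContext) and a
// reachable Kafka/ZooKeeper setup at the placeholder addresses.
Map<String, Integer> topics = new HashMap<>();
topics.put("tweets", 1); // one receiver thread for this topic

JavaPairReceiverInputDStream<String, String> kafkaStream =
        KafkaUtils.createStream(jssc, "zkhost:2181", "my-group", topics);

// A single receiver lands each batch on few partitions; repartitioning
// spreads the batch RDDs across the cluster before heavy work begins.
JavaDStream<String> spread = kafkaStream
        .map(kv -> kv._2())
        .repartition(16);
```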
Re: Creating topology in spark streaming
Thanks a lot Juan, that was a great post. One more thing, if you can: is there any demo/blog explaining how to configure or create topologies of different types, i.e. how we can decide the pipelining model in Spark, as is done in Storm for https://storm.apache.org/documentation/images/topology.png ?

--
Thanks & Regards,
Anshu Shukla
Indian Institute of Sciences
Re: Creating topology in spark streaming
But the main problem is how to increase the level of parallelism for one particular bolt's logic. Suppose I want this type of topology: https://storm.apache.org/documentation/images/topology.png. How can we manage that?

--
Thanks & Regards,
Anshu Shukla
Indian Institute of Sciences
Re: Creating topology in spark streaming
Every transformation on a DStream will create another DStream. You may want to take a look at foreachRDD. Also, kindly share your code so people can help better.

On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote:

Please help, guys. Even after going through all the given examples, I have not understood how to pass DStreams from one bolt/logic to another (without writing them to HDFS etc.), just like the emit function in Storm. Suppose I have a topology with 3 bolts:

*BOLT1 (parse the tweets and emit tweets matching the given hashtags) => BOLT2 (complex sentiment-analysis logic over the tweets) => BOLT3 (submit tweets to the SQL database using Spark SQL)*

Since the sentiment analysis will take most of the time, we have to increase its level of parallelism to tune latency. How do we increase that level of parallelism when the shape of the topology is not clear?

--
Thanks & Regards,
Anshu Shukla
Indian Institute of Sciences
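The three-bolt topology asked about above can be approximated by chaining DStream transformations, with the output action inside foreachRDD. A hedged sketch of one way to do it (again assuming an existing `jssc`, with `sentiment` and `saveToDatabase` as hypothetical user functions):

```java
// "BOLT1": keep only tweets carrying the wanted hashtag.
JavaDStream<String> raw = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> filtered = raw.filter(t -> t.contains("#spark"));

// "BOLT2": widen parallelism before the expensive stage, then score.
JavaDStream<String> scored = filtered
        .repartition(8)
        .map(t -> t + "," + sentiment(t));

// "BOLT3": side effects (e.g. a SQL write) belong inside foreachRDD.
// Data flows stage to stage through the DStream lineage -- no HDFS
// round-trip is needed between "bolts".
scored.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> saveToDatabase(partition));
});
```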
RE: Creating topology in spark streaming
The "abstraction level" of Storm, or shall we call it the architecture, is effectively Pipelines of Nodes/Agents. Pipelines are one of the standard parallel programming patterns, usable on multicore CPUs as well as on distributed systems; the Storm folks simply implemented the pattern as a reusable framework for distributed systems and offered it for free. Effectively you have a set of independent Agents chained in a pipeline, where the output of the previous Agent feeds into the input of the next Agent.

Spark Streaming (which is essentially batch Spark with some optimizations for streaming), on the other hand, is more like a MapReduce framework, where you always have a central job/task manager scheduling and submitting tasks to remote distributed nodes, collecting the results/statuses, and then scheduling and sending more tasks, and so on. "MapReduce" is simply another parallel programming pattern, known as Data Parallelism or data-parallel programming, although you can also have Data Parallelism without a central scheduler.
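The pipeline-of-agents pattern described above can be shown without any framework at all. The sketch below wires three worker threads together with blocking queues, mirroring the parse -> analyze -> store chain from earlier in the thread; the class name, the "POISON" sentinel, and the toy sentiment rule are all invented for illustration.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the "pipeline of agents" pattern: each stage is an
// independent worker thread, and the output queue of one stage is the
// input queue of the next -- analogous to a chain of Storm bolts.
public class PipelineDemo {
    private static final String POISON = "__END__"; // end-of-stream marker

    public static List<String> runPipeline(List<String> tweets) {
        BlockingQueue<String> q1 = new LinkedBlockingQueue<>(); // parser -> analyzer
        BlockingQueue<String> q2 = new LinkedBlockingQueue<>(); // analyzer -> sink
        List<String> sink = Collections.synchronizedList(new ArrayList<>());

        // Stage 1: "parse" -- keep only tweets containing a hashtag.
        Thread parser = new Thread(() -> {
            try {
                for (String t : tweets) if (t.contains("#")) q1.put(t);
                q1.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 2: "analyze" -- tag each tweet with a toy sentiment label.
        Thread analyzer = new Thread(() -> {
            try {
                for (String t = q1.take(); !t.equals(POISON); t = q1.take())
                    q2.put(t + (t.contains("good") ? " [POS]" : " [NEG]"));
                q2.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: "store" -- collect results (stand-in for a DB write).
        Thread store = new Thread(() -> {
            try {
                for (String t = q2.take(); !t.equals(POISON); t = q2.take())
                    sink.add(t);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        parser.start(); analyzer.start(); store.start();
        try {
            parser.join(); analyzer.join(); store.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline(
            Arrays.asList("#spark is good", "no tags here", "#storm is hard")));
    }
}
```

Because each stage is a single thread reading from one queue, ordering is preserved end to end; Storm generalizes exactly this shape by running many parallel instances of each stage across machines, which is what the parallelism hint controls.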