RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
What is called a Bolt in Storm is essentially a combination of a
Transformation/Action and a DStream RDD in Spark. So, to achieve higher
parallelism for a specific Transformation/Action on a specific DStream RDD,
simply repartition the DStream to the required number of partitions, which
directly determines the corresponding number of threads.
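For instance, the three-bolt tweet topology discussed in this thread could be approximated as chained DStream transformations, with a repartition just before the expensive stage. The following is only a sketch: it requires PySpark, and the source, port, partition counts and the trivial sentiment function are illustrative placeholders, not anything from the thread.

```python
# Sketch only: a Storm-like BOLT1 -> BOLT2 -> BOLT3 flow expressed as
# chained DStream transformations. Requires PySpark; the socket source,
# port and partition counts below are hypothetical placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[8]", "BoltPipelineSketch")
ssc = StreamingContext(sc, batchDuration=2)

tweets = ssc.socketTextStream("localhost", 9999)

# "BOLT1": parse and filter tweets by a hashtag.
parsed = tweets.filter(lambda t: "#spark" in t.lower())

# "BOLT2" is the expensive stage, so give it more partitions
# (more partitions => more parallel tasks per batch).
def sentiment(tweet):  # stand-in for a real sentiment model
    return tweet, ("positive" if "good" in tweet else "neutral")

scored = parsed.repartition(16).map(sentiment)

scored.pprint()  # "BOLT3" (e.g. a database write) would replace this
# ssc.start(); ssc.awaitTermination()
```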

 



Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi,

I agree with Evo, Spark works at a different abstraction level than Storm,
and there is no direct translation from Storm topologies to Spark
Streaming jobs. I think something remotely close is the notion of lineage
of DStreams or RDDs, which is similar to the logical plan of an engine like
Apache Pig. Here
https://github.com/JerryLead/SparkInternals/blob/master/pdf/2-JobLogicalPlan.pdf
is a diagram of a Spark logical plan by a third party. I would suggest
reading the book Learning Spark
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/foreword01.html
for more on this. But in general I think that Storm's abstraction level is
closer to MapReduce's, and Spark's is closer to Pig's, so the
correspondence between Storm and Spark notions cannot be perfect.

Greetings,

Juan







Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi,

You can use the method repartition from DStream (for the Scala API) or
JavaDStream (for the Java API):

def repartition(numPartitions: Int): DStream[T]
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html

Return a new DStream with an increased or decreased level of parallelism.
Each RDD in the returned DStream has exactly numPartitions partitions.
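As a toy illustration of what "repartition to N partitions" means for one batch's records, here is a plain-Python sketch that needs no Spark at all. Note this uses simple round-robin assignment, not Spark's actual partitioner; it only shows that more partitions means more independent buckets of work that can run in parallel.

```python
# Illustration only: model one RDD's records as a Python list and split
# them into num_partitions buckets, one parallel task per bucket.
# (Round-robin here; Spark's real repartition shuffles by other means.)
def repartition(records, num_partitions):
    """Assign records round-robin into num_partitions buckets."""
    buckets = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        buckets[i % num_partitions].append(rec)
    return buckets

batch = [f"tweet-{i}" for i in range(10)]
parts = repartition(batch, 4)
# 4 buckets; e.g. parts[0] holds tweet-0, tweet-4, tweet-8
```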

I think the post
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
on Spark Streaming integration gives a very interesting review of the
subject, although its Kafka integration section is not up to date with
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Hope that helps.

Greetings,

Juan




Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
Thanks a lot Juan,

That was a great post. One more thing, if you can: is there any demo/blog
explaining how to configure or create a topology of different types, i.e.
how we can decide the pipelining model in Spark, as is done in Storm for
https://storm.apache.org/documentation/images/topology.png ?



-- 
Thanks & Regards,
Anshu Shukla


Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
But the main problem is how to increase the level of parallelism for any
particular bolt's logic.

Suppose I want this type of topology:

https://storm.apache.org/documentation/images/topology.png

How can we manage it?


-- 
Thanks & Regards,
Anshu Shukla


Re: Creating topology in spark streaming

2015-05-06 Thread ayan guha
Every transformation on a DStream will create another DStream. You may want
to take a look at foreachRDD. Also, kindly share your code so people can
help better.
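A hedged sketch of what that chaining, with foreachRDD as the terminal sink, could look like in PySpark. The socket source, JDBC URL and table name below are hypothetical placeholders, not anything from the thread.

```python
# Sketch: chained DStream transformations with foreachRDD as the output
# ("BOLT3") stage. Requires PySpark; connection details are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[4]", "ForeachRDDSketch")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, batchDuration=2)

# Each transformation yields a new DStream.
scored = ssc.socketTextStream("localhost", 9999).map(lambda t: (t, len(t)))

def save_batch(time, rdd):
    # Runs once per batch interval, on each batch's RDD.
    if not rdd.isEmpty():
        df = sqlc.createDataFrame(rdd, ["tweet", "score"])
        df.write.jdbc("jdbc:postgresql://db:5432/tweets",
                      table="sentiments", mode="append")

scored.foreachRDD(save_batch)
# ssc.start(); ssc.awaitTermination()
```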
On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote:

 Please help, guys. Even after going through all the given examples, I have
 not understood how to pass the DStreams from one bolt/logic to another
 (without writing them to HDFS etc.), just like the emit function in Storm.
 Suppose I have a topology with 3 bolts (say):

 *BOLT1 (parse the tweets and emit tweets matching given hashtags) =>
 BOLT2 (complex logic for sentiment analysis over the tweets) =>
 BOLT3 (submit tweets to the SQL database using Spark SQL)*


 Now, since the sentiment analysis will take most of the time, we have to
 increase its level of parallelism to tune latency. How to increase the
 level of parallelism, since the logic of the topology is not clear?

 --
 Thanks & Regards,
 Anshu Shukla
 Indian Institute of Science



RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
The “abstraction level” of Storm, or shall we call it its architecture, is
effectively Pipelines of Nodes/Agents. Pipelines are one of the standard
parallel programming patterns, usable on multicore CPUs as well as in
distributed systems; the chaps from Storm simply implemented the pattern as
a reusable framework for distributed systems and offered it for free.
Effectively, you have a set of independent Agents chained in a pipeline,
where the output of each Agent feeds into the input of the next Agent.

Spark Streaming (which is essentially batch Spark, with some optimizations
for streaming), on the other hand, is more like a MapReduce framework,
where you always have a central Job/Task Manager scheduling and submitting
tasks to remote distributed nodes, collecting the results/statuses, then
scheduling and sending more tasks, and so on.

“MapReduce” is simply another parallel programming pattern, known as Data
Parallelism or Data Parallel Programming, although you can also have Data
Parallelism without a central scheduler.
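The pipeline-of-agents pattern described above can be shown in miniature with plain Python threads and queues. This is a toy, single-machine illustration of the pattern only, not how Storm or Spark are implemented; the two stages (lowercasing, then length-scoring) are stand-ins for real bolt logic.

```python
# Toy "pipeline of agents": independent stages connected by queues,
# each stage running in its own thread, output of one feeding the next.
import queue
import threading

SENTINEL = object()  # marks end-of-stream through the pipeline

def stage(fn, q_in, q_out):
    """One agent: apply fn to each item, forward the sentinel on shutdown."""
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)
            break
        q_out.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
# Stage 1 "parses" (lowercases); stage 2 "scores" (tweet, length).
t1 = threading.Thread(target=stage, args=(str.lower, q1, q2))
t2 = threading.Thread(target=stage, args=(lambda s: (s, len(s)), q2, q3))
t1.start(); t2.start()

for tweet in ["Hello SPARK", "Hello STORM"]:
    q1.put(tweet)
q1.put(SENTINEL)

results = []
while True:
    item = q3.get()
    if item is SENTINEL:
        break
    results.append(item)
t1.join(); t2.join()
```

To raise the parallelism of a slow stage in this pattern you would run several threads for that stage reading from the same input queue, which is the analogue of Storm's parallelism hint or Spark's repartitioning.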

 
