Re: spark streaming kafka best practices?

2014-12-17 Thread Gerard Maas
Patrick,

I was wondering why one would choose rdd.map over rdd.foreach to execute
a side-effecting function on an RDD.

-kr, Gerard.

On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote:

 The second choice is better. Once you call collect() you are pulling
 all of the data onto a single node; you want to do most of the
 processing in parallel on the cluster, which is what map() will do.
 Ideally you'd try to summarize the data or reduce it before calling
 collect().

 On Fri, Dec 5, 2014 at 5:26 AM, david david...@free.fr wrote:
  hi,
 
   What is the best way to process a batch window in Spark Streaming:
 
  kafkaStream.foreachRDD(rdd => {
    rdd.collect().foreach(event => {
      // process the event
      process(event)
    })
  })
 
 
  Or
 
  kafkaStream.foreachRDD(rdd => {
    rdd.map(event => {
      // process the event
      process(event)
    }).collect()
  })
 
 
  thanks
 
 
 




Re: spark streaming kafka best practices?

2014-12-17 Thread Patrick Wendell
Foreach is slightly more efficient because Spark doesn't bother to try
to collect results from each task, since it's understood there will be
no return value. I think the difference is very marginal though - it's
mostly stylistic: typically you use foreach for something that is
intended to produce a side effect and map for something that will
return a new dataset.
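
For example, a quick sketch of the two idioms (process() here stands in
for whatever per-record function you have; only the method names come
from the RDD API):

  // side effect only: runs on the executors, nothing is shipped back
  rdd.foreach(event => process(event))

  // transformation: builds a new RDD of results
  val sizes = rdd.map(event => event.toString.length)

  // summarize on the cluster before bringing anything to the driver
  val total = sizes.reduce(_ + _)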

On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote:
 Patrick,

 I was wondering why one would choose rdd.map over rdd.foreach to execute a
 side-effecting function on an RDD.

 -kr, Gerard.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming kafka best practices?

2014-12-17 Thread Tobias Pfeiffer
Hi,

On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell pwend...@gmail.com wrote:

 On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote:
  I was wondering why one would choose rdd.map over rdd.foreach to
  execute a side-effecting function on an RDD.


Personally, I like to get the count of processed items, so I do something
like
  rdd.map(item => processItem(item)).count()
instead of
  rdd.foreach(item => processItem(item))
but I would be happy to learn about a better way.
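
One alternative might be an accumulator, so that foreach stays
side-effect-only while the driver still gets a count - an untested
sketch, assuming sc is the SparkContext:

  val processed = sc.accumulator(0)
  rdd.foreach { item =>
    processItem(item)
    processed += 1  // note: retried tasks can over-count
  }
  println("processed: " + processed.value)

though I'm not sure that's actually better than map().count().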

Tobias


spark streaming kafka best practices?

2014-12-05 Thread david
hi,

  What is the best way to process a batch window in Spark Streaming:

kafkaStream.foreachRDD(rdd => {
  rdd.collect().foreach(event => {
    // process the event
    process(event)
  })
})


Or 

kafkaStream.foreachRDD(rdd => {
  rdd.map(event => {
    // process the event
    process(event)
  }).collect()
})


thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafa-best-practices-tp20470.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org