Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
As of now, you can feed Spark Streaming from both Kafka and Flume.
Currently, though, there is no API to write data back to either of the two
directly.
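For the input side, here is a minimal sketch of consuming a Kafka topic with the receiver-based API available in Spark at the time; the ZooKeeper quorum, group id, and topic name below are placeholders to adapt to your setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaInputSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-input-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder connection details -- replace with your own.
    val zkQuorum = "localhost:2181"
    val groupId  = "spark-consumer"
    val topics   = Map("events" -> 1) // topic -> number of receiver threads

    // Receiver-based Kafka stream; the tuple's second element is the payload.
    val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics).map(_._2)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This needs the spark-streaming-kafka artifact on the classpath and a running Kafka/ZooKeeper, so it is a sketch rather than something runnable in isolation.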

I sent a PR that should eventually add something like this:
https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala
which would allow Spark Streaming to write back to Kafka. It will likely
be reviewed and committed after 1.2.

I would consider writing something similar to push data to Flume as well,
if there is a sufficient use case for it. I have seen people talk about
writing back to Kafka quite a bit, hence the above patch.

Which one is better depends on your use case, your existing infrastructure,
and your preference. Both would work as is. Writing back to Flume usually
makes sense when you want to write to HDFS/HBase/Solr etc. -- though you
could also write to those directly from Spark Streaming (of course, there
are benefits to going through Flume, like the additional buffering it
provides), so it is still possible to do so from Spark Streaming itself.

But for Kafka, the usual use case is a variety of custom applications
reading the same data -- for which it makes a whole lot of sense to write
back to Kafka. An example is sanitizing incoming data in Spark Streaming
(from Flume, Kafka, or something else) and making it available to a variety
of apps via Kafka.
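Since that patch was not yet merged, a common hand-rolled pattern for writing a DStream back to Kafka is to create a producer per partition inside foreachRDD; the broker list and topic below are placeholders, and this is one possible sketch rather than the approach in the linked PR:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

object KafkaWriteBackSketch {
  // Write each element of a DStream to a Kafka topic. The producer is
  // created per partition, on the executor: producers are not
  // serializable, so they cannot be built on the driver and shipped out.
  def writeToKafka(stream: DStream[String], brokers: String, topic: String): Unit = {
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", brokers)
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        try {
          records.foreach(r => producer.send(new ProducerRecord(topic, r)))
        } finally {
          producer.close() // flushes buffered sends before releasing resources
        }
      }
    }
  }
}
```

Creating one producer per partition per batch is simple but adds connection overhead; a pooled or lazily-initialized singleton producer per executor is a common refinement.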

Hope this helps!

Hari


On Wed, Nov 19, 2014 at 8:10 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:

 Hi,

 I'm starting with Spark and I'm just trying to understand: if I want to
 use Spark Streaming, should I feed it with Flume or Kafka? I think
 there's no official sink from Flume to Spark Streaming, and it seems
 that Kafka fits better since it gives you reliability.

 Could someone give a good scenario for each alternative? When would it
 make sense to use Kafka and when Flume for Spark Streaming?

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
Btw, if you want to write to Spark Streaming from Flume -- there is a sink
(it is part of Spark, not Flume). See Approach 2 here:
http://spark.apache.org/docs/latest/streaming-flume-integration.html
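Approach 2 in that guide is the pull-based sink: the Flume agent runs a custom Spark sink that buffers events, and Spark Streaming pulls from it. A minimal sketch of the receiving side follows; the host and port are placeholders and must match the SparkSink configured in the Flume agent:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flume-polling-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Host/port must match the SparkSink section of the Flume agent config.
    val stream = FlumeUtils.createPollingStream(
      ssc, "flume-agent-host", 9988, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Flume event bodies arrive as byte buffers.
    stream.map(e => new String(e.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This requires the spark-streaming-flume artifact and the matching spark-streaming-flume-sink jar on the Flume agent, so again it is a sketch, not a standalone program.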





Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you for your answer. I don't know if I typed the question
correctly, but your answer helps me.

I'm going to ask the question again to make sure I was understood.

I have this topology:

DataSource1, ..., DataSourceN --> Kafka --> SparkStreaming --> HDFS
                                  Kafka --> HDFS (raw data)

DataSource1, ..., DataSourceN --> Flume --> SparkStreaming --> HDFS
                                  Flume --> HDFS (raw data)


All data will be processed and stored in HDFS, as both raw and
processed data. I don't know if it makes sense to use Kafka in this
case if the data are just going to HDFS. I guess the Flume-to-Spark
sink makes more sense for feeding Spark Streaming with a real-time
flow of data; it doesn't make much sense to have Spark Streaming
read the data from HDFS.
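Since both the raw and the processed data end up in HDFS in this topology, one option is to have the Spark Streaming job write both itself. A sketch of that idea, with placeholder paths, connection details, and a stand-in "sanitize" step:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RawAndProcessedSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("raw-and-processed-sketch")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Placeholder Kafka connection details.
    val raw = KafkaUtils.createStream(
      ssc, "localhost:2181", "spark-consumer", Map("events" -> 1)).map(_._2)

    // Archive the raw stream as-is, one output directory per batch.
    raw.saveAsTextFiles("hdfs:///data/raw/events")

    // Placeholder transformation standing in for real sanitization.
    val processed = raw.map(_.trim.toLowerCase)
    processed.saveAsTextFiles("hdfs:///data/processed/events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note this trades away the extra buffering and delivery guarantees that keeping Flume (or Kafka plus a separate HDFS consumer) in the raw-data path would give you.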




Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Akhil Das
You can also look at Amazon's Kinesis if you don't want to handle the
pain of maintaining Kafka/Flume infrastructure.

Thanks
Best Regards





Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you, but I'm only considering free options.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org