Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you, but I'm just considering free options.


2014-11-20 7:53 GMT+01:00 Akhil Das :
> You can also look at Amazon's Kinesis if you don't want to deal with the
> pain of maintaining Kafka/Flume infrastructure.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Akhil Das
You can also look at Amazon's Kinesis if you don't want to deal with the
pain of maintaining Kafka/Flume infrastructure.

Thanks
Best Regards



Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you for your answer. I don't know if I phrased the question
correctly, but your answer helps me.

Let me restate the question to check that you understood me.

I have this topology:

DataSource1, ..., DataSourceN --> Kafka --> SparkStreaming --> HDFS
                                  Kafka --> HDFS (raw data)

DataSource1, ..., DataSourceN --> Flume --> SparkStreaming --> HDFS
                                  Flume --> HDFS (raw data)

All data are going to be processed and written to HDFS as both raw and
processed data. I don't know if it makes sense to use Kafka in this
case if the data are just going to HDFS. I guess the Flume Spark Sink
makes more sense for feeding Spark Streaming with a real-time flow of
data; it doesn't make much sense to have Spark Streaming read the data
from HDFS.
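For what it's worth, the Kafka branch of that topology can be sketched
roughly like this, assuming the receiver-based KafkaUtils.createStream API
from Spark Streaming 1.1/1.2 (the ZooKeeper quorum, group id, topic name,
and HDFS paths are illustrative, not from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaToHdfs"), Seconds(10))

    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> #threads)
    val lines = KafkaUtils
      .createStream(ssc, "zk1:2181", "spark-group", Map("events" -> 1))
      .map(_._2) // keep the message body, drop the key

    // Raw data straight to HDFS; processed data after a transformation.
    lines.saveAsTextFiles("hdfs:///data/raw/events")
    lines.map(_.trim.toLowerCase).saveAsTextFiles("hdfs:///data/processed/events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

That said, if HDFS is the only consumer of the raw copy, a plain Flume HDFS
sink (or a Kafka-to-HDFS consumer) can write it without involving Spark at all.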




Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you for your answer. I don't know if I phrased the question
correctly, but your answer helps me.

Let me restate the question to check that you understood me.

I have this topology:

DataSource1, ..., DataSourceN --> Kafka --> SparkStreaming --> HDFS

DataSource1, ..., DataSourceN --> Flume --> SparkStreaming --> HDFS

All data are going to be processed and written to HDFS as both raw and
processed data.





Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
Btw, if you want to feed Spark Streaming from Flume -- there is a sink
(it is part of Spark, not Flume). See Approach 2 here:
http://spark.apache.org/docs/latest/streaming-flume-integration.html
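A rough sketch of that polling approach, assuming FlumeUtils.createPollingStream
from the spark-streaming-flume artifact (the hostname and port are whatever the
Flume agent's SparkSink is configured with; the names here are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollingExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("FlumePollingExample"), Seconds(5))

    // Approach 2: Spark pulls events from the custom SparkSink running
    // inside the Flume agent, instead of Flume pushing into a receiver.
    val events = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)

    // SparkFlumeEvent wraps an Avro event; the payload is in the body.
    events.map(e => new String(e.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```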





Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
As of now, you can feed Spark Streaming from both Kafka and Flume.
Currently, though, there is no API to write data back to either of the two
directly.

I sent a PR which should eventually add something like this:
https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala
that would allow Spark Streaming to write back to Kafka. This will likely
be reviewed and committed after 1.2.

I would consider writing something similar to push data to Flume as well,
if there is a sufficient use-case for it. I have seen people talk about
writing back to Kafka quite a bit -- hence the above patch.

Which one is better is up to your use-case, existing infrastructure, and
preference. Both would work as is. Writing back to Flume would usually be
for writing to HDFS/HBase/Solr etc. -- which you could also do directly
from Spark Streaming (of course, there are benefits to writing back
through Flume, such as the additional buffering it provides), but it is
still possible to do so from Spark Streaming itself.

But for Kafka, the usual use-case is a variety of custom applications
reading the same data -- for which it makes a whole lot of sense to write
back to Kafka. An example is to sanitize incoming data in Spark Streaming
(from Flume or Kafka or something else) and make it available for a variety
of apps via Kafka.
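Absent an official output API, the usual workaround is to open a producer
per partition inside foreachRDD. A minimal sketch using the Kafka 0.8-era
Scala producer (the broker address and topic are illustrative, and this is
the common hand-rolled pattern, not the API from the patch above):

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

object WriteBackToKafka {
  // Push a sanitized stream back to a Kafka topic for downstream apps.
  def publish(cleaned: DStream[String], topic: String): Unit = {
    cleaned.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition: producers are not serializable,
        // so they must be created on the executors, not the driver.
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))
        records.foreach(r => producer.send(new KeyedMessage[String, String](topic, r)))
        producer.close()
      }
    }
  }
}
```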

Hope this helps!

Hari


On Wed, Nov 19, 2014 at 8:10 AM, Guillermo Ortiz 
wrote:

> Hi,
>
> I'm starting with Spark and I'm just trying to understand: if I want to
> use Spark Streaming, should I feed it from Flume or Kafka? I think
> there's no official Flume sink for Spark Streaming, and it seems
> that Kafka fits better since it gives you reliability.
>
> Could someone give a good scenario for each alternative? When would it
> make sense to use Kafka and when Flume for Spark Streaming?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>