Re: Spark Streaming with Flume or Kafka?
Thank you, but I'm just considering free options.

2014-11-20 7:53 GMT+01:00 Akhil Das :
> You can also look at Amazon Kinesis if you don't want to handle the pain
> of maintaining Kafka/Flume infra.
> [...]
Re: Spark Streaming with Flume or Kafka?
You can also look at Amazon Kinesis if you don't want to handle the pain of
maintaining Kafka/Flume infra.

Thanks
Best Regards

On Thu, Nov 20, 2014 at 3:32 AM, Guillermo Ortiz wrote:
> Thank you for your answer; I don't know if I typed the question
> correctly, but your answer helps me.
> [...]
Re: Spark Streaming with Flume or Kafka?
Thank you for your answer; I don't know if I typed the question correctly,
but your answer helps me.

Let me ask the question again to make sure I was understood. I have this
topology:

DataSource1, ..., DataSourceN --> Kafka --> SparkStreaming --> HDFS
                                  Kafka --> HDFS (raw data)

DataSource1, ..., DataSourceN --> Flume --> SparkStreaming --> HDFS
                                  Flume --> HDFS (raw data)

All the data are going to HDFS, as both raw and processed data. I don't know
if it makes sense to use Kafka in this case if the data are just going to
HDFS. I guess the Flume-to-Spark sink makes more sense for feeding Spark
Streaming with a real-time flow of data; it doesn't make much sense to have
Spark Streaming read the data from HDFS.

2014-11-19 22:55 GMT+01:00 Guillermo Ortiz :
> [...]
Re: Spark Streaming with Flume or Kafka?
Thank you for your answer; I don't know if I typed the question correctly,
but your answer helps me.

Let me ask the question again to make sure I was understood. I have this
topology:

DataSource1, ..., DataSourceN --> Kafka --> SparkStreaming --> HDFS

DataSource1, ..., DataSourceN --> Flume --> SparkStreaming --> HDFS

All data are going to be pro

2014-11-19 21:50 GMT+01:00 Hari Shreedharan :
> [...]
Re: Spark Streaming with Flume or Kafka?
Btw, if you want to write to Spark Streaming from Flume -- there is a sink
(it is a part of Spark, not Flume). See Approach 2 here:
http://spark.apache.org/docs/latest/streaming-flume-integration.html

On Wed, Nov 19, 2014 at 12:41 PM, Hari Shreedharan
<hshreedha...@cloudera.com> wrote:
> [...]
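For reference, the "Approach 2" (pull-based) receiver from that page looks
roughly like the sketch below. The hostname and port are placeholders and
must match the `org.apache.spark.streaming.flume.sink.SparkSink` configured
in the Flume agent; without that agent running this does nothing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollingSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("FlumePollingSketch"), Seconds(10))

    // Pull events from the SparkSink running inside the Flume agent
    // (placeholder host/port; must match the agent's sink configuration).
    val events = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)

    // Each record is a SparkFlumeEvent wrapping an Avro event body.
    events.map(e => new String(e.event.getBody.array()))
          .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The pull model lets Spark drive the rate at which it drains the sink, which
is the main reason the docs recommend Approach 2 over the push-based one.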
Re: Spark Streaming with Flume or Kafka?
As of now, you can feed Spark Streaming from both Kafka and Flume.
Currently, though, there is no API to write data back to either of the two
directly.

I sent a PR which should eventually add something like this:
https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala
that would allow Spark Streaming to write back to Kafka. This will likely
be reviewed and committed after 1.2.

I would consider writing something similar to push data to Flume as well,
if there is a sufficient use case for it. I have seen people talk about
writing back to Kafka quite a bit - hence the above patch.

Which one is better is up to your use case, existing infrastructure, and
preference. Both would work as is. Writing back to Flume would usually be
for when you want to write to HDFS/HBase/Solr etc., which you could also do
directly from Spark Streaming itself (of course, there are benefits to
writing back through Flume, such as the additional buffering it gives you).

But for Kafka, the usual use case is a variety of custom applications
reading the same data - for which it makes a whole lot of sense to write
back to Kafka. An example is to sanitize incoming data in Spark Streaming
(from Flume or Kafka or something else) and make it available to a variety
of apps via Kafka.

Hope this helps!

Hari


On Wed, Nov 19, 2014 at 8:10 AM, Guillermo Ortiz wrote:
> Hi,
>
> I'm starting with Spark and I'm just trying to understand: if I want to
> use Spark Streaming, should I feed it with Flume or Kafka? I think
> there's no official sink from Flume to Spark Streaming, and it seems
> that Kafka fits better since it gives you reliability.
>
> Could someone give a good scenario for each alternative? When would it
> make sense to use Kafka, and when Flume, for Spark Streaming?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
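Until a KafkaOutputWriter like the one in that PR is merged, the usual
workaround Hari alludes to is to open a producer inside `foreachPartition`
and push records out from there. A rough sketch against the era's Kafka 0.8
Scala producer API; the helper name, broker list, and topic are all made-up
placeholders, not anything from Spark or the PR.

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

object KafkaWriteBackSketch {
  // Hypothetical helper: push every record of a DStream to a Kafka topic,
  // e.g. writeToKafka(cleaned, "broker1:9092,broker2:9092", "sanitized").
  def writeToKafka(stream: DStream[String],
                   brokers: String,
                   topic: String): Unit = {
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition per batch: fine for a sketch; a
        // pooled or lazily shared producer would be better in production.
        val props = new Properties()
        props.put("metadata.broker.list", brokers)
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))
        try {
          records.foreach { r =>
            producer.send(new KeyedMessage[String, String](topic, r))
          }
        } finally {
          producer.close()
        }
      }
    }
  }
}
```

This is exactly the sanitize-then-republish pattern from the message above:
clean the data in Spark Streaming, then make it available to other
consumers by writing it back to a Kafka topic.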