In general (and I am prototyping), I have a better idea :)

- Consume Kafka in Spark from topic-A
- Transform the data in Spark (normalize, enrich, etc.)
- Feed it back to Kafka (into a different topic-B)
- Have Flume->HDFS (for M/R, Impala, Spark batch), Spark Streaming, or any other compute framework subscribe to topic-B
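A minimal sketch of that topic-A -> transform -> topic-B flow, with plain in-memory queues standing in for the Kafka topics (the `transform` step and the record shape are placeholder assumptions, not real Kafka API calls):

```python
from queue import Queue

def transform(record):
    """Placeholder normalize/enrich step (hypothetical logic for illustration)."""
    return {"value": record.strip().lower(), "enriched": True}

def run_pipeline(topic_a, topic_b):
    """Drain topic-A, transform each record, publish the result to topic-B."""
    while not topic_a.empty():
        record = topic_a.get()
        topic_b.put(transform(record))

topic_a = Queue()  # stands in for Kafka topic-A (raw input)
topic_b = Queue()  # stands in for Kafka topic-B; HDFS/Impala/batch jobs would subscribe here
for raw in ["  Hello ", "WORLD"]:
    topic_a.put(raw)
run_pipeline(topic_a, topic_b)
```

In a real deployment the queues would be Kafka topics and the transform would run inside a Spark job, but the decoupling idea is the same: downstream consumers only ever see the cleaned topic-B stream.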
On Mon, Aug 11, 2014 at 5:57 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>>
>> We intend to apply other operations on the data later in the same Spark
>> context, but our first step is to archive it.
>>
>> Our goal is something like this:
>>
>> Step 1: consume Kafka
>> Step 2: archive to HDFS AND send to step 3
>> Step 3: transform data
>> Step 4: save transformed data to HDFS as input for M/R
>
> I see. Well, I think Spark Streaming may be well suited for that purpose.
>
>> To us it looks like a great flaw if, in streaming mode, Spark Streaming
>> cannot slow down its consumption depending on the available resources.
>
> On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>>
>> I think the kind of self-regulating system you describe would be too
>> difficult to implement and probably unreliable (even more so given that
>> we have multiple slaves).
>
> Isn't "slow down its consumption depending on the available resources" a
> "self-regulating system"? I don't see how you can adapt to available
> resources without measuring your execution time and then changing how
> much you consume. Did you have any particular form of adaptation in mind?
>
> Tobias

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
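The self-regulating consumption Tobias describes (measure how long a batch took, then adjust how much you pull next) can be sketched as a simple feedback loop. This is an illustrative simulation only, not Spark code; the target time and bounds are arbitrary placeholders:

```python
import time

def adaptive_batch_sizes(process, batches, target_seconds=1.0,
                         initial_size=100, min_size=10, max_size=10000):
    """Self-regulating consumer loop: after each batch, scale the next
    batch size by (target time / measured time), clamped to sane bounds.
    If a batch finishes early we consume more next time; if it ran long,
    we consume less."""
    size = initial_size
    sizes = []
    for _ in range(batches):
        sizes.append(size)
        start = time.perf_counter()
        process(size)  # caller-supplied "work" for `size` records
        elapsed = max(time.perf_counter() - start, 1e-9)
        size = int(min(max_size, max(min_size, size * target_seconds / elapsed)))
    return sizes

# Fake workload: processing n records takes n milliseconds, so with a
# 50 ms budget the batch size should settle near 50 records.
sizes = adaptive_batch_sizes(lambda n: time.sleep(n * 0.001),
                             batches=5, target_seconds=0.05,
                             initial_size=100)
```

The same proportional adjustment could in principle drive a Kafka receiver's per-batch fetch count, though coordinating it across multiple slaves is exactly the difficulty Gwenhael raises.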