Hi Dib,
For our use case, I want my Spark job1 to read from HDFS/cache and write to
Kafka queues. Similarly, Spark job2 should read from Kafka queues and write
to Kafka queues.
Is writing to Kafka queues from a Spark job supported in your code?
Thanks
Deb
On Jan 15, 2015 11:21 PM, Akhil Das
My code handles the Kafka Consumer part. But writing to Kafka should not be
a big challenge; you can easily do it in your driver code.
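For example, something along these lines (a minimal sketch, assuming the
Kafka 0.8 Scala producer API; the broker list, topic name, and outStream are
placeholders, not part of this consumer):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Placeholder broker list and serializer for the 0.8.x Scala producer
val props = new Properties()
props.put("metadata.broker.list", "broker1:9092,broker2:9092")
props.put("serializer.class", "kafka.serializer.StringEncoder")

// outStream stands for whatever DStream[String] your job produces
outStream.foreachRDD { rdd =>
  // collect() pulls the batch back to the driver, so this only suits
  // modest batch sizes; for large volumes, write from the executors instead
  val producer = new Producer[String, String](new ProducerConfig(props))
  rdd.collect().foreach { msg =>
    producer.send(new KeyedMessage[String, String]("output-topic", msg))
  }
  producer.close()
}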
dibyendu
Hi Dibyendu,
I am using Kafka 0.8.1.1 and Spark 1.2.0.
After modifying these versions in your pom, I rebuilt your code.
But I have not received any messages from ssc.receiverStream(new
KafkaReceiver(_props, i)).
I have found that, in your code, all the messages are retrieved correctly, but
Hi Kidong,
No, I have not tried it with Spark 1.2 yet. I will try this out and let
you know how it goes.
By the way, has there been any change to the Receiver store method in Spark
1.2?
Regards,
Dibyendu
On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote:
Hi Kidong,
Just now I tested the Low Level Consumer with Spark 1.2, and I did not see
any issue with the Receiver.Store method. It is able to fetch messages from
Kafka.
Can you cross-check the other configurations in your setup, like the Kafka
broker IP, topic name, ZK host details, consumer id, etc.?
Dib
There is a simple example
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45
which you can run after changing a few lines of configuration.
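Roughly, the wiring looks like this (just a sketch based on the
ssc.receiverStream(new KafkaReceiver(_props, i)) call quoted earlier in this
thread; the property keys below are my assumptions, so check the linked
example for the exact names this consumer expects):

val props = new java.util.Properties()
props.put("zookeeper.hosts", "zkhost1")       // assumed key for the ZK host
props.put("kafka.topic", "mytopic")           // assumed key for the topic name
props.put("kafka.consumer.id", "my-consumer") // assumed key for the consumer id

val partitions = 3 // number of partitions of the topic
val streams = (0 until partitions).map(i =>
  ssc.receiverStream(new KafkaReceiver(props, i)))
val unified = ssc.union(streams)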
Thanks
Best Regards
Hi Dibyendu,
What are your thoughts on keeping this solution (or not), considering that
Spark Streaming v1.2 will have built-in recoverability of the received data?
https://issues.apache.org/jira/browse/SPARK-1647
I'm concerned about the complexity of this solution with regard to the added
complexity
like what you mentioned is on our scheduler.
Thanks
Jerry
-Original Message-
From: RodrigoB [mailto:rodrigo.boav...@aspect.com]
Sent: Wednesday, December 3, 2014 5:44 AM
To: u...@spark.incubator.apache.org
Subject: Re: Low Level Kafka Consumer for Spark
Dibyendu,
Just to make sure I
Hi Dibyendu,
Thanks for your great work!
I'm new to Spark Streaming, so I just want to make sure I understand the
Driver failure issue correctly.
In my use case, I want to make sure that messages coming in from Kafka are
always broken into the same set of RDDs, meaning that if a set of messages
are
Hi Alon,
No, this does not guarantee that the same set of messages will come in the
same RDD. This fix just replays the messages from the last processed offset
in the same order. Again, this is just an interim fix we needed to solve our
use case. If you do not need this message replay feature, just do not
Hi Dibyendu,
I am a little confused about the need for rate limiting input from
Kafka. If the stream coming in from Kafka has a higher message/second
rate than what a Spark job can process, then it should simply build a
backlog in Spark if the RDDs are cached on disk using persist().
Right?
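For concreteness, the persist() idea above would look something like this
(kafkaStream here stands for any input DStream; just a sketch):

import org.apache.spark.storage.StorageLevel

// Spill cached blocks to disk when memory fills, rather than dropping them
kafkaStream.persist(StorageLevel.MEMORY_AND_DISK_SER)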
Thanks,
Hi Tim,
I have not tried persisting the RDD.
There is some discussion on rate limiting Spark Streaming in this
thread:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-rate-limiting-from-kafka-td8590.html
There is a Pull Request
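As a side note, Spark itself has a per-receiver throttle that may help here
(assuming a Spark 1.x setup; this is a generic Streaming knob, not something
specific to this consumer):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap each receiver at a maximum ingest rate (records per second)
val conf = new SparkConf()
  .setAppName("KafkaRateLimited")
  .set("spark.streaming.receiver.maxRate", "1000")
val ssc = new StreamingContext(conf, Seconds(2))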
Hi,
The latest changes, with Kafka message replay by manipulating the ZK offset,
seem to be working fine for us. This gives us some relief till the actual
issue is fixed in Spark 1.2.
I have some questions on how Spark processes the received data. The logic I
used is basically to pull messages from
Thanks TD. Someone already pointed out to me that /repartition(...)/ isn't
the right way. You have to do /val partedStream = stream.repartition(...)/.
Would be nice to have it fixed in the docs.
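For anyone hitting the same thing: repartition() returns a new DStream
rather than modifying the one it is called on, so the result has to be
assigned (kInMsg here stands for any input DStream):

val parted = kInMsg.repartition(512) // correct: operate on `parted` from here on
kInMsg.repartition(512)              // no-op: the repartitioned stream is discarded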
On Fri, Sep 5, 2014 at 10:44 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Some thoughts on this
Hi Tathagata,
I have managed to implement the logic in the Kafka-Spark consumer to
recover from Driver failure. This is just an interim fix till the actual
fix is done on the Spark side.
The logic is something like this:
1. When the individual Receiver starts for every Topic partition, it
writes
Hi,
Sorry for the little delay. As discussed in this thread, I have modified the
Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer)
code to have a dedicated Receiver for every Topic Partition. You can see an
example of how to create a union of these receivers
in
Just a comment on the recovery part.
Is it correct to say that currently the Spark Streaming recovery design does
not consider re-computations (upon metadata lineage recovery) that depend on
blocks of data of the received stream?
https://issues.apache.org/jira/browse/SPARK-1647
Just to illustrate a
I'm no expert. But as I understand it, yes, you create multiple streams to
consume multiple partitions in parallel. If they're all in the same
Kafka consumer group, you'll get exactly one copy of each message, so
yes, if you have 10 consumers and 3 Kafka partitions, I believe only 3
will be getting
I have this same question. Isn't there somewhere that the Kafka range
metadata can be saved? From my naive perspective, it seems like it should
be very similar to HDFS lineage. The original HDFS blocks are kept
somewhere (in the driver?) so that if an RDD partition is lost, it can be
I'd be interested to understand this mechanism as well. But this is the
error recovery part of the equation. Consuming from Kafka has two aspects -
parallelism and error recovery - and I am not sure how either works. For
error recovery, I would like to understand how:
- A failed receiver gets
Chris,
I did the DStream.repartition mentioned in the document on parallelism in
receiving, as well as set spark.default.parallelism, and it still uses only
2 nodes in my cluster. I noticed there is another email thread on the same
topic:
'this 2-node replication is mainly for failover in case the receiver dies
while data is in flight. there's still chance for data loss as there's no
write ahead log on the hot path, but this is being addressed.'
Can you comment a little on how this will be addressed? Will there be a
durable WAL?
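From what I can tell, Spark 1.2 does add an optional write ahead log for
received data; if I read the docs right, enabling it looks roughly like this
(the checkpoint path is a placeholder, and a fault-tolerant filesystem such
as HDFS is required):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Durably log received blocks before acknowledging them to the source
val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints") // placeholder path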
Good to see I am not the only one who cannot get incoming DStreams to
repartition. I tried repartition(512), but still no luck - the app
stubbornly runs only on two nodes. Now, this is 1.0.0, but looking at the
release notes for 1.0.1 and 1.0.2, I don't see anything that says this
was an issue and has
I create my DStream very simply as:
val kInMsg = KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp",
  Map("rawunstruct" -> 8))
.
.
eventually, before I operate on the DStream, I repartition it:
kInMsg.repartition(512)
Are you saying that ^^ repartition doesn't split my DStream into multiple
Ok, so I did this:
val kInStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp",
    Map("rawunstruct" -> 1)) }
val kInMsg = ssc.union(kInStreams)
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
This has improved parallelism. Earlier I would only get a Stream 0.
@bharat-
overall, i've noticed a lot of confusion about how Spark Streaming scales -
as well as how it handles failover and checkpointing, but we can discuss
that separately.
there are actually 2 dimensions to scaling here: receiving and processing.
*Receiving*
receiving can be scaled out by
Hi Dibyendu,
That would be great. One of the biggest drawbacks of Kafka utils, as well as
your implementation, is that I am unable to scale out processing. I am
relatively new to Spark and Spark Streaming - from what I read and what I
observe with my deployment is that having the RDD created on one
I agree. This issue should be fixed in Spark rather than relying on replay
of Kafka messages.
Dib
On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote:
Dibyendu,
Thanks for getting back.
I believe you are absolutely right. We were under the assumption that the
raw data was being
Hi,
As I understand it, your problem is similar to this JIRA:
https://issues.apache.org/jira/browse/SPARK-1647
The issue in this case is that Kafka cannot replay the messages, as the
offsets are already committed. Also, I think the existing KafkaUtils (the
default high-level Kafka consumer) has this issue.
Hi Bharat,
Thanks for your email. If the Kafka Reader worker process dies, it will
be replaced by a different machine, and it will start consuming from the
offset where it left off (for each partition). The same would happen even
if I had an individual Receiver for every partition.
great work, Dibyendu. looks like this would be a popular contribution.
expanding on bharat's question a bit:
what happens if you submit multiple receivers to the cluster by creating
and unioning multiple DStreams as in the kinesis example here:
Hi Dibyendu,
My colleague has taken a look at the Spark Kafka consumer GitHub repo you
provided and started experimenting.
We found that somehow, when Spark has a failure after a data checkpoint, the
expected re-computations corresponding to the metadata checkpoints are not
recovered, so we lose
I like this consumer for what it promises - better control over offsets and
recovery from failures. If I understand this right, it still uses a single
worker process to read from Kafka (one thread per partition) - is there a
way to specify multiple worker processes (on different machines) to read
Thanks Jonathan,
Yes, till non-ZK based offset management is available in Kafka, I need to
maintain the offsets in ZK. And yes, in both cases an explicit commit is
necessary. I modified the Low Level Kafka Spark Consumer a little bit to
have the Receiver spawn threads for every partition of the topic and
Hi Yan,
That is a good suggestion. I believe non-Zookeeper offset management will
be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled for
September.
https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
That should make this fairly easy to