Hi,
I have implemented a Low Level Kafka Consumer for Spark Streaming using the
Kafka Simple Consumer API. This API gives better control over Kafka
offset management and recovery from failures. As the present Spark
KafkaUtils uses the High Level Kafka Consumer API, I wanted to have a better
You can try this Kafka Spark Consumer, which I recently wrote. It uses the
Low Level Kafka Consumer:
https://github.com/dibbhatt/kafka-spark-consumer
Dibyendu
On Tue, Aug 5, 2014 at 12:52 PM, rafeeq s rafeeq.ec...@gmail.com wrote:
Hi,
I am new to Apache Spark and trying to develop a Spark
. It's great to see community effort on adding new
streams/receivers, adding a Java API for receivers was something we did
specifically to allow this :)
- Patrick
On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi,
I have implemented a Low Level
Dear All,
Recently I have written a Spark Kafka Consumer to solve this problem. We
too have seen issues with KafkaUtils, which uses the High Level Kafka Consumer,
where the consumer code has no handle on offset management.
The code below solves this problem, and it is being tested in our
Spark
Hi,
As I understand, your problem is similar to this JIRA.
https://issues.apache.org/jira/browse/SPARK-1647
The issue in this case is that Kafka cannot replay the messages, as the
offsets are already committed. I also think the existing KafkaUtils (the
default High Level Kafka Consumer) has this issue.
Hi Bharat,
Thanks for your email. If the Kafka Reader worker process dies, it will
be replaced by a different machine, and it will start consuming from the
offset where it left off (for each partition). The same thing can happen even
if I have an individual Receiver for every partition.
I agree. This issue should be fixed in Spark rather than relying on replay of
Kafka messages.
Dib
On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote:
Dibyendu,
Thanks for getting back.
I believe you are absolutely right. We were under the assumption that the
raw data was being
Hi,
Sorry for the little delay. As discussed in this thread, I have modified the
Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer )
code to have a dedicated Receiver for every Topic Partition. You can see the
example of how to create a Union of these receivers
in
, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi,
Sorry for the little delay. As discussed in this thread, I have modified the
Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer )
code to have a dedicated Receiver for every Topic Partition. You can see the
example
at a very, very slow
pace.
Regards,
Dibyendu
On Mon, Sep 8, 2014 at 12:08 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Tathagata,
I have managed to implement the logic in the Kafka-Spark consumer to
recover from Driver failure. This is just an interim fix till
Hi,
You can use this Kafka Spark Consumer.
https://github.com/dibbhatt/kafka-spark-consumer
It does exactly that. It creates parallel Receivers for every Kafka
topic partition. You can see Consumer.java under the consumer.kafka.client
package for an example of how to use it.
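As a rough sketch of how such a receiver-based consumer is wired up (the `ReceiverLauncher` entry point and the property names below follow the project's README as I recall it, so treat them as assumptions rather than the authoritative API):

```scala
import java.util.Properties

import consumer.kafka.ReceiverLauncher
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LowLevelConsumerSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-spark-consumer-sketch"), Seconds(10))

    // ZooKeeper and topic settings the consumer expects (names assumed)
    val props = new Properties()
    props.put("zookeeper.hosts", "localhost")
    props.put("zookeeper.port", "2181")
    props.put("kafka.topic", "mytopic")
    props.put("kafka.consumer.id", "my-consumer-id")

    // Parallel Receivers for the topic partitions, joined into one DStream
    val stream = ReceiverLauncher.launch(ssc, props, 3, StorageLevel.MEMORY_ONLY)

    stream.foreachRDD(rdd => println(s"received ${rdd.count()} messages"))
    ssc.start()
    ssc.awaitTermination()
  }
}
```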
There is
I agree, Gerard. Thanks for pointing this out.
Dib
On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas gerard.m...@gmail.com wrote:
This pattern works.
One note, though: use 'union' only if you need to group the data from all
RDDs into one RDD for processing (like a count distinct or a groupBy).
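This pattern can be illustrated with a generic Spark Streaming sketch (here `socketTextStream` stands in for any receiver-based stream; nothing below is specific to one consumer):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("union-sketch"), Seconds(5))

// Several receiver-based streams, e.g. one per topic partition
val streams = (1 to 3).map(_ => ssc.socketTextStream("localhost", 9999))

// Union only when downstream processing really needs all data in one RDD
val unified = ssc.union(streams)
unified.count().print()
```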
ContextCleaner?
I met a very similar issue before… but haven't gotten it resolved
Best,
--
Nan Zhu
On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:
Dear All,
Not sure if this is a false alarm, but I wanted to raise this to
understand what is happening.
I am testing
to prevent the
MemoryStore/BlockManager from dropping the block inputs? And should they be
logged at the WARN/ERROR level?
Thanks.
On Fri, Sep 12, 2014 at 4:45 PM, Dibyendu Bhattacharya [via Apache Spark
User List] wrote:
Dear all
Hi Alon,
No, this will not guarantee that the same set of messages will come in the
same RDD. This fix just replays the messages from the last processed offset
in the same order. Again, this is just an interim fix we needed to solve our
use case. If you do not need this message replay feature, just do not
?
Thanks,
Tim
On Mon, Sep 15, 2014 at 4:33 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Alon,
No, this will not guarantee that the same set of messages will come in the
same RDD. This fix just replays the messages from the last processed offset
in the same order. Again
I believe this has something to do with how the Kafka High Level API manages
consumers within a Consumer group and how it re-balances during failures. You
can find some mention of it in this Kafka wiki:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
Due to various issues in Kafka
Hi,
Yes, as Jerry mentioned, SPARK-3129 (
https://issues.apache.org/jira/browse/SPARK-3129 ) enabled the WAL feature,
which solves the Driver failure problem. The way 3129 is designed, it
solves the driver failure problem agnostic of the source of the stream
(like Kafka or Flume etc.). But
Or you can use this Low Level Kafka Consumer for Spark :
https://github.com/dibbhatt/kafka-spark-consumer
This is now part of http://spark-packages.org/ and has been running
successfully for the past few months in Pearson's production environment.
Being a Low Level consumer, it does not have this re-balancing
working nicely. I think there is merit in making it
the default Kafka Receiver for Spark Streaming.
-neelesh
On Mon, Feb 2, 2015 at 5:25 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Or you can use this Low Level Kafka Consumer for Spark :
https://github.com/dibbhatt/kafka
Hi Kidong,
No, I have not tried it with Spark 1.2 yet. I will try this out and let
you know how it goes.
By the way, did anything change in the Receiver store method in Spark
1.2?
Regards,
Dibyendu
On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote:
Hi
On Fri, Jan 16, 2015 at 11:50 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Kidong,
No, I have not tried it with Spark 1.2 yet. I will try this out and let
you know how it goes.
By the way, did anything change in the Receiver store method in Spark
1.2
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45
which you can run after changing a few lines of configuration.
Thanks
Best Regards
On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Kidong
You can probably try the Low Level Consumer option with Spark 1.2
https://github.com/dibbhatt/kafka-spark-consumer
This Consumer can recover from any underlying failure of the Spark Platform
or Kafka and either retry or restart the receiver. This has been working
nicely for us.
Regards,
Dibyendu
Thanks Akhil for mentioning this Low Level Consumer (
https://github.com/dibbhatt/kafka-spark-consumer ). Yes, it has a better
fault tolerance mechanism than any existing Kafka consumer available. It
has no data loss on receiver failure and has the ability to replay or restart
itself in case of
Which version of Spark are you running?
You can try this Low Level Consumer:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This is designed to recover from various failures and has a very good fault
recovery mechanism built in. It is being used by many users and at
present
PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Which version of Spark are you running?
You can try this Low Level Consumer:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This is designed to recover from various failures and has a very good
fault recovery
You can probably try the Low Level Consumer from spark-packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ).
How many partitions are there for your topics? Say you have 10 topics,
each having 3 partitions; ideally you can create a maximum of 30 parallel
Receivers, and 30
to do your
transformation on the iterator inside a foreachPartition. Again, this
isn't a con of the direct stream API; it is just a need to understand
how Spark works.
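The per-partition pattern described above looks roughly like this (a generic sketch; `stream` is assumed to be any DStream, and `createConnection` is a hypothetical helper standing in for whatever expensive resource you need):

```scala
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Create expensive resources (e.g. a producer or DB connection)
    // once per partition, not once per record
    val connection = createConnection() // hypothetical helper
    try {
      records.foreach(record => connection.send(record))
    } finally {
      connection.close()
    }
  }
}
```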
On Tue, May 12, 2015 at 10:30 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
The low level consumer
or you can use this Receiver as well :
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Here you can specify how many Receivers you need for your topic, and it
will divide the partitions among the Receivers and return the joined stream
for you.
Say you specified 20 receivers; in
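The division described above amounts to a simple round-robin assignment of partitions to receivers. For example, with 30 partitions and 20 receivers (plain Scala, just to illustrate the arithmetic; the real consumer's scheme may differ):

```scala
val numPartitions = 30
val numReceivers = 20

// Receiver i handles every partition p with p % numReceivers == i:
// receivers 0-9 get two partitions each, receivers 10-19 get one each
val assignment: Map[Int, Seq[Int]] =
  (0 until numPartitions).groupBy(_ % numReceivers).mapValues(_.toList).toMap
```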
Hi,
Just figured out that if I want to perform a graceful shutdown of Spark
Streaming 1.4 (from master), Runtime.getRuntime().addShutdownHook no
longer works. In Spark 1.4 there is Utils.addShutdownHook defined for
Spark Core, which gets called anyway, which leads to graceful shutdown
By the way, this happens when I stopped the Driver process ...
On Tue, May 19, 2015 at 12:29 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
You mean to say within Runtime.getRuntime().addShutdownHook I call
ssc.stop(stopSparkContext = true, stopGracefully = true
PM, Sean Owen so...@cloudera.com wrote:
I don't think you should rely on a shutdown hook. Ideally you try to
stop it in the main exit path of your program, even in case of an
exception.
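Stopping in the main exit path rather than a shutdown hook can be sketched like this (the `ssc.stop` flags are from the Spark Streaming API; `ssc` is an already-constructed StreamingContext):

```scala
try {
  ssc.start()
  ssc.awaitTermination()
} finally {
  // Runs on normal termination and on exceptions alike
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
```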
On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
You mean
On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi,
Just figured out that if I want to perform a graceful shutdown of Spark
Streaming 1.4 (from master), Runtime.getRuntime().addShutdownHook no
longer works. In Spark 1.4
The low level consumer which Akhil mentioned has been running in Pearson
for the last 4-5 months without any downtime. I think this one is the most
reliable Receiver based Kafka consumer for Spark as of today, if you want to
put it that way.
Prior to Spark 1.3, other Receiver based consumers have used Kafka
Hi,
Just to let you know, I have made some enhancements to the Low Level Reliable
Receiver based Kafka Consumer (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ).
The earlier version used as many Receiver tasks as the number of partitions
of your Kafka topic. Now you can configure the desired
Just to add, there is a Receiver based Kafka consumer which uses the Kafka
Low Level Consumer API:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
On Tue, May 19, 2015 at 8:10 PM,
https://github.com/apache/spark/pull/6307
On Tue, May 19, 2015 at 12:51 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Thanks Sean, you are right. If the driver program is running then I can
handle shutdown in the main exit path. But if the Driver machine has crashed
(if you just stop
Hi Snehal
Are you running the latest kafka consumer from github/spark-packages? If
not, can you take the latest changes? This low level receiver will keep
trying if the underlying BlockManager gives an error. Do you see those
retry cycles in the log? If yes, then there is an issue writing
Seems to be related to this JIRA :
https://issues.apache.org/jira/browse/SPARK-3612 ?
On Tue, Jun 9, 2015 at 7:39 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Snehal
Are you running the latest kafka consumer from github/spark-packages? If
not, can you take the latest
Hi,
Can you please share a more detailed stack trace from your receiver logs, and
also the consumer settings you used? I have never tested the consumer with
Kafka 0.7.3... not sure if the Kafka version is the issue. Have you tried
building the consumer using Kafka 0.7.3?
Regards,
Dibyendu
On Wed, Jun 10,
Hi,
There is another option to try: the Receiver Based Low Level Kafka Consumer,
which is part of Spark-Packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ). This can
be used with the WAL as well for end-to-end zero data loss.
This is also a Reliable Receiver and commits offsets to
FileSystem interface, is returning zero.
On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Just to follow up on this thread further:
I was doing some fault tolerance testing of Spark Streaming with Tachyon
as the OFF_HEAP block store. As I said in an earlier
Hi,
Some time back I played with distributed rule processing by integrating
Drools with HBase Co-Processors, invoking rules on any incoming data:
https://github.com/dibbhatt/hbase-rule-engine
You can get some idea of how to use Drools rules if you look at this
RegionObserverCoprocessor:
I think you can also give this consumer a try in your environment:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This has been running fine for a topic with a large number of
Kafka partitions (200) like yours without any issue... no issue with
connections, as this consumer re-uses
Hi,
You can try this Kafka Consumer for Spark, which is also part of Spark
Packages: https://github.com/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Thu, Aug 6, 2015 at 6:48 AM, Sourabh Chandak sourabh3...@gmail.com
wrote:
Thanks Tathagata. I tried that but BlockGenerator internally
Hi,
I have raised a JIRA ( https://issues.apache.org/jira/browse/SPARK-11045 )
to track the discussion, but am also mailing the user group.
This Kafka consumer has been around for a while in spark-packages (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ) and I see
many have started using it; I
If you do not need one-to-one semantics and do not want a strict ordering
guarantee, you can very well use the Receiver based approach, and this
consumer from Spark-Packages (
https://github.com/dibbhatt/kafka-spark-consumer ) can give a much better
alternative in terms of performance and
Hi ,
Just would like to clarify a few doubts I have about how the BlockManager
behaves. This is mostly in regard to the Spark Streaming Context.
There are two possible cases where Blocks may get dropped / not stored in
memory:
Case 1. While writing the Block with the MEMORY_ONLY setting, if the Node's
BlockManager does
Some time back I was playing with Spark and Tachyon and I also found this
issue. The issue here is that the TachyonBlockManager puts the blocks in with
the WriteType.TRY_CACHE configuration. Because of this, blocks are evicted
from the Tachyon cache when memory is full, and when Spark tries to find the
block it throws
The URL seems to have changed... here is the new one:
http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html
On Wed, Aug 26, 2015 at 12:32 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Sometime back I was playing with Spark and Tachyon and I also found
Dear All,
Just released the 1.0.4 version of the Low Level Receiver based
Kafka-Spark-Consumer on spark-packages.org. You can find the latest
release here:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Here is the github location:
Hi,
This is running in production in many organizations that have adopted
this consumer as an alternative option. The Consumer will run with Spark
1.3.1.
It has been running in Pearson for some time in production.
It is part of Spark packages, and you can see how to include it in your
Hi,
Just to clarify one point which may not be clear to many: if someone
decides to use the Receiver based approach, the best option at this point is
to use https://github.com/dibbhatt/kafka-spark-consumer. This will also
work with the WAL like any other receiver based consumer. The major issue with
Hi Michal,
If you use https://github.com/dibbhatt/kafka-spark-consumer , it comes
with its own built-in back pressure mechanism. By default this is disabled;
you need to enable it to use this feature with this consumer. It controls
the rate based on the Scheduling Delay at runtime.
Regards,
byendu,
>
> How does one go about configuring spark streaming to use tachyon as its
> place for storing checkpoints? Also, can one do this with tachyon running
> on a completely different node than where spark processes are running?
>
> Thanks
> Nikunj
>
>
> On Thu
> see one node receiving the kafka flow.
> (I use spark 1.3.1)
>
> Tks
> Nicolas
>
>
> ----- Mail original -----
> De: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
> À: nib...@free.fr
> Cc: "Cody Koeninger" <c...@koeninger.or
Hi,
If you need to use the Receiver based approach, you can try this one:
https://github.com/dibbhatt/kafka-spark-consumer
This is also part of Spark packages :
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
You just need to specify the number of Receivers you want for the desired
s (number of nodes), how RDD will be
> distributed over the nodes/core.
> For example in my example I have 4 nodes (with 2 cores)
>
> Tks
> Nicolas
>
>
> ----- Mail original -----
> De: "Dibyendu Bhattacharya" <dibyendu.bhattach...@gmail.com>
> À: nib...@fre
In the direct stream, the checkpoint location is not recoverable if you
modify your driver code. So if you rely only on the checkpoint to commit
offsets, you can possibly lose messages if you modify the driver code and you
select offsets from the "largest" offset. If you do not want to lose
messages, you need to
, silently throwing away data because your system
> isn't working right is not the same as "recover from any Kafka leader
> changes and offset out of ranges issue".
>
>
>
> On Tue, Dec 1, 2015 at 11:27 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com>
the problem without actually addressing it".
>
> The only really correct way to deal with that situation is to recognize
> why it's happening, and either increase your Kafka retention or increase
> the speed at which you are consuming.
>
> On Wed, Dec 2, 2015 at 7:13 PM, D
Hi, if you use the Receiver based consumer which is available in
spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ),
it has failure recovery built in and can recover from any Kafka leader
changes and offset out of range issues.
Here is the package form github
0
>
> Cheers
>
> On Thu, Jan 7, 2016 at 2:39 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> You are using low level spark kafka consumer . I am the author of the
>> same.
>>
>> Are you using the spark-packages version ? if y
Thu, Jan 7, 2016 at 11:39 AM, Dibyendu Bhattacharya
> <dibyendu.bhattach...@gmail.com> wrote:
> > You are using low level spark kafka consumer . I am the author of the
> same.
>
> If I may ask, what are the differences between this and the direct
> version shipped
In a Spark Streaming job, I see a Batch failed with the following error. I
haven't seen anything like this earlier.
This happened during the Shuffle for one Batch (it hasn't reoccurred after
that). Just curious to know what can cause this error. I am running Spark
1.5.1
Regards,
Dibyendu
Job aborted
You can get some good pointers in this JIRA:
https://issues.apache.org/jira/browse/SPARK-15796
Dibyendu
On Thu, Jul 14, 2016 at 12:53 AM, Sunita wrote:
> I am facing the same issue. Upgrading to Spark 1.6 is causing a huge
> performance loss. Could you solve this issue?
Hi ,
Released the latest version of the Receiver based Kafka Consumer for Spark
Streaming.
Available at Spark Packages : https://spark-packages.org/package/dibbhatt/
kafka-spark-consumer
Also at github : https://github.com/dibbhatt/kafka-spark-consumer
Some key features
- Tuned for better
Hi ,
Released the latest version of the Receiver based Kafka Consumer for Spark
Streaming.
The Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and
all Spark versions.
Available at Spark Packages :
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Also at github :
from my iPhone
>
> On Aug 25, 2016, at 6:33 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
> Hi ,
>
> Released latest version of Receiver based Kafka Consumer for Spark
> Streaming.
>
> Receiver is compatible with Kafka versions 0.8.x, 0.
Hi,
You could also use this Receiver :
https://github.com/dibbhatt/kafka-spark-consumer
This is part of spark-packages also :
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
You do not need to enable the WAL with this, and you can still recover from
Driver failure with no data loss. You can
Hi,
I am trying something like this:
val sesDS: Dataset[XXX] = hiveContext.sql(select).as[XXX]
The select statement is something like this: "select * from sometable
DISTRIBUTE BY col1, col2, col3"
Then comes groupByKey:
val gpbyDS = sesDS.groupByKey(x => (x.col1, x.col2, x.col3))
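A typical continuation of such a snippet would aggregate each group; for example (a sketch only; `XXX` and the column names are the original poster's placeholders, and the aggregation choice is an assumption):

```scala
// KeyValueGroupedDataset supports count, mapGroups, reduceGroups, etc.
val countsDS = gpbyDS.count()        // Dataset of (key, count) pairs
val firstPerGroup = gpbyDS.mapGroups { (key, rows) =>
  (key, rows.next())                 // one representative row per group
}
```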