Here's one reason I might want to be able to tell whether a given
offset is a transactional marker:
https://issues.apache.org/jira/browse/SPARK-24720
Alternatively, is there any efficient way to tell what the offset of
the last actual record in a topicpartition is (i.e. like endOffsets)?
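For reference, a rough sketch of the closest existing API: the consumer's
endOffsets gives the log-end offset, which with transactional producers can
land past the last real record because of the commit/abort markers. The
broker address, topic, and partition below are made-up placeholders.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("some-topic", 0)        // hypothetical topic/partition

// endOffsets returns the log-end offset; if the partition contains
// transaction markers, the last actual record may sit at a smaller offset.
val end: Long = consumer.endOffsets(List(tp).asJava).get(tp)
consumer.close()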
On Thu,
I can't speak for committers, but my guess is it's more likely for
DStreams in general to stop being supported before that particular
integration is removed.
On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote:
> Thanks Ted.
>
> I see createDirectStream is experimental as annotated with
> "org.ap
https://issues.apache.org/jira/browse/SPARK-19680
and
https://issues.apache.org/jira/browse/KAFKA-3370
have a good explanation.
Verify that it works correctly with auto.offset.reset set to latest, to rule
out other issues.
Then try providing explicit starting offsets reasonably near the
beginning of
Can anyone clarify what (other than the known cases of compaction or
transactions) could be causing non-contiguous offsets?
That sounds like a potential defect, given that I ran billions of
messages a day through the kafka 0.8.x series for years without seeing
that.
On Tue, Jan 23, 2018 at 3:35 PM, J
rn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework for shuffling
>> amongst nodes, short of writing into an i
ective table in Cassandra / Postgres. (select .. from data where user =
>> ? and date between ? and ? and some_field = ?)
>>
>> How will Spark Streaming help w/ aggregation? Couldn't the data be queried
>> from Cassandra / Postgres via the Kafka consumer and aggregated
Spark streaming helps with aggregation because
A. raw kafka consumers have no built in framework for shuffling
amongst nodes, short of writing into an intermediate topic (I'm not
touching Kafka Streams here, I don't have experience), and
B. it deals with batches, so you can transactionally decide
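The snippet above is cut off, but point B refers to the usual pattern with
the direct stream: write a batch's results and the Kafka offsets that
produced them in the same database transaction, so both commit or neither
does. A minimal sketch under assumed names (the counts and kafka_offsets
tables, their columns, and the JDBC URL are invented for illustration; the
OffsetRange import path depends on which connector version you use):

import java.sql.DriverManager
import org.apache.spark.streaming.kafka.OffsetRange

// Store one batch's aggregated counts together with the Kafka offsets that
// produced them, atomically.
def storeBatch(jdbcUrl: String,
               counts: Seq[(String, Long)],
               offsets: Seq[OffsetRange]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)

    val insertCount = conn.prepareStatement(
      "insert into counts (word, n) values (?, ?)")
    counts.foreach { case (word, n) =>
      insertCount.setString(1, word)
      insertCount.setLong(2, n)
      insertCount.addBatch()
    }
    insertCount.executeBatch()

    val saveOffset = conn.prepareStatement(
      "update kafka_offsets set until_offset = ? where topic = ? and partition = ?")
    offsets.foreach { o =>
      saveOffset.setLong(1, o.untilOffset)
      saveOffset.setString(2, o.topic)
      saveOffset.setInt(3, o.partition)
      saveOffset.addBatch()
    }
    saveOffset.executeBatch()

    conn.commit()   // results and offsets succeed or fail together
  } catch {
    case e: Exception =>
      conn.rollback()
      throw e
  } finally {
    conn.close()
  }
}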
lt tolerant as
>> possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct
>> Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger
>> wrote:
>>>
>>> I wouldn't give up the flexibility an
ill be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger wrote:
>
> Spark direct stream is just fine for this use case.
> But why postgres and not cassandra?
> Is there anything specific here that I may not be aware of?
>
> Thanks
> Deepak
>
> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote:
>>
>> How are you going to handle et
How are you going to handle etl failures? Do you care about lost /
duplicated data? Are your writes idempotent?
Absent any other information about the problem, I'd stay away from
cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
feeding postgres.
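As a rough illustration of what "idempotent writes" can mean for a direct
stream feeding postgres: key the target table on something stable in the
data and upsert, so a replayed batch overwrites instead of duplicating. The
events table, its columns, and the DStream element type below are
assumptions; ON CONFLICT needs Postgres 9.5 or later.

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

// `events` is a hypothetical DStream of (eventId, payload) pairs.
def writeIdempotently(events: DStream[(String, String)], jdbcUrl: String): Unit = {
  events.foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        // Upsert keyed on event_id: replaying the same records is harmless.
        val stmt = conn.prepareStatement(
          "insert into events (event_id, payload) values (?, ?) " +
            "on conflict (event_id) do update set payload = excluded.payload")
        iter.foreach { case (id, payload) =>
          stmt.setString(1, id)
          stmt.setString(2, payload)
          stmt.executeUpdate()
        }
      } finally {
        conn.close()
      }
    }
  }
}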
On Thu, Sep 29, 2016 at 10:04 A
poll() between the seek since position() will block to get
> the new offset.
>
> -Jason
>
> On Mon, Mar 14, 2016 at 2:37 PM, Cody Koeninger wrote:
>
>> Sorry, by metadata I also meant the equivalent of the old
>> OffsetRequest api, which partitionsFor doesn't giv
We'd probably want to understand why those are insufficient
> before considering new APIs.
>
> -Jason
>
> On Mon, Mar 14, 2016 at 12:17 PM, Cody Koeninger wrote:
>
>> Regarding the rebalance listener, in the case of the spark
>> integration, it is possible a job can
to solve. At this point, with 0.10 shortly on the way, it
> seems unlikely that incompatible changes to the API will be accepted.
> However, if someone can propose a compatible solution which addresses some
> of the concerns mentioned, we'd love to hear about it!
>
> -Jason
>
nge on the consumer API, please
> feel free to create a new KIP with the detailed motivation and proposed
> modifications.
>
> Guozhang
>
> On Fri, Mar 11, 2016 at 12:28 PM, Cody Koeninger wrote:
>
>> Is there a KIP or Jira related to " working on improving these cases
>
sed to rebalance
>> callback. I have run into issues several times because of that.
>>
>> About - subscribe vs assign - I have not read through your spark code yet
>> (will do by eod), so I am not sure what you mean (other than I do agree
>> that new partitions sh
order to do anything meaningful with the consumer itself in rebalance
>> callback (e.g. commit offset), you would need to hold on to the consumer
>> reference; admittedly it sounds a bit awkward, but by design we choose to
>> not enforce it in the interface itself.
>>
>
mmit offset), you would need to hold on to the consumer
> reference; admittedly it sounds a bit awkward, but by design we choose to
> not enforce it in the interface itself.
>
> Guozhang
>
> On Wed, Mar 9, 2016 at 3:39 PM, Cody Koeninger wrote:
>
>> So what about my comments reg
11 PM, Guozhang Wang wrote:
>
> > Filed https://issues.apache.org/jira/browse/KAFKA-3370.
> >
> > On Wed, Mar 9, 2016 at 1:11 PM, Cody Koeninger
> wrote:
> >
> >> That sounds like an interesting way of addressing the problem, can
> >> continue fu
he reason is that the
> seekToXX was actually not designed to do such initialization but for
> calling during the lifetime of the consumer, and we'd better provide the
> right solution to do so.
>
> I can file the JIRA right away and start further discussions there. But let
> me know i
to get all the partitions assigned
> to itself (i.e. you are only running a single instance).
>
> Guozhang
>
>
> On Wed, Mar 9, 2016 at 6:22 AM, Cody Koeninger wrote:
>
>> Another unfortunate thing about ConsumerRebalanceListener is that in
>> order to do meaningful work
o the consumer. Seems like this makes it unnecessarily
awkward to serialize or provide a 0 arg constructor for the listener.
On Wed, Mar 9, 2016 at 7:28 AM, Cody Koeninger wrote:
> I thought about ConsumerRebalanceListener, but seeking to the
> beginning any time there's a rebalance for wh
> public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
>     consumer.seekToBeginning(partitions.toArray(new TopicPartition[0]));
> }
> };
>
> consumer.subscribe(topics, listener);
>
> On Wed, Mar 9, 2016 at 12:05 PM, Cody Koenin
upon calling seekToEnd/Beginning with no parameter, while no assignment is
> done yet, do the coordination behind the scenes; it will, though, change the
> behavior of the functions as they are no longer always lazily evaluated.
>
>
> Guozhang
>
>
> On Tue, Mar 8, 2016 at 2:08 PM
Using the 0.9 consumer, I would like to start consuming at the
beginning or end, without specifying auto.offset.reset.
This does not seem to be possible:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> conf.getString("kafka.brokers"),
  "key.deserializer" -> classOf[St
Spark's direct stream kafka integration should take advantage of data
locality if you're running Spark executors on the same nodes as Kafka
brokers.
On Wed, Nov 25, 2015 at 9:50 AM, Dave Ariens wrote:
> I just finished reading up on Kafka Connect <http://kafka.apache.org/documentation.html#con
Also, if you actually want to use kafka, you're much better off with a
replication factor greater than 1, so you get leader re-election.
On Fri, Nov 20, 2015 at 9:20 AM, Cody Koeninger wrote:
> Spark specific questions are better directed to the Spark user list.
>
> Spark wil
Spark specific questions are better directed to the Spark user list.
Spark will retry failed tasks automatically up to a configurable number of
times. The direct stream will retry failures on the driver up to a
configurable number of times.
See
http://spark.apache.org/docs/latest/configuration.
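The link above is truncated; the settings being referred to are, to the best
of my knowledge, spark.task.maxFailures for executor-side task retries and
spark.streaming.kafka.maxRetries for the direct stream's driver-side offset
fetches. The values below are arbitrary examples:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // how many times a failing task is retried before the stage is failed
  .set("spark.task.maxFailures", "8")
  // driver-side retries when the direct stream fetches offsets from Kafka leaders
  .set("spark.streaming.kafka.maxRetries", "3")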
Questions about Spark-kafka integration are better directed to the Spark
user mailing list.
I'm not 100% sure what you're asking. The spark createDirectStream api
will not store any offsets internally, unless you enable checkpointing.
On Sun, Nov 1, 2015 at 10:26 PM, Charan Ganga Phani Adabala
Questions about spark's kafka integration should probably be directed to
the spark user mailing list, not this one. I don't monitor kafka mailing
lists as closely, for instance.
For the direct stream, Spark doesn't keep any state regarding offsets,
unless you enable checkpointing. Have you read
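Since the direct stream keeps no offset state of its own outside of
checkpoints, the usual pattern is to pull each batch's offset ranges out of
the RDD and store them yourself. A rough sketch, assuming the stream comes
from createDirectStream and the import path matches your connector version:

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// The cast to HasOffsetRanges only works on the RDDs the direct stream
// itself produces, before any transformation.
def trackOffsets(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    val offsetRanges: Array[OffsetRange] =
      rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // ... process rdd here ...

    offsetRanges.foreach { o =>
      // persist o.topic, o.partition, o.fromOffset, o.untilOffset wherever
      // you track progress (ideally alongside the results themselves)
    }
  }
}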
scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Thanks,
> Sourabh
>
> On Thu, Sep 24, 2015 at 2:04 PM, Cody Koeninger
> wrote:
>
>> That looks like the OOM is in the driver, when getting partition metadata
>> to create the d
That looks like the OOM is in the driver, when getting partition metadata
to create the direct stream. In that case, executor memory allocation
doesn't matter.
Allocate more driver memory, or put a profiler on it to see what's taking
up heap.
On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak
w
Yeah, the direct api uses the simple consumer
On Fri, Aug 28, 2015 at 1:32 PM, Cassa L wrote:
> Hi I am using below Spark jars with Direct Stream API.
> spark-streaming-kafka_2.10
>
> When I look at its pom.xml, Kafka libraries that its pulling in is
>org.apache.kafka
>kafka_${scal