Re: Viewing transactional markers in client

2018-08-04 Thread Cody Koeninger
Here's one reason I might want to be able to tell whether a given offset is a transactional marker: https://issues.apache.org/jira/browse/SPARK-24720 Alternatively, is there any efficient way to tell what the offset of the last actual record in a topicpartition is (i.e. something like endOffsets)? On Thu, …
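
For reference, a rough sketch of the inefficient workaround this implies, using only standard consumer calls (the string serde and the 100-record look-back window are assumptions for illustration, not part of the thread):

    import java.util.Arrays
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import scala.collection.JavaConverters._

    // endOffsets returns the log-end offset, which with transactional
    // producers can sit past the last actual record because of commit/abort
    // markers. Polling the tail of the log is one (inefficient) way to find
    // the last real record.
    def lastRecordOffset(consumer: KafkaConsumer[String, String],
                         tp: TopicPartition): Option[Long] = {
      val end = consumer.endOffsets(Arrays.asList(tp)).get(tp)
      val begin = consumer.beginningOffsets(Arrays.asList(tp)).get(tp)
      consumer.assign(Arrays.asList(tp))
      consumer.seek(tp, math.max(begin, end - 100)) // inspect the log tail
      val records = consumer.poll(1000).records(tp).asScala
      records.lastOption.map(_.offset) // None if the tail is all markers
    }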

Re: KafkaUtils.createStream(..) is removed for API

2018-02-19 Thread Cody Koeninger
I can't speak for committers, but my guess is it's more likely for DStreams in general to stop being supported before that particular integration is removed. On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote: > Thanks Ted. > > I see createDirectStream is experimental, as annotated with > "org.ap…

Re: org.apache.kafka.clients.consumer.OffsetOutOfRangeException

2018-02-12 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-19680 and https://issues.apache.org/jira/browse/KAFKA-3370 have good explanations. Verify that it works correctly with auto.offset.reset set to latest, to rule out other issues. Then try providing explicit starting offsets reasonably near the beginning of …
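
For context, a hedged sketch of both steps with the spark-streaming-kafka-0-10 integration (broker address, topic, group id, and the offset 42 are placeholders; an existing StreamingContext ssc is assumed):

    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // step 1: rule out other issues by starting from the latest offsets
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "debug-group",
      "auto.offset.reset" -> "latest")

    // step 2: pin explicit starting offsets reasonably near (not at) the
    // beginning, so retention can't delete them out from under the job
    val fromOffsets = Map(new TopicPartition("mytopic", 0) -> 42L)
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("mytopic"), kafkaParams, fromOffsets))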

Re: Contiguous Offsets on non-compacted topics

2018-01-24 Thread Cody Koeninger
Can anyone clarify what (other than the known cases of compaction or transactions) could be causing non-contiguous offsets? That sounds like a potential defect, given that I ran billions of messages a day through the kafka 0.8.x series for years without seeing that. On Tue, Jan 23, 2018 at 3:35 PM, J…
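
A quick way to check a topicpartition for this (a sketch, assuming a consumer already assigned to a non-empty tp): read from the beginning and flag any jump between consecutive offsets.

    import java.util.Arrays
    import scala.collection.JavaConverters._

    // offsets in a partition should increase by exactly 1 unless compaction
    // or transaction markers removed some
    val end = consumer.endOffsets(Arrays.asList(tp)).get(tp)
    consumer.seekToBeginning(Arrays.asList(tp))
    var expected = -1L
    while (expected < end) {
      for (r <- consumer.poll(1000).records(tp).asScala) {
        if (expected >= 0 && r.offset != expected)
          println(s"gap in $tp: expected $expected, saw ${r.offset}")
        expected = r.offset + 1
      }
    }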

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
…rn… or secure for that matter… ;-) > >> On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote: >> >> Spark streaming helps with aggregation because >> >> A. raw kafka consumers have no built-in framework for shuffling >> amongst nodes, short of writing into an intermediate topic…

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
…respective table in Cassandra / Postgres. (select .. from data where user = >> ? and date between ? and ? and some_field = ?) >> >> How will Spark Streaming help w/ aggregation? Couldn't the data be queried >> from Cassandra / Postgres via the Kafka consumer and aggregated …

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built-in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here; I don't have experience with it), and B. it deals with batches, so you can transactionally decide …

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
…fault tolerant as >> possible. >> >> What's the advantage of using Spark for reading Kafka instead of direct >> Kafka consumers? >> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger >> wrote: >>> >>> I wouldn't give up the flexibility an…

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
…will be > idempotent but not inserts. > > Data should not be lost. The system should be as fault tolerant as possible. > > What's the advantage of using Spark for reading Kafka instead of direct > Kafka consumers? > > On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger wrote: …

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> Spark direct stream is just fine for this use case. > But why postgres and not cassandra? > Is there anything specific here that I may not be aware of? > > Thanks > Deepak > > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote: >> >> How are you going to handle etl failures? …

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04 A…
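
The pattern being recommended here, sketched with a made-up JDBC URL and offsets table (and assuming a 0-10 direct stream named stream, like the one sketched earlier): write each batch's results and its offset ranges in the same postgres transaction, so a failed batch replays from exactly the offsets the database has seen.

    import java.sql.DriverManager
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val conn = DriverManager.getConnection("jdbc:postgresql://db/etl")
      try {
        conn.setAutoCommit(false)
        // ... write the batch's (driver-collected) results here ...
        val st = conn.prepareStatement(
          "update offsets set until = ? where topic = ? and part = ?")
        for (r <- ranges) {
          st.setLong(1, r.untilOffset)
          st.setString(2, r.topic)
          st.setInt(3, r.partition)
          st.executeUpdate()
        }
        conn.commit() // results and offsets land atomically
      } finally conn.close() // closing without commit rolls back on failure
    }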

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-15 Thread Cody Koeninger
…poll() between the seek since position() will block to get > the new offset. > > -Jason > > On Mon, Mar 14, 2016 at 2:37 PM, Cody Koeninger wrote: > >> Sorry, by metadata I also meant the equivalent of the old >> OffsetRequest api, which partitionsFor doesn't give…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
We'd probably want to understand why those are insufficient > before considering new APIs. > > -Jason > > On Mon, Mar 14, 2016 at 12:17 PM, Cody Koeninger wrote: > >> Regarding the rebalance listener, in the case of the spark >> integration, it is possible a job can…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
…to solve. At this point, with 0.10 shortly on the way, it > seems unlikely that incompatible changes to the API will be accepted. > However, if someone can propose a compatible solution which addresses some > of the concerns mentioned, we'd love to hear about it! > > -Jason

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
…change on the consumer API, please > feel free to create a new KIP with the detailed motivation and proposed > modifications. > > Guozhang > > On Fri, Mar 11, 2016 at 12:28 PM, Cody Koeninger wrote: > >> Is there a KIP or Jira related to "working on improving these cases…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-11 Thread Cody Koeninger
…passed to rebalance >> callback. I have run into issues several times because of that. >> >> About subscribe vs assign - I have not read through your spark code yet >> (will do by eod), so I am not sure what you mean (other than I do agree >> that new partitions should…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-10 Thread Cody Koeninger
…in order to do anything meaningful with the consumer itself in the rebalance >> callback (e.g. commit offsets), you would need to hold on to the consumer >> reference; admittedly it sounds a bit awkward, but by design we chose to >> not enforce it in the interface itself. …

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-10 Thread Cody Koeninger
…commit offsets), you would need to hold on to the consumer > reference; admittedly it sounds a bit awkward, but by design we chose to > not enforce it in the interface itself. > > Guozhang > > On Wed, Mar 9, 2016 at 3:39 PM, Cody Koeninger wrote: >> So what about my comments regarding…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
…11 PM, Guozhang Wang wrote: > > > Filed https://issues.apache.org/jira/browse/KAFKA-3370. > > > > On Wed, Mar 9, 2016 at 1:11 PM, Cody Koeninger > wrote: > > > >> That sounds like an interesting way of addressing the problem, can > >> continue fu…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
…The reason is that the > seekToXX was actually not designed to do such initialization but for > calling during the lifetime of the consumer, and we'd better provide the > right solution to do so. > > I can file the JIRA right away and start further discussions there. But let > me know if…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
…to get all the partitions assigned > to itself (i.e. you are only running a single instance). > > Guozhang > > On Wed, Mar 9, 2016 at 6:22 AM, Cody Koeninger wrote: > >> Another unfortunate thing about ConsumerRebalanceListener is that in >> order to do meaningful work…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
…to the consumer. Seems like this makes it unnecessarily awkward to serialize the listener or provide a 0-arg constructor for it. On Wed, Mar 9, 2016 at 7:28 AM, Cody Koeninger wrote: > I thought about ConsumerRebalanceListener, but seeking to the > beginning any time there's a rebalance for wh…

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
…public void onPartitionsAssigned(Collection&lt;TopicPartition&gt; partitions) { > consumer.seekToBeginning(partitions.toArray(new > TopicPartition[0])); > } > }; > > consumer.subscribe(topics, listener); > > On Wed, Mar 9, 2016 at 12:05 PM, Cody Koeninger…
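
Filled out in Scala against the 0.9-era API (where seekToBeginning took varargs rather than a Collection), the quoted workaround looks roughly like this; consumer and topics are assumed to already exist:

    import java.util.{Collection => JCollection}
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
    import org.apache.kafka.common.TopicPartition
    import scala.collection.JavaConverters._

    // the listener needs the consumer reference, which is exactly what makes
    // it awkward to serialize or give a 0-arg constructor
    val listener = new ConsumerRebalanceListener {
      override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
        consumer.seekToBeginning(partitions.asScala.toSeq: _*)
      override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = ()
    }
    consumer.subscribe(topics, listener)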

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-08 Thread Cody Koeninger
…upon calling seekToEnd/Beginning with no parameter, while no assignment is > done yet, do the coordination behind the scenes; it will though change the > behavior of the functions, as they are no longer always lazily evaluated. > > > Guozhang > > > On Tue, Mar 8, 2016 at 2:08 PM…

seekToBeginning doesn't work without auto.offset.reset

2016-03-08 Thread Cody Koeninger
Using the 0.9 consumer, I would like to start consuming at the beginning or end, without specifying auto.offset.reset. This does not seem to be possible: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> conf.getString("kafka.brokers"), "key.deserializer" -> classOf[StringDeserializer], …
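
The code is cut off by the archive; a reconstruction of the setup plus the workaround the thread converges on (broker and topic are placeholders, and Properties stands in for the original Map): subscribe, poll once so the coordinator actually assigns partitions, then seek.

    import java.util.{Arrays, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.serialization.StringDeserializer
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)
    props.put("group.id", "example-group")
    // note: no auto.offset.reset

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("mytopic"))
    consumer.poll(0) // seek is lazy; an initial poll forces assignment
    consumer.seekToBeginning(consumer.assignment().asScala.toSeq: _*)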

Re: Kafka Connect and Spark/Storm Comparisons

2015-11-25 Thread Cody Koeninger
Spark's direct stream kafka integration should take advantage of data locality if you're running Spark executors on the same nodes as Kafka brokers. On Wed, Nov 25, 2015 at 9:50 AM, Dave Ariens wrote: > I just finished reading up on Kafka Connect <http://kafka.apache.org/documentation.html#con…>
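
In the later spark-streaming-kafka-0-10 integration that preference is made explicit (a one-liner, for reference):

    import org.apache.spark.streaming.kafka010.LocationStrategies

    // PreferBrokers only helps when executors run on the broker hosts;
    // otherwise PreferConsistent is the usual choice
    val locationStrategy = LocationStrategies.PreferBrokers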

Re: Couldn't find leaders for Set([TOPICNNAME,0]) when we are using it in Apache Spark

2015-11-20 Thread Cody Koeninger
Also, if you actually want to use kafka, you're much better off with a replication factor greater than 1, so you get leader re-election. On Fri, Nov 20, 2015 at 9:20 AM, Cody Koeninger wrote: > Spark-specific questions are better directed to the Spark user list. > > Spark will…

Re: Couldn't find leaders for Set([TOPICNNAME,0]) when we are using it in Apache Spark

2015-11-20 Thread Cody Koeninger
Spark-specific questions are better directed to the Spark user list. Spark will retry failed tasks automatically up to a configurable number of times, and the direct stream will retry failures on the driver up to a configurable number of times. See http://spark.apache.org/docs/latest/configuration. …
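
The knobs in question, as they would appear in spark-defaults.conf (the values are illustrative, not recommendations):

    # executor-side task retries
    spark.task.maxFailures 8
    # driver-side retries in the (0.8) direct stream
    spark.streaming.kafka.maxRetries 3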

Re: Regarding The Kafka Offset Management Issue In Direct Stream Approach.

2015-11-06 Thread Cody Koeninger
Questions about Spark-kafka integration are better directed to the Spark user mailing list. I'm not 100% sure what you're asking. The spark createDirectStream api will not store any offsets internally, unless you enable checkpointing. On Sun, Nov 1, 2015 at 10:26 PM, Charan Ganga Phani Adabala …
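
If checkpointing is off, offset tracking is entirely up to the application. One option in the newer 0-10 integration (a sketch, assuming a stream from createDirectStream) is to grab each batch's ranges and commit them back to Kafka once processing succeeds:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
    }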

Re: Regarding the Kafka offset management issue in Direct Stream Approach.

2015-10-26 Thread Cody Koeninger
Questions about spark's kafka integration should probably be directed to the spark user mailing list, not this one. I don't monitor kafka mailing lists as closely, for instance. For the direct stream, Spark doesn't keep any state regarding offsets, unless you enable checkpointing. Have you read …

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-25 Thread Cody Koeninger
…scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > Thanks, > Sourabh > > On Thu, Sep 24, 2015 at 2:04 PM, Cody Koeninger > wrote: > >> That looks like the OOM is in the driver, when getting partition metadata >> to create the direct stream…

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Cody Koeninger
That looks like the OOM is in the driver, when getting partition metadata to create the direct stream. In that case, executor memory allocation doesn't matter. Allocate more driver memory, or put a profiler on it to see what's taking up heap. On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak wrote: …
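
For example, at submit time (4g, the class name, and the jar are placeholders):

    spark-submit --driver-memory 4g --class com.example.Main app.jar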

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cody Koeninger
Yeah, the direct api uses the simple consumer. On Fri, Aug 28, 2015 at 1:32 PM, Cassa L wrote: > Hi, I am using the below Spark jars with the Direct Stream API. > spark-streaming-kafka_2.10 > > When I look at its pom.xml, the Kafka library that it's pulling in is > org.apache.kafka > kafka_${scal…