Re: Structured streaming from Kafka by timestamp

2019-02-05 Thread Cody Koeninger
To be more explicit, the easiest thing to do in the short term is use your own instance of KafkaConsumer to get the offsets for the timestamps you're interested in, using offsetsForTimes, and use those for the start / end offsets. See
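
A minimal sketch of that short-term approach (broker, topic, and timestamp below are placeholders, not from the thread; offsetsForTimes requires kafka-clients 0.10.1 or later):

    import java.{util => ju}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    val props = new ju.Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)

    // for each partition, ask for the earliest offset whose timestamp is >= the given one
    val timestamps = new ju.HashMap[TopicPartition, java.lang.Long]()
    timestamps.put(new TopicPartition("mytopic", 0), 1549324800000L) // placeholder ms timestamp
    val offsets = consumer.offsetsForTimes(timestamps) // null value for a partition = no match
    consumer.close()

The resulting offsets can then be fed to the kafka source as the startingOffsets / endingOffsets json option, e.g. {"mytopic":{"0":42}}.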

Re: Back pressure not working on streaming

2019-02-05 Thread Cody Koeninger
That article is pretty old. If you click through the link to the jira mentioned in it, https://issues.apache.org/jira/browse/SPARK-18580, you'll see it's been resolved. On Wed, Jan 2, 2019 at 12:42 AM JF Chen wrote: > > yes, 10 is a very low value for testing initial rate. > And from this article >

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-31 Thread Cody Koeninger
id, as well > as simply moving to use only a single one of our clusters. Neither of these > were successful. I am not able to run a test against master now. > > Regards, > > Bryan Jeffrey > > > > > On Thu, Aug 30, 2018 at 2:56 PM Cody Koeninger wrote: >

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-30 Thread Cody Koeninger
Ortiz Fernández wrote: > I can't... do you think that it's a possible bug of this version?? from > Spark or Kafka? > > On Wed., Aug 29, 2018 at 22:28, Cody Koeninger () > wrote: >> >> Are you able to try a recent version of spark? >> >> On Wed, Aug 29, 201

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-30 Thread Cody Koeninger
I doubt that fix will get backported to 2.3.x. Are you able to test against master? 2.4 with the fix you linked to is likely to hit code freeze soon. From a quick look at your code, I'm not sure why you're mapping over an array of brokers. It seems like that would result in different streams

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Cody Koeninger
Are you able to try a recent version of spark? On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > > I couldn't see any error or problem among the machines, anybody has the >

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
at 3:33 PM, Bryan Jeffrey wrote: > Cody, > > Where is that called in the driver? The only call I see from Subscribe is to > load the offset from checkpoint. > > Get Outlook for Android > > > From: Cody Koeninger > Sent: Thursday, June 14,

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
m checkpoint. > > Thank you! > > Bryan > > Get Outlook for Android > > ____ > From: Cody Koeninger > Sent: Thursday, June 14, 2018 4:00:31 PM > To: Bryan Jeffrey > Cc: user > Subject: Re: Kafka Offset Storage: Fetching Off

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
The expectation is that you shouldn't have to manually load offsets from kafka, because the underlying kafka consumer on the driver will start at the offsets associated with the given group id. That's the behavior I see with this example:
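
For reference, a minimal direct stream along those lines (broker, topic, and group id are placeholders; ssc is an existing StreamingContext):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092", // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // committed offsets for this group id determine where the stream starts
      "group.id" -> "my-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))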

Re: [Structured-Streaming][Beginner] Out of order messages with Spark kafka readstream from a specific partition

2018-05-10 Thread Cody Koeninger
As long as you aren't doing any spark operations that involve a shuffle, the order you see in spark should be the same as the order in the partition. Can you link to a minimal code example that reproduces the issue? On Wed, May 9, 2018 at 7:05 PM, karthikjay wrote: > On the

Re: Structured streaming: Tried to fetch $offset but the returned record offset was ${record.offset}"

2018-04-17 Thread Cody Koeninger
Is this possibly related to the recent post on https://issues.apache.org/jira/browse/SPARK-18057 ? On Mon, Apr 16, 2018 at 11:57 AM, ARAVIND SETHURATHNAM < asethurath...@homeaway.com.invalid> wrote: > Hi, > > We have several structured streaming jobs (spark version 2.2.0) consuming > from kafka

Re: is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1?

2018-03-16 Thread Cody Koeninger
Should be able to use the 0.8 kafka dstreams with a kafka 0.9 broker On Fri, Mar 16, 2018 at 7:52 AM, kant kodali wrote: > Hi All, > > is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1? > > Thanks, > kant

Re: KafkaUtils.createStream(..) is removed for API

2018-02-19 Thread Cody Koeninger
I can't speak for committers, but my guess is it's more likely for DStreams in general to stop being supported before that particular integration is removed. On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote: > Thanks Ted. > > I see createDirectStream is

Re: Providing Kafka configuration as Map of Strings

2018-01-24 Thread Cody Koeninger
Have you tried passing in a Map that happens to have string for all the values? I haven't tested this, but the underlying kafka consumer constructor is documented to take either strings or objects as values, despite the static type. On Wed, Jan 24, 2018 at 2:48 PM, Tecno Brain

Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-24 Thread Cody Koeninger
wise.com> > Cc: user<user@spark.apache.org>; Cody Koeninger<c...@koeninger.org> > Sent: Wednesday, January 24, 2018 14:45 > Subject: Re: uncontinuous offset in kafka will cause the spark streaming failure > > Yes. My spark streaming application works with uncompacted topic. I will >

Re: "Got wrong record after seeking to offset" issue

2018-01-18 Thread Cody Koeninger
nd I can run to > check compaction I’m happy to give that a shot too. > > I’ll try consuming from the failed offset if/when the problem manifests > itself again. > > Thanks! > Justin > > > On Wednesday, January 17, 2018, Cody Koeninger <c...@koeninger.org> wrote: >

Re: "Got wrong record after seeking to offset" issue

2018-01-17 Thread Cody Koeninger
That means the consumer on the executor tried to seek to the specified offset, but the message that was returned did not have a matching offset. If the executor can't get the messages the driver told it to get, something's generally wrong. What happens when you try to consume the particular

Re: Which kafka client to use with spark streaming

2017-12-26 Thread Cody Koeninger
Do not add a dependency on kafka-clients; the spark-streaming-kafka library has appropriate transitive dependencies. Either version of the spark-streaming-kafka library should work with 1.0 brokers; what problems were you having? On Mon, Dec 25, 2017 at 7:58 PM, Diogo Munaro Vieira

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Cody Koeninger
You can't create a network connection to kafka on the driver and then serialize it to send to the executor. That's likely why you're getting serialization errors. Kafka producers are thread safe and designed for use as a singleton. Use a lazy singleton instance of the producer on the executor,
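
A sketch of that pattern (broker and topic names are placeholders; rdd is assumed to be an RDD[String]):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerSingleton {
      // lazy val: constructed on first use, once per executor JVM, never serialized
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    rdd.foreachPartition { iter =>
      // this closure runs on the executor, so the producer is created there, not on the driver
      iter.foreach(s => ProducerSingleton.producer.send(new ProducerRecord[String, String]("out-topic", s)))
    }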

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
Modern versions of postgres have upsert, i.e. insert into ... on conflict ... do update On Thu, Dec 14, 2017 at 11:26 AM, salemi wrote: > Thank you for your response. > The approach loads just the data into the DB. I am looking for an approach > that allows me to update

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
use foreachPartition(), get a connection from a jdbc connection pool, and insert the data the same way you would in a non-spark program. If you're only doing inserts, postgres COPY will be faster (e.g. https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're doing updates that's not
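
A sketch combining the two suggestions in this thread (table name, jdbc url, and the (id, value) pair stream are all hypothetical; a production job would take the connection from a pool rather than DriverManager):

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        // one connection and one batched statement per partition, on the executor
        val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb") // placeholder
        val stmt = conn.prepareStatement(
          "insert into events (id, val) values (?, ?) on conflict (id) do update set val = excluded.val")
        iter.foreach { case (id, v) => // assumes a DStream[(Long, String)]
          stmt.setLong(1, id)
          stmt.setString(2, v)
          stmt.addBatch()
        }
        stmt.executeBatch()
        conn.close()
      }
    }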

Re: [Spark streaming] No assigned partition error during seek

2017-12-01 Thread Cody Koeninger
d the spark async commit useful for our needs”, do you > mean to say the code like below? > > kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(ranges) > > > > > > Best Regards > > Richard > > > > > > From: venkat <meven...@gma

Re: Kafka version support

2017-11-30 Thread Cody Koeninger
Are you talking about the broker version, or the kafka-clients artifact version? On Thu, Nov 30, 2017 at 12:17 AM, Raghavendra Pandey wrote: > Just wondering if anyone has tried spark structured streaming kafka > connector (2.2) with Kafka 0.11 or Kafka 1.0 version

Re: [Spark streaming] No assigned partition error during seek

2017-11-30 Thread Cody Koeninger
You mentioned 0.11 version; the latest version of org.apache.kafka kafka-clients artifact supported by DStreams is 0.10.0.1, for which it has an appropriate dependency. Are you manually depending on a different version of the kafka-clients artifact? On Fri, Nov 24, 2017 at 7:39 PM, venks61176

Re: Spark Streaming fails with unable to get records after polling for 512 ms

2017-11-17 Thread Cody Koeninger
eam ? > > Thanks > Jagadish > > On Wed, Nov 15, 2017 at 11:23 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> spark.streaming.kafka.consumer.poll.ms is a spark configuration, not >> a kafka parameter. >> >> see http://spark.apache.org/docs/l

Re: Spark Streaming fails with unable to get records after polling for 512 ms

2017-11-15 Thread Cody Koeninger
spark.streaming.kafka.consumer.poll.ms is a spark configuration, not a kafka parameter. see http://spark.apache.org/docs/latest/configuration.html On Tue, Nov 14, 2017 at 8:56 PM, jkagitala wrote: > Hi, > > I'm trying to add spark-streaming to our kafka topic. But, I keep
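
i.e. it goes on the SparkConf, not in the kafka params (the value here is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // poll timeout, in milliseconds, for the cached kafka consumers on the executors
      .set("spark.streaming.kafka.consumer.poll.ms", "10000") // illustrative value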

Re: FW: Kafka Direct Stream - dynamic topic subscription

2017-10-29 Thread Cody Koeninger
As it says in SPARK-10320 and in the docs at http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#consumerstrategies , you can use SubscribePattern On Sun, Oct 29, 2017 at 3:56 PM, Ramanan, Buvana (Nokia - US/Murray Hill) wrote: > Hello
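
e.g. (the regex is a placeholder; kafkaParams and ssc as in the linked docs):

    import java.util.regex.Pattern
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

    // subscribes to every topic matching the regex, including topics created later
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      SubscribePattern[String, String](Pattern.compile("mytopic-.*"), kafkaParams))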

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Cody Koeninger
You should be able to pass a comma-separated string of topics to subscribe. subscribePattern isn't necessary. On Tue, Sep 19, 2017 at 2:54 PM, kant kodali wrote: > got it! Sorry. > > On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote: >> >> Hi, >> >>
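
e.g. (topic and broker names are placeholders; spark is a SparkSession):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "topic1,topic2,topic3") // comma-separated list of topics
      .load()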

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-18 Thread Cody Koeninger
Have you searched in jira, e.g. https://issues.apache.org/jira/browse/SPARK-19185 On Mon, Sep 18, 2017 at 1:56 AM, HARSH TAKKAR wrote: > Hi > > Changing spark version if my last resort, is there any other workaround for > this problem. > > > On Mon, Sep 18, 2017 at 11:43

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-11 Thread Cody Koeninger
If you want an "easy" but not particularly performant way to do it, each org.apache.kafka.clients.consumer.ConsumerRecord has a topic. The topic is going to be the same for the entire partition as long as you haven't shuffled, hence the examples on how to deal with it at a partition level. On
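
A sketch of the partition-level approach (stream is assumed to be a direct stream of ConsumerRecords):

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        // absent a shuffle, every record in this partition has the same topic
        iter.foreach { record =>
          println(s"${record.topic} ${record.partition} ${record.value}")
        }
      }
    }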

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-29 Thread Cody Koeninger
ms) > ) > > val kafkaStreamRdd = kafkaStream.transform { rdd => > rdd.map(consumerRecord => (consumerRecord.key(), consumerRecord.value())) > } > > On Mon, Aug 28, 2017 at 11:56 AM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> There is no

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread Cody Koeninger
xCapacity" -> Integer.valueOf(96), > "spark.streaming.kafka.consumer.cache.maxCapacity" -> Integer.valueOf(96), > > "spark.streaming.kafka.consumer.poll.ms" -> Integer.valueOf(1024), > > > http://markmail.org/message/n4cdxwurlhf44q5x > > https://issues.

Re: Kafka Consumer Pre Fetch Messages + Async commits

2017-08-28 Thread Cody Koeninger
1. No, prefetched message offsets aren't exposed. 2. No, I'm not aware of any plans for sync commit, and I'm not sure that makes sense. You have to be able to deal with repeat messages in the event of failure in any case, so the only difference sync commit would make would be (possibly) slower

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread Cody Koeninger
Why are you setting consumer.cache.enabled to false? On Fri, Aug 25, 2017 at 2:19 PM, SRK wrote: > Hi, > > What would be the appropriate settings to run Spark with Kafka 10? My job > works fine with Spark with Kafka 8 and with Kafka 8 cluster. But its very > slow with

Re: How to force Spark Kafka Direct to start from the latest offset when the lag is huge in kafka 10?

2017-08-22 Thread Cody Koeninger
o.offset.reset" -> "latest",. > > > Thanks! > > On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Yes, you can start from specified offsets. See ConsumerStrategy, >> specifically Assign >> >> >> http:

Re: How to force Spark Kafka Direct to start from the latest offset when the lag is huge in kafka 10?

2017-08-21 Thread Cody Koeninger
Yes, you can start from specified offsets. See ConsumerStrategy, specifically Assign http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#your-own-data-store On Tue, Aug 15, 2017 at 1:18 PM, SRK wrote: > Hi, > > How to force Spark Kafka Direct to
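
e.g. (topic and offsets are placeholders; kafkaParams and ssc as in the linked docs):

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

    // start each partition at an explicit offset rather than committed/latest
    val fromOffsets = Map(
      new TopicPartition("mytopic", 0) -> 123456L, // placeholder offsets
      new TopicPartition("mytopic", 1) -> 123450L
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets))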

Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-09 Thread Cody Koeninger
org.apache.spark.streaming.kafka.KafkaCluster has methods getLatestLeaderOffsets and getEarliestLeaderOffsets On Mon, Aug 7, 2017 at 11:37 PM, shyla deshpande wrote: > Thanks TD. > > On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das > wrote: >>

Re: kafka setting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread Cody Koeninger
saved into Cassandra. It is working as expected 99% of the time except that > when there is an exception, I did not want the offsets to be committed. > > By Filtering for unsuccessful attempts, do you mean filtering the bad > records... > > > > > > > On Mon, Aug 7, 2017

Re: kafka setting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread Cody Koeninger
raised, I was thinking I won't be committing the >> offsets, but the offsets are committed all the time independent of whether >> an exception was raised or not. >> >> It will be helpful if you can explain this behavior. >> >> >> On Sun, Aug 6, 2017 at 5:19

Re: kafka setting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-06 Thread Cody Koeninger
sould be doing, then what is the right way? > > On Sun, Aug 6, 2017 at 9:21 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> If your complaint is about offsets being committed that you didn't >> expect... auto commit being false on executors shouldn't have anythin

Re: kafka setting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-06 Thread Cody Koeninger
If your complaint is about offsets being committed that you didn't expect... auto commit being false on executors shouldn't have anything to do with that. Executors shouldn't be auto-committing, that's why it's being overridden. What you've said and the code you posted isn't really enough to

Re: Spark streaming giving me a bunch of WARNINGS, please help me understand them

2017-07-10 Thread Cody Koeninger
The warnings regarding configuration on the executor are for the executor kafka consumer, not the driver kafka consumer. In general, the executor kafka consumers should consume only exactly the offsets the driver told them to, and not be involved in committing offsets / part of the same group as

Re: The stability of Spark Stream Kafka 010

2017-06-29 Thread Cody Koeninger
Given the emphasis on structured streaming, I don't personally expect a lot more work being put into DStreams-based projects, outside of bugfixes. Stable designation is kind of arbitrary at that point. That 010 version wasn't developed until spark 2.0 timeframe, but you can always try

Re: Spark-Kafka integration - build failing with sbt

2017-06-19 Thread Cody Koeninger
en i provide the > appropriate packages in LibraryDependencies ? > which ones would have helped compile this ? > > > > On Sat, Jun 17, 2017 at 2:53 PM, karan alang <karan.al...@gmail.com> wrote: >> >> Thanks, Cody .. yes, was able to fix that. >>

Re: Spark-Kafka integration - build failing with sbt

2017-06-17 Thread Cody Koeninger
There are different projects for different versions of kafka, spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10 See http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Fri, Jun 16, 2017 at 6:51 PM, karan alang wrote: > I'm trying to compile

Re: Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Cody Koeninger
First thing I noticed, you should be using a singleton kafka producer, not recreating one every partition. It's threadsafe. On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek wrote: > I am facing an issue related to spark streaming with kafka, my use case is as >

Re: do we need to enable writeAheadLogs for DirectStream as well or is it only for indirect stream?

2017-05-02 Thread Cody Koeninger
You don't need write ahead logs for direct stream. On Tue, May 2, 2017 at 11:32 AM, kant kodali wrote: > Hi All, > > I need some fault tolerance for my stateful computations and I am wondering > why we need to enable writeAheadLogs for DirectStream like Kafka (for > Indirect

Re: Exactly-once semantics with kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
t the semantics here - i.e., calls to commitAsync are not > actually guaranteed to succeed? If that's the case, the docs could really > be a *lot* clearer about that. > > Thanks, > > DR > > On Fri, Apr 28, 2017 at 11:34 AM, Cody Koeninger <c...@koeninger.org> wrote:

Re: Exactly-once semantics with kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
From that doc: " However, Kafka is not transactional, so your outputs must still be idempotent. " On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch wrote: > I'm doing a POC to test recovery with spark streaming from Kafka. I'm using > the technique for storing the

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
il the 10th second of > executing the last consumed offset of the same partition was 200.000 - and so > forth. This is the information I seek to get. > >> On 27 Apr 2017, at 20:11, Cody Koeninger <c...@koeninger.org> wrote: >> >> Are you asking for commits for ev

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
sets of the intermediate batches - and hence the questions. > >> On 26 Apr 2017, at 21:42, Cody Koeninger <c...@koeninger.org> wrote: >> >> have you read >> >> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself >>

Re: help/suggestions to setup spark cluster

2017-04-27 Thread Cody Koeninger
r this. Please >> provide me some links to help me get started. >> >> Thanks >> -Anna >> >> On Wed, Apr 26, 2017 at 7:46 PM, Cody Koeninger <c...@koeninger.org> >> wrote: >>> >>> The standalone cluster manager is fine for production. D

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Cody Koeninger
The standalone cluster manager is fine for production. Don't use Yarn or Mesos unless you already have another need for it. On Wed, Apr 26, 2017 at 4:53 PM, anna stax wrote: > Hi Sam, > > Thank you for the reply. > > What do you mean by > I doubt people run spark in a.

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
enable.auto.commit property to false. > > Any idea onto how to use the KafkaConsumer’s auto offset commits? Keep in > mind that I do not care about exactly-once, hence having messages replayed is > perfectly fine. > >> On 26 Apr 2017, at 19:26, Cody Koeninger <c...@

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
What is it you're actually trying to accomplish? You can get topic, partition, and offset bounds from an offset range like http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets Timestamp isn't really a meaningful idea for a range of offsets. On Tue, Apr
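
Per the linked docs, that looks like:

    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // the cast is only valid on the RDD that comes straight from createDirectStream
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }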

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once If you're talking about producing the same message multiple times in a failure situation, keep an eye on

Re: [Spark Streaming+Kafka][How-to]

2017-03-22 Thread Cody Koeninger
Glad you got it worked out. That's cool as long as your use case doesn't actually require e.g. partition 0 to always be scheduled to the same executor across different batches. On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote: > So it worked quite well with a

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-20 Thread Cody Koeninger
You want spark.streaming.kafka.maxRatePerPartition for the direct stream. On Sat, Mar 18, 2017 at 3:37 PM, Mal Edwin wrote: > > Hi, > You can enable backpressure to handle this. > > spark.streaming.backpressure.enabled > spark.streaming.receiver.maxRate > > Thanks, >

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-17 Thread Cody Koeninger
Probably easier if you show some more code, but if you just call dstream.window(Seconds(60)) you didn't specify a slide duration, so it's going to default to your batch duration of 1 second. So yeah, if you're just using e.g. foreachRDD to output every message in the window, every second it's
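
i.e. for non-overlapping 60-second output, pass the slide duration explicitly:

    import org.apache.spark.streaming.Seconds

    // 60s window, 60s slide: each message is output once per minute,
    // instead of ~60 times with the default 1-second slide
    val windowed = dstream.window(Seconds(60), Seconds(60))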

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Cody Koeninger
Spark just really isn't a good fit for trying to pin particular computation to a particular executor, especially if you're relying on that for correctness. On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote: > > Hi all, > > So I need to specify how an executor

Re: [Spark Kafka] API Doc pages for Kafka 0.10 not current

2017-02-28 Thread Cody Koeninger
The kafka-0-8 and kafka-0-10 integrations have conflicting dependencies. Last time I checked, Spark's doc publication puts everything all in one classpath, so publishing them both together won't work. I thought there was already a Jira ticket related to this, but a quick look didn't turn it up.

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-22 Thread Cody Koeninger
mja...@seecs.edu.pk> wrote: > I just noticed that Spark version that I am using (2.0.2) is built with > Scala 2.11. However I am using Kafka 0.8.2 built with Scala 2.10. Could this > be the reason why we are getting this error? > > On Mon, Feb 20, 2017 at 5:50 PM, Cody Koeninger <c

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Cody Koeninger
+ inTime + ", " + outTime) > > } > > } > > })) > > } > > } > > > On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> That's an indication that the beginning offset for a given b

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Cody Koeninger
That's an indication that the beginning offset for a given batch is higher than the ending offset, i.e. something is seriously wrong. Are you doing anything at all odd with topics, i.e. deleting and recreating them, using compacted topics, etc? Start off with a very basic stream over the same

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Not sure what to tell you at that point - maybe compare what is present in ZK to a known working group. On Tue, Feb 14, 2017 at 9:06 PM, Mohammad Kargar <mkar...@phemi.com> wrote: > Yes offset nodes are in zk and I can get the values. > > On Feb 14, 2017 6:54 PM,

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
spark job. On Tue, Feb 14, 2017 at 8:40 PM, Mohammad Kargar <mkar...@phemi.com> wrote: > I'm running 0.10 version and > > ./bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list > > lists the group. > > On Feb 14, 2017 6:34 PM, "Cody Koeninger" <

Re: Write JavaDStream to Kafka (how?)

2017-02-14 Thread Cody Koeninger
It looks like you're creating a kafka producer on the driver, and attempting to write the string representation of stringIntegerJavaPairRDD. Instead, you probably want to be calling stringIntegerJavaPairRDD.foreachPartition, so that producing to kafka is being done on the executor. Read

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Can you explain what wasn't successful and/or show code? On Tue, Feb 14, 2017 at 6:03 PM, Mohammad Kargar wrote: > As explained here, direct approach of integration between spark streaming > and kafka does not update offsets in Zookeeper, hence Zookeeper-based Kafka >

Re: Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread Cody Koeninger
Pretty sure there was no 0.10.0.2 release of apache kafka. If that's a hortonworks modified version you may get better results asking in a hortonworks specific forum. Scala version of kafka shouldn't be relevant either way though. On Wed, Feb 8, 2017 at 5:30 PM, u...@moosheimer.com

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-06 Thread Cody Koeninger
You should not need to include jars for Kafka; the spark connectors have the appropriate transitive dependency on the correct version. On Sat, Feb 4, 2017 at 3:25 PM, Marco Mistroni wrote: > Hi > not sure if this will help at all, and pls take it with a pinch of salt as > i

Re: Kafka dependencies in Eclipse project /Pls assist

2017-01-31 Thread Cody Koeninger
spark-streaming-kafka-0-10 has a transitive dependency on the kafka library, you shouldn't need to include kafka explicitly. What's your actual list of dependencies? On Tue, Jan 31, 2017 at 3:49 PM, Marco Mistroni wrote: > HI all > i am trying to run a sample spark code

Re: mapWithState question

2017-01-30 Thread Cody Koeninger
Keep an eye on https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging although it'll likely be a while On Mon, Jan 30, 2017 at 3:41 PM, Tathagata Das wrote: > If you care about the semantics of those writes to

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
each consumer's coverage and lag status. On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger <c...@koeninger.org> wrote: > When you said " I check the offset ranges from Kafka Manager and don't > see any significant deltas.", what were you comparing it against? The > offset

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
arch. An > another legacy app also writes the same results to a database. There are > huge difference between DB and ES. I know how many records we process daily. > > Everything works fine if I run a job instance for each topic. > > On Tue, Jan 24, 2017 at 5:26 PM, Cody Koeninger <c

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
am as one stream for all topics. I check the offset > ranges from Kafka Manager and don't see any significant deltas. > > On Tue, Jan 24, 2017 at 4:42 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Are you using receiver-based or direct stream? >> >> Are y

Re: Failure handling

2017-01-24 Thread Cody Koeninger
Can you identify the error case and call System.exit ? It'll get retried on another executor, but as long as that one fails the same way... If you can identify the error case at the time you're doing database interaction and just prevent data being written then, that's what I typically do. On

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-23 Thread Cody Koeninger
Are you using receiver-based or direct stream? Are you doing 1 stream per topic, or 1 stream for all topics? If you're using the direct stream, the actual topics and offset ranges should be visible in the logs, so you should be able to see more detail about what's happening (e.g. all topics are

Re: Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Cody Koeninger
Spark 2.2 hasn't been released yet, has it? Python support in kafka dstreams for 0.10 is probably never; there's a jira ticket about this. Stable, hard to say. It was quite a few releases before 0.8 was marked stable, even though it underwent little change. On Wed, Jan 18, 2017 at 2:21 AM,

Re: Kafka 0.8 + Spark 2.0 Partition Issue

2017-01-06 Thread Cody Koeninger
Kafka is designed to only allow reads from leaders. You need to fix this at the kafka level not the spark level. On Fri, Jan 6, 2017 at 7:33 AM, Raghu Vadapalli wrote: > > My spark 2.0 + kafka 0.8 streaming job fails with error partition leaderset > exception. When I

Re: [Spark Kafka] How to update batch size of input dynamically for spark kafka consumer?

2017-01-03 Thread Cody Koeninger
You can't change the batch time, but you can limit the number of items in the batch. See http://spark.apache.org/docs/latest/configuration.html for spark.streaming.backpressure.enabled and spark.streaming.kafka.maxRatePerPartition. On Tue, Jan 3, 2017 at 4:00 AM, 周家帅 wrote: > Hi, > > I am
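
e.g. (the rate value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // hard cap, in records per second per kafka partition
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // illustrative value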

Re: Can't access the data in Kafka Spark Streaming globally

2016-12-23 Thread Cody Koeninger
This doesn't sound like a question regarding Kafka streaming, it sounds like confusion about the scope of variables in spark generally. Is that right? If so, I'd suggest reading the documentation, starting with a simple rdd (e.g. using sparkContext.parallelize), and experimenting to confirm your

Re: Why foreachPartition function make duplicate invocation to map function for every message ? (Spark 2.0.2)

2016-12-16 Thread Cody Koeninger
Please post a minimal complete code example of what you are talking about On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen wrote: > I have the following sequence of Spark Java API calls (Spark 2.0.2): > > Kafka stream that is processed via a map function, which returns

Re: Spark 2 or Spark 1.6.x?

2016-12-12 Thread Cody Koeninger
You certainly can use stable versions of Kafka brokers with spark 2.0.2; why would you think otherwise? On Mon, Dec 12, 2016 at 8:53 AM, Amir Rahnama wrote: > Hi, > > You need to describe more. > > For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka.

Re: Spark Streaming with Kafka

2016-12-12 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream Use a separate group id for each stream, like the docs say. If you're doing multiple output operations, and aren't caching, spark is going to read from kafka again each time, and if some of those
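
Sketch (group id is a placeholder; topics, kafkaParams, and ssc assumed defined as in the docs):

    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // one group id per stream
    val streamA = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
      Subscribe[String, String](topics, kafkaParams + ("group.id" -> "group-a")))

    // cache so the two output operations below don't each re-read from kafka
    streamA.cache()
    streamA.map(r => r.value.length).print()
    streamA.filter(r => r.value.nonEmpty).print()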

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
"org.apache.spark" %% "spark-core" % spark % >> "provided", >> "org.apache.spark" %% "spark-streaming" % spark % >> "provided", >> "org.apache.spark" %%

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
When you say 0.10.1 do you mean broker version only, or does your assembly contain classes from the 0.10.1 kafka consumer? On Fri, Dec 9, 2016 at 10:19 AM, debasishg wrote: > Hello - > > I am facing some issues with the following snippet of code that reads from > Kafka

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
r.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > > On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Personally I thin

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
Personally I think forcing the stream to fail (e.g. check offsets in downstream store and throw exception if they aren't as expected) is the safest thing to do. If you proceed after a failure, you need a place to reliably record the batches that failed for later processing. On Wed, Dec 7, 2016

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Cody Koeninger
You do not need recent versions of spark, kafka, or structured streaming in order to do this. Normal DStreams are sufficient. You can parallelize your static data from the database to an RDD, and there's a join method available on RDDs. Transforming a single given timestamp line into multiple
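
Sketch (assuming the static table fits in memory as key/value pairs and the stream is keyed the same way):

    // static lookup data pulled from the database once, keyed for the join
    val staticRdd = ssc.sparkContext.parallelize(staticRows) // staticRows: Seq[(String, String)]

    // join each batch against the static data
    val joined = dstream.transform(rdd => rdd.join(staticRdd))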

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote: > You need to do the bookkeeping of what has been processed yourself. This > may mean roughly the following (of course the

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Cody Koeninger
If you want finer-grained max rate setting, SPARK-17510 got merged a while ago. There's also SPARK-18580 which might help address the issue of starting backpressure rate for the first batch. On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding wrote: > Hey all, > > Does

Re: Mac vs cluster Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
ither for a YARN cluster or spark standalone) we get this issue. > > Heji > > > On Sat, Nov 19, 2016 at 8:53 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> I ran your example using the versions of kafka and spark you are >> using, against a standalone clust

Re: using StreamingKMeans

2016-11-19 Thread Cody Koeninger
So I haven't played around with streaming k-means at all, but given that no one responded to your message a couple of days ago, I'll say what I can. 1. Can you not sample out some % of the stream for training? 2. Can you run multiple streams at the same time with different values for k and

Re: Kafka direct approach,App UI shows wrong input rate

2016-11-19 Thread Cody Koeninger
There have definitely been issues with UI reporting for the direct stream in the past, but I'm not able to reproduce this with 2.0.2 and 0.8. See below: https://i.imgsafe.org/086019ae57.png On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel wrote: > Hello, > > I use

Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
I ran your example using the versions of kafka and spark you are using, against a standalone cluster. This is what I observed: (in kafka working directory) bash-3.2$ ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 'localhost:9092' --topic simple_logtest --time -2

Re: Kafka segmentation

2016-11-19 Thread Cody Koeninger
you mean exactly? > > On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Ok, I don't think I'm clear on the issue then. Can you say what the >> expected behavior is, and what the observed behavior is? >> >> On Thu,

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
limit the size of > batches, it could be any greater size as it does. > > Thien > > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> If you want a consistent limit on the size of batches, use >> spark.streaming.kafka.maxRatePerP

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
t.reset=largest, but I > don't know what I can do in this case. > > Do you and other ones could suggest me some solutions please as this seems > the normal situation with Kafka+SpartStreaming. > > Thanks. > Alex > > > > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger &

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
ure implementation in Spark Streaming? > > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Moved to user list. >> >> I'm not really clear on what you're trying to accomplish (why put the >> csv file through Kafka instead of read

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Moved to user list. I'm not really clear on what you're trying to accomplish (why put the csv file through Kafka instead of reading it directly with spark?) auto.offset.reset=largest just means that when starting the job without any defined offsets, it will start at the highest (most recent)
