Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
each consumer's coverage and lag status. On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger <c...@koeninger.org> wrote: > When you said " I check the offset ranges from Kafka Manager and don't > see any significant deltas.", what were you comparing it against? The > offset

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
arch. An > another legacy app also writes the same results to a database. There are > huge difference between DB and ES. I know how many records we process daily. > > Everything works fine if I run a job instance for each topic. > > On Tue, Jan 24, 2017 at 5:26 PM, Cody Koeninger <c

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Cody Koeninger
Congrats, glad to hear it On Jan 24, 2017 12:47 PM, "Shixiong(Ryan) Zhu" wrote: > Congrats Burak & Holden! > > On Tue, Jan 24, 2017 at 10:39 AM, Joseph Bradley > wrote: > >> Congratulations Burak & Holden! >> >> On Tue, Jan 24, 2017 at 10:33 AM,

Re: Feedback on MLlib roadmap process proposal

2017-01-24 Thread Cody Koeninger
Totally agree with most of what Sean said, just wanted to give an alternate take on the "maintainers" thing On Tue, Jan 24, 2017 at 10:23 AM, Sean Owen wrote: > There is no such list because there's no formal notion of ownership or > access to subsets of the project. Tracking

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
am as one stream for all topics. I check the offset > ranges from Kafka Manager and don't see any significant deltas. > > On Tue, Jan 24, 2017 at 4:42 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Are you using receiver-based or direct stream? >> >> Are y

Re: Failure handling

2017-01-24 Thread Cody Koeninger
Can you identify the error case and call System.exit ? It'll get retried on another executor, but as long as that one fails the same way... If you can identify the error case at the time you're doing database interaction and just prevent data being written then, that's what I typically do. On

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-23 Thread Cody Koeninger
Are you using receiver-based or direct stream? Are you doing 1 stream per topic, or 1 stream for all topics? If you're using the direct stream, the actual topics and offset ranges should be visible in the logs, so you should be able to see more detail about what's happening (e.g. all topics are

Re: Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Cody Koeninger
Spark 2.2 hasn't been released yet, has it? Python support in kafka dstreams for 0.10 is probably never, there's a jira ticket about this. Stable, hard to say. It was quite a few releases before 0.8 was marked stable, even though it underwent little change. On Wed, Jan 18, 2017 at 2:21 AM,

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826527#comment-15826527 ] Cody Koeninger commented on SPARK-19185: I'd expect setting cache capacity to zero to cause

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-15 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823164#comment-15823164 ] Cody Koeninger commented on SPARK-19185: This is a good error report, sorry it's taken me a while

Re: Kafka 0.8 + Spark 2.0 Partition Issue

2017-01-06 Thread Cody Koeninger
Kafka is designed to only allow reads from leaders. You need to fix this at the kafka level not the spark level. On Fri, Jan 6, 2017 at 7:33 AM, Raghu Vadapalli wrote: > > My spark 2.0 + kafka 0.8 streaming job fails with error partition leaderset > exception. When I

Re: Spark Improvement Proposals

2017-01-03 Thread Cody Koeninger
g some items mentioned above + a new one >> w.r.t. Reynold's draft >> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#> >> : >> * Reinstate the "Where" section with links to current and past SIPs >> * Add field for statin

Re: [Spark Kafka] How to update batch size of input dynamically for spark kafka consumer?

2017-01-03 Thread Cody Koeninger
You can't change the batch time, but you can limit the number of items in the batch http://spark.apache.org/docs/latest/configuration.html spark.streaming.backpressure.enabled spark.streaming.kafka.maxRatePerPartition On Tue, Jan 3, 2017 at 4:00 AM, 周家帅 wrote: > Hi, > > I am
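The cap those two settings impose is simple arithmetic: records per batch is bounded by the per-partition rate times the partition count times the batch interval. A minimal sketch (the function name and numbers are illustrative, not part of the Spark API):

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_secs):
    """Upper bound on records in one micro-batch when
    spark.streaming.kafka.maxRatePerPartition (records/sec/partition) is set."""
    return max_rate_per_partition * num_partitions * batch_interval_secs

# e.g. 1000 records/sec/partition, 8 partitions, 5-second batches
print(max_records_per_batch(1000, 8, 5))  # 40000
```

With backpressure enabled, Spark adjusts the actual rate downward from this ceiling based on scheduling delay; the setting is the hard limit, not the steady-state rate.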

Re: Can't access the data in Kafka Spark Streaming globally

2016-12-23 Thread Cody Koeninger
This doesn't sound like a question regarding Kafka streaming, it sounds like confusion about the scope of variables in spark generally. Is that right? If so, I'd suggest reading the documentation, starting with a simple rdd (e.g. using sparkContext.parallelize), and experimenting to confirm your

Re: Why foreachPartition function make duplicate invocation to map function for every message ? (Spark 2.0.2)

2016-12-16 Thread Cody Koeninger
Please post a minimal complete code example of what you are talking about On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen wrote: > I have the following sequence of Spark Java API calls (Spark 2.0.2): > > Kafka stream that is processed via a map function, which returns

Re: Spark 2 or Spark 1.6.x?

2016-12-12 Thread Cody Koeninger
You certainly can use stable versions of Kafka brokers with spark 2.0.2, why would you think otherwise? On Mon, Dec 12, 2016 at 8:53 AM, Amir Rahnama wrote: > Hi, > > You need to describe more. > > For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka.

Re: Spark Streaming with Kafka

2016-12-12 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream Use a separate group id for each stream, like the docs say. If you're doing multiple output operations, and aren't caching, spark is going to read from kafka again each time, and if some of those

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2016-12-12 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15742285#comment-15742285 ] Cody Koeninger commented on SPARK-17147: If compacted topics are important to you, then you

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Cody Koeninger
Agree that frequent topic deletion is not a very Kafka-esque thing to do On Fri, Dec 9, 2016 at 12:09 PM, Shixiong(Ryan) Zhu wrote: > Sean, "stress test for failOnDataLoss=false" is because Kafka consumer may > be thrown NPE when a topic is deleted. I added some logic to

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
"org.apache.spark" %% "spark-core" % spark % >> "provided", >> "org.apache.spark" %% "spark-streaming" % spark % >> "provided", >> "org.apache.spark" %%

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
When you say 0.10.1 do you mean broker version only, or does your assembly contain classes from the 0.10.1 kafka consumer? On Fri, Dec 9, 2016 at 10:19 AM, debasishg wrote: > Hello - > > I am facing some issues with the following snippet of code that reads from > Kafka

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
r.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > > On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Personally I thin

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
Personally I think forcing the stream to fail (e.g. check offsets in downstream store and throw exception if they aren't as expected) is the safest thing to do. If you proceed after a failure, you need a place to reliably record the batches that failed for later processing. On Wed, Dec 7, 2016
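The "check offsets in downstream store and throw" idea can be sketched in plain Python (names and the partition-key shape are hypothetical; in a real job the expected offsets would come from your database and the batch offsets from the stream's OffsetRange):

```python
class OffsetMismatch(Exception):
    """Raised to deliberately fail the stream instead of proceeding past a bad batch."""
    pass

def check_offsets(expected, batch_from_offsets):
    """Compare offsets recorded in the downstream store against the starting
    offsets of the current batch; fail fast on any gap or overlap."""
    for tp, offset in batch_from_offsets.items():
        if expected.get(tp) != offset:
            raise OffsetMismatch(
                f"partition {tp}: store has {expected.get(tp)}, batch starts at {offset}")

# stored offsets line up with the batch's starting offsets -> no exception
check_offsets({("topic", 0): 42}, {("topic", 0): 42})
```

Failing loudly here is what makes a restart safe: the job resumes from the stored offsets rather than silently skipping the batch that failed.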

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2016-12-07 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729055#comment-15729055 ] Cody Koeninger commented on SPARK-17147: This ticket is about createDirectStream. The question

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Cody Koeninger
You do not need recent versions of spark, kafka, or structured streaming in order to do this. Normal DStreams are sufficient. You can parallelize your static data from the database to an RDD, and there's a join method available on RDDs. Transforming a single given timestamp line into multiple
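The join Cody describes mirrors `rdd.join(staticRdd)` on keyed pairs; a plain-Python analogue of the semantics (the data shapes here are illustrative):

```python
def join_with_static(batch, static):
    """Inner-join (key, value) records from a streaming batch against a static
    lookup table loaded once from the database -- same semantics as
    joining two keyed RDDs."""
    return [(k, (v, static[k])) for k, v in batch if k in static]

static = {"user1": "US", "user2": "DE"}   # parallelized from the DB once
batch = [("user1", 10), ("user3", 7)]     # one micro-batch from the stream
print(join_with_static(batch, static))    # [('user1', (10, 'US'))]
```

Unmatched keys drop out, as with an inner join; use a lookup with a default instead if you need left-outer behavior.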

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote: > You need to do the book keeping of what has been processed yourself. This > may mean roughly the following (of course the

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Cody Koeninger
If you want finer-grained max rate setting, SPARK-17510 got merged a while ago. There's also SPARK-18580 which might help address the issue of starting backpressure rate for the first batch. On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding wrote: > Hey all, > > Does

[jira] [Commented] (SPARK-18682) Batch Source for Kafka

2016-12-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720146#comment-15720146 ] Cody Koeninger commented on SPARK-18682: Isn't this a duplicate of https://issues.apache.org/jira

[jira] [Issue Comment Deleted] (SPARK-18682) Batch Source for Kafka

2016-12-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-18682: --- Comment: was deleted (was: Isn't this a duplicate of https://issues.apache.org/jira/browse

[jira] [Commented] (SPARK-18682) Batch Source for Kafka

2016-12-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720145#comment-15720145 ] Cody Koeninger commented on SPARK-18682: Isn't this a duplicate of https://issues.apache.org/jira

[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-12-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712457#comment-15712457 ] Cody Koeninger commented on SPARK-18506: Yes, amazon linux. No, not spark-ec2, just a spark

[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706814#comment-15706814 ] Cody Koeninger commented on SPARK-18475: Glad you agree it shouldn't be enabled by default

[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706758#comment-15706758 ] Cody Koeninger commented on SPARK-18475: Burak hasn't empirically shown that it is of benefit

[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-28 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704372#comment-15704372 ] Cody Koeninger commented on SPARK-18506: 1 x spark master is m3 medium 2 x spark workers are m3

[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-23 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691386#comment-15691386 ] Cody Koeninger commented on SPARK-18506: I'm really confused by that - did you try a completely

[jira] [Commented] (SPARK-18525) Kafka DirectInputStream cannot be aware of new partition

2016-11-23 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690915#comment-15690915 ] Cody Koeninger commented on SPARK-18525: Easiest thing to do is just restart your streaming job

[jira] [Commented] (SPARK-18525) Kafka DirectInputStream cannot be aware of new partition

2016-11-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15686746#comment-15686746 ] Cody Koeninger commented on SPARK-18525: 0.8 works only against defined partitions. Use 0.10

[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685190#comment-15685190 ] Cody Koeninger commented on SPARK-18506: I'd try to isolate aws vs gce as a possible cause before

[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-20 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682216#comment-15682216 ] Cody Koeninger commented on SPARK-18506: I tried your example code on an AWS 2-node spark

[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-20 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682171#comment-15682171 ] Cody Koeninger commented on SPARK-18475: An iterator certainly does have an ordering guarantee

Re: Mac vs cluster Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
ither for a YARN cluster or spark standalone) we get this issue. > > Heji > > > On Sat, Nov 19, 2016 at 8:53 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> I ran your example using the versions of kafka and spark you are >> using, against a standalone clust

Re: using StreamingKMeans

2016-11-19 Thread Cody Koeninger
So I haven't played around with streaming k means at all, but given that no one responded to your message a couple of days ago, I'll say what I can. 1. Can you not sample out some % of the stream for training? 2. Can you run multiple streams at the same time with different values for k and

Re: Kafka direct approach,App UI shows wrong input rate

2016-11-19 Thread Cody Koeninger
There have definitely been issues with UI reporting for the direct stream in the past, but I'm not able to reproduce this with 2.0.2 and 0.8. See below: https://i.imgsafe.org/086019ae57.png On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel wrote: > Hello, > > I use

Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
I ran your example using the versions of kafka and spark you are using, against a standalone cluster. This is what I observed: (in kafka working directory) bash-3.2$ ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 'localhost:9092' --topic simple_logtest --time -2

Re: Kafka segmentation

2016-11-19 Thread Cody Koeninger
you mean exactly? > > On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Ok, I don't think I'm clear on the issue then. Can you say what the >> expected behavior is, and what the observed behavior is? >> >> On Thu,

[jira] [Comment Edited] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-19 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679472#comment-15679472 ] Cody Koeninger edited comment on SPARK-18475 at 11/19/16 4:02 PM: -- Yes

[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-19 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679472#comment-15679472 ] Cody Koeninger commented on SPARK-18475: Yes, an RDD does have an ordering guarantee, it's

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
limit the size of > batches, it could be any greater size as it does. > > Thien > > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> If you want a consistent limit on the size of batches, use >> spark.streaming.kafka.maxRatePerP

[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-17 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674459#comment-15674459 ] Cody Koeninger commented on SPARK-18475: This has come up several times, and my answer

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
t.reset=largest, but I > don't know what I can do in this case. > > Do you and other ones could suggest me some solutions please as this seems > the normal situation with Kafka+SpartStreaming. > > Thanks. > Alex > > > > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger &

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
ure implementation in Spark Streaming? > > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Moved to user list. >> >> I'm not really clear on what you're trying to accomplish (why put the >> csv file through Kafka instead of read

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Moved to user list. I'm not really clear on what you're trying to accomplish (why put the csv file through Kafka instead of reading it directly with spark?) auto.offset.reset=largest just means that when starting the job without any defined offsets, it will start at the highest (most recent)

Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
tion says : Input streams that can generate > RDDs from new data by running a service/thread only on the driver node (that > is, without running a receiver on worker nodes) > > Thanks and regards, > Aakash Pradeep > > > On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger <

Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
It'd probably be worth no longer marking the 0.8 interface as experimental. I don't think it's likely to be subject to active development at this point. You can use the 0.8 artifact to consume from a 0.9 broker Where are you reading documentation indicating that the direct stream only runs on

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-13 Thread Cody Koeninger
titions) are being handled by the executors at any given time? > > Thanks, > > Ivan > > On Sat, Nov 12, 2016 at 1:25 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> You should not be getting consumer churn on executors at all, that's >> the whole

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
s and stages are all successful and even the speculation > is turned off . > > On Sat, Nov 12, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Are you certain you aren't getting any failed tasks or other errors? >> Output actions like foreach aren't ex

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-12 Thread Cody Koeninger
c and a consumer >> group that is made from the topic (each RDD uses a different topic and >> group), process the data and write to Parquet files. >> >> Per my Nov 10th post, we still get polling timeouts unless the poll.ms is >> set to something like 10 seconds. We also

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
Are you certain you aren't getting any failed tasks or other errors? Output actions like foreach aren't exactly once and will be retried on failures. On Nov 12, 2016 06:36, "dev loper" wrote: > Dear fellow Spark Users, > > My Spark Streaming application (Spark 2.0 , on AWS
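Because a retried output action replays its writes, the usual fix is to make each write idempotent, keyed deterministically so a replay overwrites rather than duplicates. A toy sketch (the store and key shape are hypothetical):

```python
def idempotent_write(store, key, value):
    """Write keyed on a deterministic (batch, record-key) pair: replaying the
    same task overwrites the same entry, so at-least-once foreach stays correct."""
    store[key] = value

store = {}
for _ in range(3):  # simulate the same foreach task retried three times
    idempotent_write(store, ("batch-7", "user1"), 42)
print(len(store))  # 1 -- retries did not create duplicates
```

The same reasoning explains duplicate output after a reduceByKey: the shuffle result is deduplicated, but a non-idempotent sink downstream of it is not.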

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-11 Thread Cody Koeninger
e() >> } >> >> I am not sure why the concurrent issue is there as I have tried to debug >> and also looked at the KafkaConsumer code as well, but everything looks >> like it should not occur. The things to figure out is why when running in >> parallel do

[jira] [Commented] (SPARK-18386) Batch mode SQL source for Kafka

2016-11-10 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654308#comment-15654308 ] Cody Koeninger commented on SPARK-18386: That should work. There may be dependency conflicts

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0

2016-11-10 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654295#comment-15654295 ] Cody Koeninger commented on SPARK-18057: I definitely do not want another copy-paste situation

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread Cody Koeninger
The basic structured streaming source for Kafka is already committed to master, build it and try it out. If you're already using Kafka I don't really see much point in trying to put Akka in between it and Spark. On Nov 10, 2016 02:25, "vincent gromakowski" wrote:

Re: Connectors using new Kafka consumer API

2016-11-09 Thread Cody Koeninger
<m...@apache.org> wrote: > I think they are open to others helping, in fact, more than one person has > worked on the JIRA so far. And, it's been crawling really slowly and that's > preventing adoption of Spark's new connector in secure Kafka environments. > > On Tue, Nov 8, 2016 at 7:59

[jira] [Created] (SPARK-18386) Batch mode SQL source for Kafka

2016-11-09 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-18386: -- Summary: Batch mode SQL source for Kafka Key: SPARK-18386 URL: https://issues.apache.org/jira/browse/SPARK-18386 Project: Spark Issue Type: Improvement

Re: Connectors using new Kafka consumer API

2016-11-08 Thread Cody Koeninger
Have you asked the assignee on the Kafka jira whether they'd be willing to accept help on it? On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover wrote: > Hi all, > We currently have a new direct stream connector, thanks to work by Cody and > others on SPARK-12177. > > However, that

[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649638#comment-15649638 ] Cody Koeninger commented on SPARK-18371: Thanks for digging into this. The other thing I noticed

Re: Spark Improvement Proposals

2016-11-08 Thread Cody Koeninger
> Oops. Let me try figure that out. >> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote: >>> >>> Thanks for picking up on this. >>> >>> Maybe I fail at google docs, but I can't see any edits on the documen

Re: Kafka stream offset management question

2016-11-08 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html specifically http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets Have you set enable.auto.commit to false? The new consumer stores offsets in kafka, so the idea of specifically
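With enable.auto.commit set to false, the ordering the docs describe is: finish the batch's output first, then commit the ending offsets (commitAsync in the real API). A minimal sketch of that control flow (names and the batch shape are illustrative):

```python
def run_stream(batches, sink, committed):
    """Process each batch fully, then record its ending offset.
    Committing only after output gives at-least-once semantics: a crash
    between the two steps replays the batch rather than losing it."""
    for end_offset, records in batches:
        sink.extend(records)          # output action completes first
        committed.append(end_offset)  # commitAsync analogue, strictly after output

sink, committed = [], []
run_stream([(2, ["a", "b"]), (4, ["c", "d"])], sink, committed)
print(committed)  # [2, 4]
```

Reversing the two lines would flip the guarantee to at-most-once: a crash after commit but before output silently drops the batch.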

Re: Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-07 Thread Cody Koeninger
I may be misunderstanding, but you need to take each kafka message, and turn it into multiple items in the transformed rdd? so something like (pseudocode): stream.flatMap { message => val items = new ArrayBuffer var parser = null message.split("\n").foreach { line => if // it's a
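The flatMap pseudocode above can be made concrete in plain Python; the `#format:` header marker is hypothetical, standing in for whatever line announces a format change within a message:

```python
def split_messages(messages):
    """Turn each multi-line Kafka message into multiple records, switching
    parsers when a header line announces a new format -- a flatMap analogue."""
    out = []
    for message in messages:
        parser = None
        for line in message.split("\n"):
            if line.startswith("#format:"):          # hypothetical header marker
                parser = line.removeprefix("#format:")
            elif parser is not None:
                out.append((parser, line))           # record tagged with its format
    return out

print(split_messages(["#format:v1\na\nb\n#format:v2\nc"]))
# [('v1', 'a'), ('v1', 'b'), ('v2', 'c')]
```

Note the parser state is local to one message, so this stays safe to run per-partition; state spanning messages would need mapWithState or similar instead.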

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Cody Koeninger
ed to set poll.ms to 30 seconds and still get the issue. > Something is not right here and just not seem right. As I mentioned with the > streaming application, with Spark 1.6 and Kafka 0.8.x we never saw this > issue. We have been running the same basic logic for over a year now without >

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Cody Koeninger
ce > spark.task.maxFailures is reached? Has anyone else seen this behavior of a > streaming application skipping over failed microbatches? > > Thanks, > Sean > > >> On Nov 4, 2016, at 2:48 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> So basicall

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Cody Koeninger
nice) would also be nice, >>> but that can be done at any time. >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate >>> for a SIP, given the scope of the work. The document attached even >>> somewhat matches the proposed format. So

Re: expected behavior of Kafka dynamic topic subscription

2016-11-07 Thread Cody Koeninger
<hw...@qilinsoft.com> wrote: > Cody, thanks for the response. Do you think it's a Spark issue or Kafka > issue? Can you please let me know the jira ticket number? > > -Original Message- > From: Cody Koeninger [mailto:c...@koeninger.org] > Sent: 2016年11月4日 22:35 >

[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638555#comment-15638555 ] Cody Koeninger commented on SPARK-18258: Sure, added, let me know if I'm missing something or can

[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-18258: --- Description: Transactional "exactly-once" semantics for output require storing

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
e since Spark now requires one to assign or subscribe to a topic in > order to even update the offsets. In 0.8.2.x you did not have to worry about > that. This approach limits your exposure to duplicate data since idempotent > records are not entirely possible in our scenario. At least

[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637621#comment-15637621 ] Cody Koeninger commented on SPARK-18258: So one obvious one is that if wherever checkpoint data

[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637576#comment-15637576 ] Cody Koeninger commented on SPARK-18258: The sink doesn't have to reason about equality

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
So just to be clear, the answers to my questions are - you are not using different group ids, you're using the same group id everywhere - you are committing offsets manually Right? If you want to eliminate network or kafka misbehavior as a source, tune poll.ms upwards even higher. You must

Anyone want to weigh in on a Kafka DStreams api change?

2016-11-04 Thread Cody Koeninger
SPARK-17510 https://github.com/apache/spark/pull/15132 It's for allowing tweaking of rate limiting on a per-partition basis - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Continuous warning while consuming using new kafka-spark010 API

2016-11-04 Thread Cody Koeninger
I answered the duplicate post on the user mailing list; I'd say keep the discussion there. On Fri, Nov 4, 2016 at 12:14 PM, vonnagy wrote: > Nitin, > > I am getting the similar issues using Spark 2.0.1 and Kafka 0.10. I have two > jobs, one that uses a Kafka stream and one that

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
- are you using different group ids for the different streams? - are you manually committing offsets? - what are the values of your kafka-related settings? On Fri, Nov 4, 2016 at 12:20 PM, vonnagy wrote: > I am getting the issues using Spark 2.0.1 and Kafka 0.10. I have two jobs,

[jira] [Created] (SPARK-18272) Test topic addition for subscribePattern on Kafka DStream and Structured Stream

2016-11-04 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-18272: -- Summary: Test topic addition for subscribePattern on Kafka DStream and Structured Stream Key: SPARK-18272 URL: https://issues.apache.org/jira/browse/SPARK-18272

Re: expected behavior of Kafka dynamic topic subscription

2016-11-04 Thread Cody Koeninger
That's not what I would expect from the underlying kafka consumer, no. But this particular case (no matching topics, then add a topic after SubscribePattern stream starts) actually isn't part of unit tests for either the DStream or the structured stream. I'll make a jira ticket. On Thu, Nov 3,

[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2016-11-03 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-18258: --- Description: Transactional "exactly-once" semantics for output require storing

[jira] [Created] (SPARK-18258) Sinks need access to offset representation

2016-11-03 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-18258: -- Summary: Sinks need access to offset representation Key: SPARK-18258 URL: https://issues.apache.org/jira/browse/SPARK-18258 Project: Spark Issue Type

[jira] [Commented] (SPARK-17938) Backpressure rate not adjusting

2016-11-02 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629332#comment-15629332 ] Cody Koeninger commented on SPARK-17938: Direct stream isn't a receiver, receiver settings don't

Re: Handling questions in the mailing lists

2016-11-02 Thread Cody Koeninger
So concrete things people could do - users could tag subject lines appropriately to the component they're asking about - contributors could monitor user@ for tags relating to components they've worked on. I'd be surprised if my miss rate for any mailing list questions well-labeled as Kafka was

[jira] [Commented] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627838#comment-15627838 ] Cody Koeninger commented on SPARK-18212: So here's a heavily excerpted version of what I see

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
t time windowed > aggregation for several weeks now. > > On Tue, Nov 1, 2016 at 6:18 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Look at the resolved subtasks attached to that ticket you linked. >> Some of them are unresolved, but basic functionality is there. >&

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Cody Koeninger
Look at the resolved subtasks attached to that ticket you linked. Some of them are unresolved, but basic functionality is there. On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande wrote: > Hi Michael, > > Thanks for the reply. > > The following link says there is an open

[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-11-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626774#comment-15626774 ] Cody Koeninger commented on SPARK-17935: Some other things to think about: - are there any

Re: JIRA Components for Streaming

2016-10-31 Thread Cody Koeninger
Makes sense to me. I do wonder if e.g. [SPARK-12345][STRUCTUREDSTREAMING][KAFKA] is going to leave any room in the Github PR form for actual title content? On Mon, Oct 31, 2016 at 1:37 PM, Michael Armbrust wrote: > I'm planning to do a little maintenance on JIRA to

Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Cody Koeninger
h them, my mail was just to show some > aspects from my side, so from the side of a developer and person who is trying > to help others with Spark (via StackOverflow or other ways) > > > Pozdrawiam / Best regards, > > Tomasz > > > > From: Cody K

Re: MapWithState partitioning

2016-10-31 Thread Cody Koeninger
> > > The reason I ask is that it simply looks strange to me that Spark will have > to shuffle each time my input stream and "state" stream during the > mapWithState operation when I know for sure that those two streams will > always share same keys and will not need ac

Re: MapWithState partitioning

2016-10-31 Thread Cody Koeninger
If you call a transformation on an RDD using the same partitioner as that RDD, no shuffle will occur. KafkaRDD doesn't have a partitioner; there's no consistent partitioning scheme that works for all Kafka uses. You can wrap each KafkaRDD with an RDD that has a custom partitioner that you write
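The point above can be shown without Spark at all: when the input stream and the state data agree on a single partitioner, records with equal keys always land in the same partition index, so state can be joined with new input partition-by-partition with no shuffle. A plain-Python sketch, with all names illustrative:

```python
def partitioner(key, num_partitions):
    # A deterministic hash partitioner, playing the role of the custom
    # partitioner you would wrap around each KafkaRDD.
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    """Bucket (key, value) pairs into partitions via the shared partitioner."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partitioner(key, num_partitions)].append((key, value))
    return parts

# Incoming stream batch and existing state, partitioned the same way.
stream_parts = partition_records([("user1", "click"), ("user2", "view")], 4)
state_parts = partition_records([("user1", 10), ("user2", 3)], 4)

# Every key's stream records and state records share a partition index,
# so a local per-partition join would suffice -- no data movement needed.
co_located = all(
    {k for k, _ in stream_parts[i]} == {k for k, _ in state_parts[i]}
    for i in range(4)
)
```

Because both datasets ran through the same partitioner, `co_located` holds; this is the property a co-partitioned mapWithState-style operation relies on to skip the shuffle.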

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Cody Koeninger
to operationalize > topic creation. Not a big deal but now we'll have to make sure we execute > the 'create-topics' type of task or shell script at install time. > > This seems like a Kafka doc issue potentially, to explain what exactly one > can expect from the auto.create.topics.en

[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-27 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612663#comment-15612663 ] Cody Koeninger commented on SPARK-17935: So the main thing to point out is that Kafka producers

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Cody Koeninger
application is running and the files are created right. But as soon as I >>> restart the application, it goes back to fromOffset as 0. Any thoughts? >>> >>> regards >>> Sunita >>> >>> On Tue, Oct 25, 2016 at 1:52 PM, Sunita Arvind <sunitarv...@gmail.c
