Re: spark streaming kafka connector questions

2016-09-10 Thread Cody Koeninger
Hard to say without seeing the code, but if you do multiple actions on an RDD without caching, the RDD will be computed multiple times. On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote: After some investigation, the problem I see is likely caused by a filter and union of the
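As an illustration of the caching point, a minimal sketch (variable and transformation names hypothetical) of persisting an RDD before running two actions on it:

    val parsed = rawRdd.filter(_.nonEmpty).map(parse) // hypothetical pipeline
    parsed.cache()                  // computed once, reused by both actions below
    val total = parsed.count()      // first action
    parsed.saveAsTextFile(outDir)   // second action: no recomputation
    parsed.unpersist()              // release when done

Without the cache() call, each action replays the filter/map lineage from the source.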

Re: Streaming Backpressure with Multiple Streams

2016-09-09 Thread Cody Koeninger
Does the same thing happen if you're only using direct stream plus back pressure, not the receiver stream? On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote: > Maybe this is a pretty esoteric implementation, but I'm seeing some bad > behavior with backpressure plus multiple Kafka

Re: spark streaming kafka connector questions

2016-09-08 Thread Cody Koeninger
- If you're seeing repeated attempts to process the same message, you should be able to look in the UI or logs and see that a task has failed. Figure out why that task failed before chasing other things. - You're not using the latest version; the latest version is for Spark 2.0. There are two

Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Cody Koeninger
To be clear, "safe" has very little to do with this. It's pretty clear that there's very little risk of the spark module for kinesis being considered a derivative work, much less all of spark. The use limitation in 3.3 that caused the amazon license to be put on the apache X list also doesn't

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
ask how big an overhead is that? > > It happens intermittently and each time it happens retry is successful. > > Srikanth > > On Wed, Sep 7, 2016 at 3:55 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> That's not what I would have expected to happen with a

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
k.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > at java.util.concurrent.ThreadPoolExecuto

Re: Spark 2.0 with Kafka 0.10 exception

2016-09-07 Thread Cody Koeninger
dy. Setting poll timeout helped. >> Our network is fine but brokers are not fully provisioned in test cluster. >> But there isn't enough load to max out on broker capacity. >> Curious that kafkacat running on the same node doesn't have any issues. >> >> Srikanth >

Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Cody Koeninger
I don't see a reason to remove the non-assembly artifact, why would you? You're not distributing copies of Amazon licensed code, and the Amazon license goes out of its way not to over-reach regarding derivative works. This seems pretty clearly to fall in the spirit of

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Cody Koeninger
The restart doesn't have to be all that > silent. It requires us to set a flag. I thought auto.offset.reset is that > flag. > But there isn't much I can do at this point given that retention has cleaned > things up. > The app has to start. Let admins address the data loss on the side. >

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-07 Thread Cody Koeninger
tion instead of honoring > auto.offset.reset. > It isn't a corner case where retention expired after driver created a batch. > Its easily reproducible and consistent. > > On Tue, Sep 6, 2016 at 3:34 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> You don't

Re: Q: Multiple spark streaming app, one kafka topic, same consumer group

2016-09-06 Thread Cody Koeninger
In general, see the material linked from https://github.com/koeninger/kafka-exactly-once if you want a better understanding of the direct stream. For spark-streaming-kafka-0-8, the direct stream doesn't really care about consumer group, since it uses the simple consumer. For the 0.10 version,
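For reference, a minimal sketch (broker, topic, and group names hypothetical) of the 0.10 direct stream, where group.id is passed straight through to the underlying new Kafka consumer:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-streaming-app",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )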

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Cody Koeninger
on was why I got the exception instead of it using > auto.offset.reset on restart? > > > > > On Tue, Sep 6, 2016 at 10:48 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> If you leave enable.auto.commit set to true, it will commit offsets to >>

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Cody Koeninger
> enable.auto.commit = true > auto.offset.reset = latest > > Srikanth > > On Sat, Sep 3, 2016 at 8:59 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Seems like you're confused about the purpose of that line of code, it >> applies to executors, not the driver.

Re: Committing Kafka offsets when using DirectKafkaInputDStream

2016-09-03 Thread Cody Koeninger
The Kafka commit api isn't transactional; you aren't going to get exactly once behavior out of it even if you were committing offsets on a per-partition basis. This doesn't really have anything to do with Spark; the old code you posted was already inherently broken. Make your outputs idempotent
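To make the idempotence suggestion concrete, a sketch assuming a PostgreSQL-style upsert and hypothetical table/column names, so a replayed message overwrites rather than duplicates:

    rdd.foreachPartition { iter =>
      val conn = java.sql.DriverManager.getConnection(jdbcUrl) // hypothetical url
      val stmt = conn.prepareStatement(
        """INSERT INTO events (event_id, payload) VALUES (?, ?)
          |ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload""".stripMargin)
      iter.foreach { case (id, payload) =>
        stmt.setString(1, id)
        stmt.setString(2, payload)
        stmt.executeUpdate() // replaying the same event_id is harmless
      }
      stmt.close()
      conn.close()
    }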

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-03 Thread Cody Koeninger
r > temporarily excluding partitions is there any way I can supply > topic-partition info on the fly at the beginning of every pull dynamically. > Will streaminglistener be of any help? > > On Fri, Sep 2, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org> > wrote: > >

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread Cody Koeninger
If you just want to pause the whole stream, just stop the app and then restart it when you're ready. If you want to do some type of per-partition manipulation, you're going to need to write some code. The 0.10 integration makes the underlying kafka consumer pluggable, so you may be able to wrap
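For the per-partition case, one possible starting point (topic name hypothetical) is pinning the 0.10 direct stream to an explicit set of partitions via the Assign strategy:

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010._

    val partitions = Seq(new TopicPartition("my-topic", 0), new TopicPartition("my-topic", 2))
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Assign[String, String](partitions, kafkaParams)
    )

Changing the assignment still means restarting the context; truly dynamic per-batch selection means wrapping the consumer yourself.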

Re: how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?

2016-09-01 Thread Cody Koeninger
Why not just use different files for Kafka? Nothing else in Spark should be using those Kafka configuration parameters. On Thu, Sep 1, 2016 at 3:26 AM, Eric Ho wrote: > I'm interested in what I should put into the trustStore file, not just for > Spark but also for Kafka

[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453789#comment-15453789 ] Cody Koeninger edited comment on SPARK-15406 at 9/1/16 2:26 AM: There's

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453789#comment-15453789 ] Cody Koeninger commented on SPARK-15406: There's a big difference between continuing to publish

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453324#comment-15453324 ] Cody Koeninger commented on SPARK-15406: If people want to use older versions of kafka, why

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453246#comment-15453246 ] Cody Koeninger commented on SPARK-15406: Yes. > Structured streaming support for consuming f

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453124#comment-15453124 ] Cody Koeninger commented on SPARK-15406: I don't think it makes sense to try and support multiple

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
; Warden Ave > Markham, ON L6G 1C7 > Canada > > > > - Original message - > From: Cody Koeninger <c...@koeninger.org> > To: Eric Ho <e...@analyticsmd.com> > Cc: "user@spark.apache.org" <user@spark.apache.org> > Subject: Re: Spark to

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
Encryption is only available in spark-streaming-kafka-0-10, not 0-8. You enable it the same way you enable it for the Kafka project's new consumer, by setting kafka configuration parameters appropriately. http://kafka.apache.org/documentation.html#security_ssl On Wed, Aug 31, 2016 at 2:03 AM,
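Concretely, a sketch (paths and passwords hypothetical) of the standard Kafka new-consumer SSL settings, passed as ordinary kafkaParams to the 0-10 integration:

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9093",
      "security.protocol" -> "SSL",
      "ssl.truststore.location" -> "/etc/kafka/client.truststore.jks",
      "ssl.truststore.password" -> "changeit",
      "ssl.keystore.location" -> "/etc/kafka/client.keystore.jks",
      "ssl.keystore.password" -> "changeit"
    )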

Re: Model abstract class in spark ml

2016-08-31 Thread Cody Koeninger
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/ On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen wrote: > Weird, I recompiled Spark with a similar change to Model and it seemed > to work but maybe I missed a step in there. > > On Wed, Aug 31, 2016 at 6:33 AM,

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Cody Koeninger
> wrote: >>> >>> thats great >>> >>> is this effort happening anywhere that is publicly visible? github? >>> >>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin <r...@databricks.com> wrote: >>>> >>>> We (the team at

[jira] [Commented] (SPARK-17280) Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and JavaDirectKafkaStreamSuite.testKafkaStream

2016-08-27 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15441981#comment-15441981 ] Cody Koeninger commented on SPARK-17280: I can take a look but there's not a lot to go

Re: Kafka message metadata with Dstreams

2016-08-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/api/java/index.html messageHandler receives a kafka MessageAndMetadata object. Alternatively, if you just need metadata information on a per-partition basis, you can use HasOffsetRanges
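A minimal sketch of the HasOffsetRanges approach (stream name hypothetical); the cast is only valid on the RDD handed directly to foreachRDD by the direct stream:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }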

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Cody Koeninger
stable scenarios (e.g. > advertised hostname failures on EMR). > > Maelstrom will work I believe even for Spark 1.3 and Kafka 0.8.2.1 (and of > course with the latest Kafka 0.10 as well) > > > On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <c...@koeninger.org> wrote: >>

Re: Spark 2.0 with Kafka 0.10 exception

2016-08-23 Thread Cody Koeninger
You can set that poll timeout higher with spark.streaming.kafka.consumer.poll.ms but half a second is fairly generous. I'd try to take a look at what's going on with your network or kafka broker during that time. On Tue, Aug 23, 2016 at 4:44 PM, Srikanth wrote: > Hello,
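For reference, a sketch of raising that timeout (value hypothetical) via SparkConf:

    val conf = new SparkConf()
      .setAppName("my-streaming-app") // hypothetical name
      .set("spark.streaming.kafka.consumer.poll.ms", "2000")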

Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Cody Koeninger
Were you aware that the spark 2.0 / kafka 0.10 integration also reuses kafka consumer instances on the executors? On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote: > Hi, > > I have released the first version of a new Kafka integration with Spark > that we use in the

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Cody Koeninger
See https://github.com/koeninger/kafka-exactly-once On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed" wrote: > Hi Experts, > > I am looking for some information on how to acheive zero data loss while > working with kafka and Spark. I have searched online and blogs have >

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-08-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430872#comment-15430872 ] Cody Koeninger commented on SPARK-17147: My point is more that this probably isn't just two lines

[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-08-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429742#comment-15429742 ] Cody Koeninger commented on SPARK-17147: Have you successfully used the 0.8 consumer

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread Cody Koeninger
both fault tolerance and > application code/config changes without checkpointing? Is there anything > else which checkpointing gives? I might be missing something. > > > Regards, > Chandan > > > On Thu, Aug 18, 2016 at 8:27 PM, Cody Koeninger <c...@koeninger.org>

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread Cody Koeninger
Checkpointing is not kafka-specific. It encompasses metadata about the application. You can't re-use a checkpoint if your application has changed. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash
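The upgrade limitation follows from how checkpoint recovery works; a minimal sketch (path hypothetical) of the getOrCreate pattern:

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      // build the DStream graph here; it is serialized into the checkpoint
      ssc.checkpoint("/tmp/checkpoint-dir")
      ssc
    }
    // If a checkpoint exists, the old graph and config are restored and
    // createContext is never called -- which is why changed parameters
    // appear to be ignored on restart.
    val ssc = StreamingContext.getOrCreate("/tmp/checkpoint-dir", createContext _)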

Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Cody Koeninger
streaming-kafka-0-10-assembly?? > > Srikanth > > On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Hrrm, that's interesting. Did you try with subscribe pattern, out of >> curiosity? >> >> I haven't tested repartitioning

Re: Structured Streaming with Kafka sources/sinks

2016-08-15 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-15406 I'm not working on it (yet?), never got an answer to the question of who was planning to work on it. On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao wrote: > Hi all, > > > > I’m trying to write Structured Streaming test

Re: read kafka offset from spark checkpoint

2016-08-15 Thread Cody Koeninger
No, you really shouldn't rely on checkpoints if you can't afford to reprocess from the beginning of your retention (or lose data and start from the latest messages). If you're in a real bind, you might be able to get something out of the serialized data in the checkpoint, but it'd probably be

Re: Rebalancing when adding kafka partitions

2016-08-12 Thread Cody Koeninger
logger.info(s"rdd has ${rdd.getNumPartitions} partitions.") > > > Should I be setting some parameter/config? Is the doc for new integ > available? > > Thanks, > Srikanth > > On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >

Re: KafkaUtils.createStream not picking smallest offset

2016-08-12 Thread Cody Koeninger
e same consumer group name. But this is not working though. Somehow > createstream is picking the offset from somewhere other than > /consumers/ from zookeeper > > > Sent from Samsung Mobile. > > > > > > > > > Original message

[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-12 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418747#comment-15418747 ] Cody Koeninger commented on SPARK-16917: It sounds to me like the documentation is clear, because

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
16/08/10 18:16:44 INFO JobScheduler: Added jobs for time 1470833204000 ms > 16/08/10 18:16:45 INFO JobScheduler: Added jobs for time 1470833205000 ms > 16/08/10 18:16:46 INFO JobScheduler: Added jobs for time 1470833206000 ms > 16/08/10 18:16:47 INFO JobScheduler: Added jobs for time 1470833

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
Those logs you're posting are from right after your failure; they don't include what actually went wrong when attempting to read json. Look at your logs more carefully. On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi" wrote: > Hi Siva, > > With below code, it is stuck

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
;dataFrame.foreach(println) >} >else >{ > println("Empty DStream ") >}*/ > }) > > On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Take out the conditional and the sqlcontext and just do >>

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
Take out the conditional and the sqlcontext and just do rdd => { rdd.foreach(println) } as a baseline to see if you're reading the data you expect On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi wrote: > Hi, > > I am reading json messages from kafka. Topics
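Spelled out, the suggested baseline (stream name hypothetical):

    stream.foreachRDD { rdd =>
      rdd.foreach(println) // verify records arrive before adding SQL logic
    }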

Re: Kafka Support new topic subscriptions without requiring restart of the streaming context

2016-08-08 Thread Cody Koeninger
The Kafka 0.10 support in spark 2.0 allows for pattern based topic subscription On Aug 8, 2016 1:12 AM, "r7raul1...@163.com" wrote: > How to add new topic to kafka without requiring restart of the streaming > context? > > -- > r7raul1...@163.com >
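A minimal sketch (pattern hypothetical) of that subscription style; topics created later that match the pattern are picked up without restarting the context:

    import java.util.regex.Pattern
    import org.apache.spark.streaming.kafka010._

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](
        Pattern.compile("events-.*"), kafkaParams)
    )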

[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410623#comment-15410623 ] Cody Koeninger commented on SPARK-16917: I think the doc changes I submitted make it pretty clear

Re: How does MapWithStateRDD distribute the data

2016-08-03 Thread Cody Koeninger
Are you using KafkaUtils.createDirectStream? On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri wrote: > Hi, > > I am running a steaming job with 4 executors and 16 cores so that each > executor has two cores to work with. The input Kafka topic has 4 partitions. > With

Re: How to get recommand result for users in a kafka SparkStreaming Application

2016-08-03 Thread Cody Koeninger
MatrixFactorizationModel is serializable. Instantiate it on the driver, not on the executors. On Wed, Aug 3, 2016 at 2:01 AM, wrote: > hello guys: > I have an app which consumes json messages from kafka and recommend > movies for the users in those messages ,the
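A sketch of that pattern (model path and recommendation count hypothetical), loading the model once on the driver and keeping its method calls there, assuming the stream carries Int user ids:

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    val model = MatrixFactorizationModel.load(sc, "/models/als") // load once, on the driver
    stream.foreachRDD { rdd =>
      val userIds = rdd.collect() // small batch of user ids from Kafka
      userIds.foreach { userId =>
        val recs = model.recommendProducts(userId, 5) // driver-side call
        recs.foreach(println)
      }
    }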

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
leq...@gmail.com> wrote: > How to do that? if I put the queue inside .transform operation, it doesn't > work. > > On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Can you keep a queue per executor in memory? >> >> On Mon

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
roblem in detail here: > https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok > > Could you please give me some suggestions or advice to fix this problem? > > Thanks > > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote: >&

[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-07-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401208#comment-15401208 ] Cody Koeninger commented on SPARK-16534: It's on the PR. Yes, one committer veto is generally

[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-07-31 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401128#comment-15401128 ] Cody Koeninger commented on SPARK-16534: This idea got a -1 from Reynold, so unless anyone's

Re: Clarifying that spark-x.x.x-bin-hadoopx.x.tgz doesn't include Hadoop itself

2016-07-29 Thread Cody Koeninger
Yeah, and the "without hadoop" package was even more confusing... because if you weren't using hdfs at all, you still needed to download one of the hadoop-x packages in order to get hadoop io classes used by almost everything. :) On Fri, Jul 29, 2016 at 3:06 PM, Marcelo Vanzin wrote:

Re: sampling operation for DStream

2016-07-29 Thread Cody Koeninger
In most stream systems you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if they were evenly balanced. But once you've read the messages, nothing's stopping you from filtering

[jira] [Commented] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-28 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397520#comment-15397520 ] Cody Koeninger commented on SPARK-16746: From conversation on mailing list, it wasn't cl

[jira] [Commented] (SPARK-16762) spark hanging when action method print

2016-07-28 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397516#comment-15397516 ] Cody Koeninger commented on SPARK-16762: Couple things - probably better to bring this kind

Re: read only specific jsons

2016-07-27 Thread Cody Koeninger
> > clickDF = cDF.filter(cDF['request.clientIP'].isNotNull()) > > It fails for some cases and errors our with below message > > AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1, > nscClientIP2, uAgent;' > > > On Tue, Jul 26, 2016 at 12:05 PM,

Re: read only specific jsons

2016-07-26 Thread Cody Koeninger
Have you tried filtering out corrupt records with something along the lines of df.filter(df("_corrupt_record").isNull) On Tue, Jul 26, 2016 at 1:53 PM, vr spark wrote: > i am reading data from kafka using spark streaming. > > I am reading json and creating dataframe. > I
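As a sketch, guarding for the case where the column may not exist (dataframe names hypothetical):

    val df = sqlContext.read.json(rdd) // malformed lines land in _corrupt_record
    val good =
      if (df.columns.contains("_corrupt_record")) df.filter(df("_corrupt_record").isNull)
      else df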

Re: Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Cody Koeninger
Can you go ahead and open a Jira ticket with that explanation? Is there a reason you need to use receivers instead of the direct stream? On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao wrote: > Hi guys, > > I wrote a spark streaming program which consume 1000 messages from

Re: Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Cody Koeninger
This seems really low risk to me. In order to be impacted, it'd have to be someone who was using the kafka integration in spark 2.0, which isn't even officially released yet. On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian wrote: > Sorry, meant to ask if any Apache

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Cody Koeninger
For 2.0, the kafka dstream support is in two separate subprojects depending on which version of Kafka you are using: spark-streaming-kafka-0-10 or spark-streaming-kafka-0-8, corresponding to brokers that are version 0.10+ or 0.8+ On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin
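For build files, the corresponding sbt coordinates would look like this (version shown is the 2.0.0 release):

    // build.sbt -- pick the artifact matching your broker version
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
    // or, for 0.8+ brokers:
    // libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"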

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
s example. > Looks simple but its not very obvious how it works :-) > I'll watch out for the docs and ScalaDoc. > > Srikanth > > On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> No, restarting from a checkpoint won't do it, you

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
-0.10 On Fri, Jul 22, 2016 at 1:05 PM, Srikanth <srikanth...@gmail.com> wrote: > In Spark 1.x, if we restart from a checkpoint, will it read from new > partitions? > > If you can, pls point us to some doc/link that talks about Kafka 0.10 integ > in Spark 2.0. > > On Fri, Ju

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
For the integration for kafka 0.8, you are literally starting a streaming job against a fixed set of topicpartitions. It will not change throughout the job, so you'll need to restart the spark job if you change kafka partitions. For the integration for kafka 0.10 / spark 2.0, if you use

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-07-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389797#comment-15389797 ] Cody Koeninger commented on SPARK-12177: This has already been merged for the upcoming Spark 2.0

Re: Latest 200 messages per topic

2016-07-20 Thread Cody Koeninger
the data per key sorted with timestamp, I will always > get the latest 4 ts data on get(key). Spark streaming will get the ID from > Kafka, then read the data from HBASE using get(ID). This will eliminate > usage of Windowing from Spark-Streaming. Is it good to use? > > Regards,

Re: Latest 200 messages per topic

2016-07-19 Thread Cody Koeninger
Unless you're using only 1 partition per topic, there's no reasonable way of doing this. Offsets for one topicpartition do not necessarily have anything to do with offsets for another topicpartition. You could do the last (200 / number of partitions) messages per topicpartition, but you have no
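A sketch of the per-topicpartition variant with the 0.8 batch API, assuming a hypothetical latestOffsets map of TopicAndPartition to last offset that you've already looked up (e.g. via getLatestLeaderOffsets):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val n = 200 / numPartitions // messages to take per partition
    val offsetRanges = latestOffsets.map { case (tp, until) =>
      OffsetRange(tp.topic, tp.partition, math.max(0L, until - n), until)
    }.toArray

    // kafkaParams here is the 0.8-style Map[String, String] with metadata.broker.list
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)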

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Cody Koeninger
Yes, if you need more parallelism, you need to either add more kafka partitions or shuffle in spark. Do you actually need the dataframe api, or are you just using it as a way to infer the json schema? Inferring the schema is going to require reading through the RDD once before doing any other
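If the dataframe API is only being used for schema inference, supplying an explicit schema (field names hypothetical) avoids the extra pass over the data, assuming rdd is an RDD of JSON strings:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("ts", LongType)))
    val df = sqlContext.read.schema(schema).json(rdd)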

Re: Complications with saving Kafka offsets?

2016-07-15 Thread Cody Koeninger
The bottom line short answer for this is that if you actually care about data integrity, you need to store your offsets transactionally alongside your results in the same data store. If you're ok with double-counting in the event of failures, saving offsets _after_ saving your results, using
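In outline, the transactional version, with the storage helpers left hypothetical:

    rdd.foreachPartition { iter =>
      val conn = getConnection() // hypothetical JDBC helper
      conn.setAutoCommit(false)
      try {
        writeResults(conn, iter)          // hypothetical: this partition's results
        updateOffsets(conn, offsetRange)  // hypothetical: offsets in the same transaction
        conn.commit()                     // results and offsets succeed or fail together
      } catch {
        case e: Exception => conn.rollback(); throw e
      } finally {
        conn.close()
      }
    }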

Re: Spark Streaming - Direct Approach

2016-07-15 Thread Cody Koeninger
We've been running direct stream jobs in production for over a year, with uptimes in the range of months. I'm pretty slammed with work right now, but when I get time to submit a PR for the 0.10 docs i'll remove the experimental note from 0.8 On Mon, Jul 11, 2016 at 4:35 PM, Tathagata Das

[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-07-14 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378240#comment-15378240 ] Cody Koeninger commented on SPARK-16534: [~jerryshao] if you want to work on this, go

Re: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Cody Koeninger
Maybe obvious, but what happens when you change the s3 write to a println of all the data? That should identify whether it's the issue. count() and read.json() will involve additional tasks (run through the items in the rdd to count them, likewise to infer the schema) but for 300 records that

Re: Why is KafkaUtils.createRDD offsetRanges an Array rather than a Seq?

2016-07-08 Thread Cody Koeninger
Yeah, it's a reasonable lowest common denominator between java and scala, and what's passed to that convenience constructor is actually what's used to construct the class. FWIW, in the 0.10 direct stream api when there's unavoidable wrapping / conversion anyway (since the underlying class takes a

Re: Memory grows exponentially

2016-07-08 Thread Cody Koeninger
Just as an offhand guess, are you doing something like updateStateByKey without expiring old keys? On Fri, Jul 8, 2016 at 2:44 AM, Jörn Franke wrote: > Memory fragmentation? Quiet common with in-memory systems. > >> On 08 Jul 2016, at 08:56, aasish.kumar
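For context, state in updateStateByKey is only dropped when the update function returns None; a sketch (expiry window hypothetical) of expiring idle keys:

    case class Counter(count: Long, lastSeen: Long)

    val updated = pairs.updateStateByKey[Counter] { (values: Seq[Long], state: Option[Counter]) =>
      val now = System.currentTimeMillis()
      val prev = state.getOrElse(Counter(0L, now))
      if (values.isEmpty && now - prev.lastSeen > 3600 * 1000L) None // drop idle keys after ~1h
      else Some(Counter(prev.count + values.sum,
                        if (values.nonEmpty) now else prev.lastSeen))
    }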

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
In any case thanks, now I understand how to use Spark. > > PS: I will continue work with Spark but to minimize emails stream I plan to > unsubscribe from this mail list > > 2016-07-06 18:55 GMT+02:00 Cody Koeninger <c...@koeninger.org>: >> >> If you aren't proce

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
s ok. If there are peaks of load beyond the capacity of the > computational system, or data-dependent calculation time, Spark is not > able to provide a periodically stable results output. Sometimes this is > appropriate but sometimes this is not appropriate. > > 2016-0

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
, it works but throughput is much less than without limitations > because this is an absolute upper limit. And processing time is half > of what is available. > > Regarding Spark 2.0 structured streaming I will look at it later. Now I > don't know how to strictly measure th

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
t; that Flink will strictly terminate processing of messages by time. Deviation > of the time window from 10 seconds to several minutes is impossible. > > PS: I prepared this example to make possible easy observe the problem and > fix it if it is a bug. For me it is obvious. May I ask y

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
g speed in Spark's app is near the speed of data generation, all > is ok. I added delayFactor in > https://github.com/rssdev10/spark-kafka-streaming/blob/master/src/main/java/SparkStreamingConsumer.java > to emulate slow processing. And the streaming process is in degradation. When

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Cody Koeninger
I know some usages of the 0.10 kafka connector will be broken until https://github.com/apache/spark/pull/14026 is merged, but the 0.10 connector is a new feature, so not blocking. Sean, I'm assuming the DirectKafkaStreamSuite failure you saw was for 0.8? I'll take another look at it. On Wed,

Re: Why's ds.foreachPartition(println) not possible?

2016-07-05 Thread Cody Koeninger
I don't think that's a scala compiler bug. println is a valid expression that returns Unit. Unit is not a single-argument function, and does not match any of the overloads of foreachPartition. You may be used to a conversion taking place when println is passed to a method expecting a function, but
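Spelled out, a version that does compile (element type hypothetical):

    // println alone is an expression of type Unit, not an Iterator[T] => Unit
    // function value, so pass an explicit function instead:
    ds.foreachPartition((it: Iterator[String]) => it.foreach(println))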

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
. But in any case I need first response after 10 second. Not minutes > or hours after. > > Thanks. > > > > 2016-07-05 17:12 GMT+02:00 Cody Koeninger <c...@koeninger.org>: >> >> If you're talking about limiting the number of messages per batch to &g

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
If you're talking about limiting the number of messages per batch to try and keep from exceeding batch time, see http://spark.apache.org/docs/latest/configuration.html and look for backpressure and maxRatePerPartition. But if you're only seeing zeros after your job runs for a minute, it sounds like
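For reference, the relevant settings (rate value hypothetical):

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // msgs/sec/partition cap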

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Cody Koeninger
If it's a batch job, don't use a stream. You have to store the offsets reliably somewhere regardless. So it sounds like your only issue is with identifying offsets per partition? Look at KafkaCluster.scala, methods getEarliestLeaderOffsets / getLatestLeaderOffsets. On Tue, Jul 5, 2016 at 7:40

[jira] [Created] (SPARK-16359) unidoc workaround for multiple kafka versions

2016-07-03 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-16359: -- Summary: unidoc workaround for multiple kafka versions Key: SPARK-16359 URL: https://issues.apache.org/jira/browse/SPARK-16359 Project: Spark Issue Type

Re: Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
e're on bare metal. > > the test launch code executes this for each build: # Generate random port for Zinc > export ZINC_PORT > ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") > > On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger &l

Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
Can someone familiar with amplab's jenkins setup clarify whether all tests running at a given time are competing for network ports, or whether there's some sort of containerization being done? Based on the use of Utils.startServiceOnPort in the tests, I'd assume the former.

[jira] [Created] (SPARK-16312) Docs for Kafka 0.10 consumer integration

2016-06-29 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-16312: -- Summary: Docs for Kafka 0.10 consumer integration Key: SPARK-16312 URL: https://issues.apache.org/jira/browse/SPARK-16312 Project: Spark Issue Type: Sub

[jira] [Created] (SPARK-16212) code cleanup of kafka-0-8 to match review feedback on 0-10

2016-06-25 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-16212: -- Summary: code cleanup of kafka-0-8 to match review feedback on 0-10 Key: SPARK-16212 URL: https://issues.apache.org/jira/browse/SPARK-16212 Project: Spark

Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
ile an issue? >> >> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams <disc...@uw.edu> >> wrote: >>> Thanks @Cody, I will try that out. In the interim, I tried to validate >>> my Hbase cluster by running a random write test and see 30-40K writes >>>

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger
That looks like a classpath problem. You should not have to include the kafka_2.10 artifact in your pom; spark-streaming-kafka_2.10 already has a transitive dependency on it. That being said, 0.8.2.1 is the correct version, so that's a little strange. How are you building and submitting your

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Cody Koeninger
On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> As far as I know the only thing blocking it at this point is lack of >> committer review / approval. >> >> It's technically adding a new feature after spark code-freeze, but it
