[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204421#comment-15204421 ] Cody Koeninger commented on SPARK-12177: I made a PR with my changes for discussion's sake

[jira] [Commented] (SPARK-13939) Kafka createDirectStream not parallelizing properly

2016-03-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204404#comment-15204404 ] Cody Koeninger commented on SPARK-13939: Are you actually certain that more than one of your

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Cody Koeninger
oo sure if this is an issue with spark engine or with the > streaming module. Please let me know if you need more logs or you want me to > raise a github issue/JIRA. > > Sorry for digressing on the original thread. > > On Fri, Mar 18, 2016 at 8:10 PM, Cody Koeninger <c...@koeninger.org>

Re: How to collect data for some particular point in spark streaming

2016-03-21 Thread Cody Koeninger
Kafka doesn't have an accurate time-based index. Your options are to maintain an index yourself, or start at a sufficiently early offset and filter messages. On Mon, Mar 21, 2016 at 7:28 AM, Nagu Kothapalli wrote: > Hi, > > > I Want to collect data from kafka ( json
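For illustration, a minimal sketch of the filter approach against a direct stream, assuming each message value carries an embedded timestamp (extractTimestamp and the field layout are hypothetical):

    // hypothetical parser for your message format
    def extractTimestamp(json: String): Long = ???

    val startTime = 1458000000000L  // earliest time you care about, epoch millis
    // start the stream at a sufficiently early offset, then drop older records
    val recent = stream.filter { case (_, value) => extractTimestamp(value) >= startTime }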

[jira] [Commented] (SPARK-13939) Kafka createDirectStream not parallelizing properly

2016-03-20 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199912#comment-15199912 ] Cody Koeninger commented on SPARK-13939: Do not use pprint(), that's just the python name

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-19 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203027#comment-15203027 ] Cody Koeninger commented on SPARK-12177: Unless I'm misunderstanding your point, those changes

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Cody Koeninger
minimum, dev@ discussion (like this one) should be initiated. >> As PMC is responsible for the project assets (including code), signoff >> is required for it IMO. >> >> More experienced Apache members might be opine better in case I got it >> wrong

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
running I'm seeing it retry correctly. However, I am > having trouble getting the job started - number of retries does not seem to > help with startup behavior. > > Thoughts? > > Regards, > > Bryan Jeffrey > > On Fri, Mar 18, 2016 at 10:44 AM, Cody Koeninger <c...@ko

[jira] [Commented] (SPARK-13939) Kafka createDirectStream not parallelizing properly

2016-03-19 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199638#comment-15199638 ] Cody Koeninger commented on SPARK-13939: When you say print to screen, are you using print

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
That's a networking error when the driver is attempting to contact leaders to get the latest available offsets. If it's a transient error, you can look at increasing the value of spark.streaming.kafka.maxRetries, see http://spark.apache.org/docs/latest/configuration.html If it's not a transient
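As a sketch, the retry setting mentioned above goes on the SparkConf (the value here is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-direct")
      // how many times the driver retries fetching leader offsets
      // before giving up (default is 1)
      .set("spark.streaming.kafka.maxRetries", "3")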

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-19 Thread Cody Koeninger
Is that happening only at startup, or during processing? If that's happening during normal operation of the stream, you don't have enough resources to process the stream in time. There's not a clean way to deal with that situation, because it's a violation of preconditions. If you want to

Re: Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-19 Thread Cody Koeninger
There's 1 topic per partition, so you're probably better off dealing with topics that way rather than at the individual message level. http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers Look at the discussion of "HasOffsetRanges" If you
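A sketch of the per-partition approach referenced above, assuming stream is the DStream returned directly by createDirectStream: the offset ranges carry the topic name, so tag each partition's records once rather than pairing topic with every message:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val tagged = stream.transform { rdd =>
      // offsetRanges(i) describes the kafka topicpartition backing rdd partition i
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, records) =>
        val topic = ranges(i).topic
        records.map { case (_, value) => (topic, value) }
      }
    }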

[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-19 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200152#comment-15200152 ] Cody Koeninger commented on SPARK-13877: Thumbs down on renaming the package name as well... from

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-19 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197536#comment-15197536 ] Cody Koeninger commented on SPARK-12177: My fork is working at a very basic level for caching

Re: SPARK-13843 and future of streaming backends

2016-03-18 Thread Cody Koeninger
Why would a PMC vote be necessary on every code deletion? There was a Jira and pull request discussion about the submodules that have been removed so far. https://issues.apache.org/jira/browse/SPARK-13843 There's another ongoing one about Kafka specifically

Re: SPARK-13843 and future of streaming backends

2016-03-18 Thread Cody Koeninger
arty external repositories - not owned by Apache. >> >> At a minimum, dev@ discussion (like this one) should be initiated. >> As PMC is responsible for the project assets (including code), signoff >> is required for it IMO. >> >> More experienced Apache members mi

[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-15 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195973#comment-15195973 ] Cody Koeninger commented on SPARK-13877: [~hshreedharan] They aren't compatible from an api

[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-15 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195870#comment-15195870 ] Cody Koeninger commented on SPARK-13877: I don't think it makes sense to put kafka 0.8 and kafka

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-15 Thread Cody Koeninger
like when they use the existing DStream.repartition where original > per-topicpartition in-order processing is also not observed any more. > > Do you agree? > > On Thu, Mar 10, 2016 at 12:12 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> The central problem wit

[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-15 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195435#comment-15195435 ] Cody Koeninger commented on SPARK-13877: I agree that it's a good idea to move everything that's

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-15 Thread Cody Koeninger
en the seek since position() will block to get > the new offset. > > -Jason > > On Mon, Mar 14, 2016 at 2:37 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Sorry, by metadata I also meant the equivalent of the old >> OffsetRequest api, which partitionsFor

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
nsuming > messages. We'd probably want to understand why those are insufficient > before considering new APIs. > > -Jason > > On Mon, Mar 14, 2016 at 12:17 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Regarding the rebalance listener, in the case of the spark >

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-14 Thread Cody Koeninger
out some specific change on the consumer API, please > feel free to create a new KIP with the detailed motivation and proposed > modifications. > > Guozhang > > On Fri, Mar 11, 2016 at 12:28 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Is there a KIP or Jira relat

Re: Problem running JavaDirectKafkaWordCount

2016-03-14 Thread Cody Koeninger
Sounds like the jar you built doesn't include the dependencies (in this case, the spark-streaming-kafka subproject). When you use spark-submit to submit a job to spark, you need to either specify all dependencies as additional --jars arguments (which is a pain), or build an uber-jar containing

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Cody Koeninger
s://github.com/guptamukul/sparktest.git > > ____ > From: Cody Koeninger <c...@koeninger.org> > Sent: 11 March 2016 23:04 > To: Mukul Gupta > Cc: user@spark.apache.org > Subject: Re: Kafka + Spark streaming, RDD partitions not processed in

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-11 Thread Cody Koeninger
ee >> that new partitions should be consumed automatically). I guess we can >> continue this discussion on the spark list then :-) >> >> Thanks >> Mansi. >> >> On Thu, Mar 10, 2016 at 7:43 AM, Cody Koeninger <c...@koeninger.org> >> wrote:

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
s, String.class, > StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet); > > JavaDStream<String> processed = messages.map(new Function<Tuple2<String, > String>, String>() { > > @Override > public String call(Tuple2<String, String> arg0) throws Exception {

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger
Can you post your actual code? On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta wrote: > Hi All, I was running the following test: Setup 9 VM runing spark workers > with 1 spark executor each. 1 VM running kafka and spark master. Spark > version is 1.6.0 Kafka version is

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-10 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190103#comment-15190103 ] Cody Koeninger commented on SPARK-12177: There are a lot of things I'm really not happy with so

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Cody Koeninger
The central problem with doing anything like this is that you break one of the basic guarantees of kafka, which is in-order processing on a per-topicpartition basis. As far as PRs go, because of the new consumer interface for kafka 0.9 and 0.10, there's a lot of potential change already underway.

Re: Problem with union of DirectStream

2016-03-10 Thread Cody Koeninger
If you do any RDD transformation, it's going to return a different RDD than the original. The implication for casting to HasOffsetRanges is specifically called out in the docs at http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers On Thu,

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-10 Thread Cody Koeninger
.@gmail.com> wrote: > >> In order to do anything meaningful with the consumer itself in rebalance >> callback (e.g. commit offset), you would need to hold on the consumer >> reference; admittedly it sounds a bit awkward, but by design we choose to >> not enforce it in

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-10 Thread Cody Koeninger
> callback (e.g. commit offset), you would need to hold on the consumer > reference; admittedly it sounds a bit awkward, but by design we choose to > not enforce it in the interface itself. > > Guozhang > > On Wed, Mar 9, 2016 at 3:39 PM, Cody Koeninger <c...@koeninger.org> w

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-10 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189325#comment-15189325 ] Cody Koeninger commented on SPARK-12177: Clearly K and V are serializable somehow, because

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
> On Wed, Mar 9, 2016 at 2:11 PM, Guozhang Wang <wangg...@gmail.com> wrote: > > > Filed https://issues.apache.org/jira/browse/KAFKA-3370. > > > > On Wed, Mar 9, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org> > wrote: > > > >> That sounds like

[jira] [Commented] (KAFKA-3370) Add options to auto.offset.reset to reset offsets upon initialization only

2016-03-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/KAFKA-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188181#comment-15188181 ] Cody Koeninger commented on KAFKA-3370: So would case 1 also include addition of an entirely new

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
> me know if you have any other ideas. > > Guozhang > > On Wed, Mar 9, 2016 at 12:25 PM, Cody Koeninger <c...@koeninger.org> wrote: > >> Yeah, I think I understood what you were saying. What I'm saying is >> that if there were a way to just fetch metadata without

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
s going to get all the partitions assigned > to itself (i.e. you are only running a single instance). > > Guozhang > > > On Wed, Mar 9, 2016 at 6:22 AM, Cody Koeninger <c...@koeninger.org> wrote: > >> Another unfortunate thing about ConsumerRebalanceListener is t

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187847#comment-15187847 ] Cody Koeninger commented on SPARK-12177: Anybody want to volunteer to fight the good fight

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
e relevance to what > you're talking about. > > Perhaps if both functions (the one with partitions arg and the one without) > returned just ConsumerRecord, I would like that more. > > - Alan > > On Tue, Mar 8, 2016 at 6:49 AM, Cody Koeninger <c...@koeninger.org> wrote: >

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
> On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Spark streaming by default will not start processing a batch until the >> current batch is finished. So if your processing time is larger than >> your batch time, delays will buil

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187613#comment-15187613 ] Cody Koeninger commented on SPARK-12177: If anyone wants to take a look at the stuff I'm hacking

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
Spark streaming by default will not start processing a batch until the current batch is finished. So if your processing time is larger than your batch time, delays will build up. On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal wrote: > Hi All, > > we have batchTime

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
to the consumer. Seems like this makes it unnecessarily awkward to serialize or provide a 0 arg constructor for the listener. On Wed, Mar 9, 2016 at 7:28 AM, Cody Koeninger <c...@koeninger.org> wrote: > I thought about ConsumerRebalanceListener, but seeking to the > beginning any time there's

Re: seekToBeginning doesn't work without auto.offset.reset

2016-03-09 Thread Cody Koeninger
@Override > public void onPartitionsAssigned(Collection<TopicPartition> > partitions) { > consumer.seekToBeginning(partitions.toArray(new > TopicPartition[0])); > } > }; > > consumer.subscribe(topics, listener); > > On

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
ory > > When I try ./bin/spark-shell --master local[2] > > I get: no such file or directory > Failed to find spark assembly, you need to build Spark before running this > program > > > > Sent from my iPhone > >> On 8 Mar 2016, at 21:50, Cody Koe

seekToBeginning doesn't work without auto.offset.reset

2016-03-08 Thread Cody Koeninger
Using the 0.9 consumer, I would like to start consuming at the beginning or end, without specifying auto.offset.reset. This does not seem to be possible: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> conf.getString("kafka.brokers"), "key.deserializer" ->
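The workaround that emerges later in this thread is to do the seek from a rebalance listener, once partitions are actually assigned. A rough sketch against the 0.9 consumer, assuming props holds bootstrap.servers and the deserializers (topic name illustrative):

    import java.util.{Collection => JCollection}
    import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition
    import scala.collection.JavaConverters._

    val consumer = new KafkaConsumer[String, String](props)
    val listener = new ConsumerRebalanceListener {
      def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
        // seeking before assignment has no effect; here the partitions are known
        consumer.seekToBeginning(partitions.asScala.toSeq: _*)
      def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = ()
    }
    consumer.subscribe(List("mytopic").asJava, listener)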

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
aster.sh" > > Thanks, > > Aida > Sent from my iPhone > >> On 8 Mar 2016, at 19:02, Cody Koeninger <c...@koeninger.org> wrote: >> >> You said you downloaded a prebuilt version. >> >> You shouldn't have to mess with maven or building spark at all

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
You said you downloaded a prebuilt version. You shouldn't have to mess with maven or building spark at all. All you need is a jvm, which it looks like you already have installed. You should be able to follow the instructions at http://spark.apache.org/docs/latest/ and

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Cody Koeninger
No, looks like you'd have to catch them in the serializer and have the serializer return option or something. The new consumer builds a buffer full of records, not one at a time. On Mar 8, 2016 4:43 AM, "Marius Soutier" <mps@gmail.com> wrote: > > > On 04.03.2016, at

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-07 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184387#comment-15184387 ] Cody Koeninger commented on SPARK-12177: I've been hacking on a simple lru cache for consumers

[jira] [Commented] (SPARK-13707) Streaming UI tab misleading for window operations

2016-03-07 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183129#comment-15183129 ] Cody Koeninger commented on SPARK-13707: To be clear, is this a problem with the UI only? I.e

Use cases for kafka direct stream messageHandler

2016-03-04 Thread Cody Koeninger
Wanted to survey what people are using the direct stream messageHandler for, besides just extracting key / value / offset. Would your use case still work if that argument was removed, and the stream just contained ConsumerRecord objects

Re: getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Cody Koeninger
> What do you mean by consistent? Throughout the life cycle of an app, the >> executors can come and go and as a result really has no consistency. Do you >> just need it for a specific job? >> >> >> >> On Thu, Mar 3, 2016 at 3:08 PM, Cody Koeninger <c...@koeni

getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Cody Koeninger
I need getPreferredLocations to choose a consistent executor for a given partition in a stream. In order to do that, I need to know what the current executors are. I'm currently grabbing them from the block manager master .getPeers(), which works, but I don't know if that's the most reasonable

Re: Upgrading to Kafka 0.9.x

2016-03-02 Thread Cody Koeninger
Jay, thanks for the response. Regarding the new consumer API for 0.9, I've been reading through the code for it and thinking about how it fits in to the existing Spark integration. So far I've seen some interesting challenges, and if you (or anyone else on the dev list) have time to provide some

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-02 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176037#comment-15176037 ] Cody Koeninger commented on SPARK-12177: How is it a huge hassle to keep the known working

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174507#comment-15174507 ] Cody Koeninger commented on SPARK-12177: Thanks for the example of performance numbers

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174415#comment-15174415 ] Cody Koeninger commented on SPARK-12177: Mansi are you talking about performance improvements

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174212#comment-15174212 ] Cody Koeninger commented on SPARK-12177: I'm happy to help in whatever way. If people think

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
essing further stages and fetching > next batch. > > I will start with higher number of executor cores and see how it goes. > > -- > Thanks > Jatin Kumar | Rocket Scientist > +91-7696741743 m > > On Tue, Mar 1, 2016 at 9:07 PM, Cody Koeninger <c...@koeninger.org> wrote: &

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174092#comment-15174092 ] Cody Koeninger commented on SPARK-12177: My thoughts so far Must-haves: - The major new feature

Re: Spark Streaming: java.lang.StackOverflowError

2016-03-01 Thread Cody Koeninger
What code is triggering the stack overflow? On Mon, Feb 29, 2016 at 11:13 PM, Vinti Maheshwari wrote: > Hi All, > > I am getting below error in spark-streaming application, i am using kafka > for input stream. When i was doing with socket, it was working fine. But > when i

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
> "How do I keep a balance of executors which receive data from Kafka and which process data" I think you're misunderstanding how the direct stream works. The executor which receives data is also the executor which processes data, there aren't separate receivers. If it's a single stage worth of

[jira] [Commented] (SPARK-13569) Kafka DStreams from wildcard topic filters

2016-03-01 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173895#comment-15173895 ] Cody Koeninger commented on SPARK-13569: This is probably reasonable, but SPARK-12177

Re: perl Kafka::Producer, “Kafka::Exception::Producer”, “code”, -1000, “message”, "Invalid argument

2016-02-29 Thread Cody Koeninger
Does this issue involve Spark at all? Otherwise you may have better luck on a perl or kafka related list. On Mon, Feb 29, 2016 at 3:26 PM, Vinti Maheshwari wrote: > Hi All, > > I wrote kafka producer using kafka perl api, But i am getting error when i > am passing

Re: kafka + mysql filtering problem

2016-02-29 Thread Cody Koeninger
You're getting confused about what code is running on the driver vs what code is running on the executor. Read http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka On Mon, Feb 29, 2016 at 8:00 AM, franco barrientos <

Re: kafka streaming topic partitions vs executors

2016-02-26 Thread Cody Koeninger
Spark in general isn't a good fit if you're trying to make sure that certain tasks only run on certain executors. You can look at overriding getPreferredLocations and increasing the value of spark.locality.wait, but even then, what do you do when an executor fails? On Fri, Feb 26, 2016 at 8:08

Re: How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-26 Thread Cody Koeninger
in Kafka? > > Sent from WPS Mail client > On Feb 25, 2016 at 11:58 AM, Cody Koeninger <c...@koeninger.org> wrote: > > The per partition offsets are part of the rdd as defined on the driver. > Have you read > > https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md > >

Re: How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-24 Thread Cody Koeninger
The per partition offsets are part of the rdd as defined on the driver. Have you read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md and/or watched https://www.youtube.com/watch?v=fXnNEq1v3VA On Wed, Feb 24, 2016 at 9:05 PM, Yuhang Chen wrote:

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
he/javax.activation/activation/jars/activation-1.1.jar:javax/activation/ActivationDataFlavor.class* > > Here is complete error log: > https://gist.github.com/Vibhuti/07c24d2893fa6e520d4c > > > Regards, > ~Vinti > > On Wed, Feb 24, 2016 at 12:16 PM, Cody Koeninger <c.

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
y. I need to check how to use sbt > assembly. > > Regards, > ~Vinti > > On Wed, Feb 24, 2016 at 11:10 AM, Cody Koeninger <c...@koeninger.org> > wrote: > >> Are you using sbt assembly? That's what will include all of the >> non-provided dependencies in a s

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
encies ++= Seq( > "org.apache.spark" % "spark-streaming_2.10" % "1.5.2", > "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.2" > ) > > > > Regards, > ~Vinti > > On Wed, Feb 24, 2016 at 9:33 AM, Cody K

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
spark streaming is provided, kafka is not. This build file https://github.com/koeninger/kafka-exactly-once/blob/master/build.sbt includes some hacks for ivy issues that may no longer be strictly necessary, but try that build and see if it works for you. On Wed, Feb 24, 2016 at 11:14 AM, Vinti
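In build.sbt terms, the suggestion amounts to something like this (versions and artifact style taken from the question above):

    libraryDependencies ++= Seq(
      // provided: spark-submit puts the streaming classes on the classpath at runtime
      "org.apache.spark" % "spark-streaming_2.10" % "1.5.2" % "provided",
      // not provided: the kafka integration must end up inside your assembly jar
      "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.2"
    )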

Re: Kafka partition increased while Spark Streaming is running

2016-02-24 Thread Cody Koeninger
That's correct, when you create a direct stream, you specify the topicpartitions you want to be a part of the stream (the other method for creating a direct stream is just a convenience wrapper). On Wed, Feb 24, 2016 at 2:15 AM, 陈宇航 wrote: > Here I use the

Re: Read from kafka after application is restarted

2016-02-22 Thread Cody Koeninger
The direct stream will let you do both of those things. Is there a reason you want to use receivers? http://spark.apache.org/docs/latest/streaming-kafka-integration.html http://spark.apache.org/docs/latest/configuration.html#spark-streaming look for maxRatePerPartition On Mon, Feb 22, 2016

Kafka connector mention in Matei's keynote

2016-02-18 Thread Cody Koeninger
I saw this slide: http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433 Didn't see the talk - was this just referring to the existing work on the spark-streaming-kafka subproject, or is someone actually working on

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread Cody Koeninger
g : > Can this issue be resolved by having a smaller block interval? > > Regards, > Praveen > On 18 Feb 2016 21:30, "praveen S" <mylogi...@gmail.com> wrote: > >> Can having a smaller block interval only resolve this? >> >> Regards, >> Praveen &

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread Cody Koeninger
Backpressure won't help you with the first batch, you'd need spark.streaming.kafka.maxRatePerPartition for that On Thu, Feb 18, 2016 at 9:40 AM, praveen S wrote: > Have a look at > > spark.streaming.backpressure.enabled > Property > > Regards, > Praveen > On 18 Feb 2016
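A sketch of capping that first batch (the rate value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // max messages per second per kafka partition, applied from the very first
      // batch, unlike backpressure which needs feedback from completed batches
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")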

Re: Spark Streaming with Kafka DirectStream

2016-02-17 Thread Cody Koeninger
You can print whatever you want wherever you want, it's just a question of whether it's going to show up on the driver or the various executors logs On Wed, Feb 17, 2016 at 5:50 AM, Cyril Scetbon wrote: > I don't think we can print an integer value in a spark streaming

Re: Spark Streaming with Kafka Use Case

2016-02-17 Thread Cody Koeninger
Just use a kafka rdd in a batch job or two, then start your streaming job. On Wed, Feb 17, 2016 at 12:57 AM, Abhishek Anand wrote: > I have a spark streaming application running in production. I am trying to > find a solution for a particular use case when my
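A sketch of that batch-first approach using createRDD from spark-streaming-kafka with explicit offset ranges (topic, offsets, and output path are hypothetical; kafkaParams is assumed to contain metadata.broker.list):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // process a fixed slice of the topic as a plain batch job first
    val ranges = Array(OffsetRange("events", 0, 0L, 100000L))
    val batchRdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)
    batchRdd.map(_._2).saveAsTextFile("hdfs:///backfill/events")
    // ...then start the streaming job from where the batch left off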

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Cody Koeninger
You could use sc.parallelize... but the offsets are already available at the driver, and they're a (hopefully) small enough amount of data that's it's probably more straightforward to just use the normal cassandra client to save them from the driver. On Tue, Feb 16, 2016 at 1:15 AM, Abhishek
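A sketch of what that looks like, with the offset ranges read on the driver at the start of each batch (saveOffset stands in for whatever Cassandra client call you use):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // this cast runs on the driver; the ranges are plain metadata, no rdd action needed
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsets.foreach { o =>
        saveOffset(o.topic, o.partition, o.fromOffset)  // hypothetical driver-side client call
      }
      // ...then process rdd as usual
    }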

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Cody Koeninger
uot;500"); > props.put("zookeeper.sync.time.ms", "250"); > props.put("auto.commit.interval.ms", "1000"); > > > How can I do the same for the receiver inside spark-streaming for Spark V1.3.1 > > > Thanks > > Nipun > > > > On Wed, F

Re: Skip empty batches - spark streaming

2016-02-11 Thread Cody Koeninger
Please don't change the behavior of DirectKafkaInputDStream. Returning an empty rdd is (imho) the semantically correct thing to do, and some existing jobs depend on that behavior. If it's really an issue for you, you can either override directkafkainputdstream, or just check isEmpty as the first
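The isEmpty check is a one-liner, e.g.:

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // only touch external systems when the batch actually has data
        rdd.saveAsTextFile(s"/out/batch-${System.currentTimeMillis}")  // hypothetical sink
      }
    }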

Re: Kafka + Spark 1.3 Integration

2016-02-10 Thread Cody Koeninger
It's a pair because there's a key and value for each message. If you just want a single topic, put a single topic in the map of topic -> number of partitions. See https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java On
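For the receiver-based createStream in 1.3, a single topic looks like this (zookeeper quorum, group id, and thread count are illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // map of topic -> number of consumer threads for the receiver
    val topicMap = Map("mytopic" -> 2)
    val pairs = KafkaUtils.createStream(ssc, "zk1:2181", "my-group", topicMap)
    val values = pairs.map(_._2)  // drop the key if you only want message bodies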

[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-02-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139250#comment-15139250 ] Cody Koeninger commented on SPARK-3146: I think this can be safely closed, given the messageHandler

[jira] [Commented] (SPARK-13106) KafkaUtils.createDirectStream method with messageHandler and topics

2016-02-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139242#comment-15139242 ] Cody Koeninger commented on SPARK-13106: https://issues.apache.org/jira/browse/SPARK-10963

[jira] [Commented] (SPARK-7827) kafka streaming NotLeaderForPartition Exception could not be handled normally

2016-02-09 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139252#comment-15139252 ] Cody Koeninger commented on SPARK-7827: If there is no more information available on this, can we

Re: Kafka directsream receiving rate

2016-02-06 Thread Cody Koeninger
sages processed per event in >> sparkstreaming web UI . Also I am counting the messages inside >> foreachRDD . >> Removed the settings for backpressure but still the same . >> >> >> >> >> >> Sent from Samsung Mobile. >>

Re: Kafka directsream receiving rate

2016-02-05 Thread Cody Koeninger
If you're using the direct stream, you have 0 receivers. Do you mean you have 1 executor? Can you post the relevant call to createDirectStream from your code, as well as any relevant spark configuration? On Thu, Feb 4, 2016 at 8:13 PM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote:

Re: kafkaDirectStream usage error

2016-02-05 Thread Cody Koeninger
2 things: - you're only attempting to read from a single TopicAndPartition. Since your topic has multiple partitions, this probably isn't what you want - you're getting an offset out of range exception because the offset you're asking for doesn't exist in kafka. Use the other

Re: Kafka directsream receiving rate

2016-02-05 Thread Cody Koeninger
Partition=100" --driver-memory 2g > --executor-memory 1g --class com.tcs.dime.spark.SparkReceiver --files > /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/mapred-site.xml,/etc/hadoop/conf/yarn-site.xml,/etc/hive/conf/hive-site.xml > --jars

Re: Kafka directsream receiving rate

2016-02-05 Thread Cody Koeninger
am counting the messages inside > foreachRDD . > Removed the settings for backpressure but still the same . > > > > > > Sent from Samsung Mobile. > > > ---- Original message > From: Cody Koeninger <c...@koeninger.org> > Date:06/02/2016 00

Re: spark streaming web ui not showing the events - direct kafka api

2016-02-04 Thread Cody Koeninger
own issue ? > > On Wed, Jan 27, 2016 at 11:52 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> Have you tried spark 1.5? >> >> On Wed, Jan 27, 2016 at 11:14 AM, vimal dinakaran <vimal3...@gmail.com> >> wrote: >> >>> Hi , >>>

Re: question on spark.streaming.kafka.maxRetries

2016-02-03 Thread Cody Koeninger
KafkaRDD will use the standard kafka configuration parameter refresh.leader.backoff.ms if it is set in the kafkaParams map passed to createDirectStream. On Tue, Feb 2, 2016 at 9:10 PM, Chen Song wrote: > For Kafka direct stream, is there a way to set the time between
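So, as a sketch, the backoff goes in the same map as the broker list (values illustrative):

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",  // hypothetical brokers
      // standard kafka consumer setting, honored by KafkaRDD between leader retries
      "refresh.leader.backoff.ms" -> "1000"
    )
    // pass kafkaParams to KafkaUtils.createDirectStream as usual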

Re: Spark Streaming with Kafka - batch DStreams in memory

2016-02-02 Thread Cody Koeninger
It's possible you could (ab)use updateStateByKey or mapWithState for this. But honestly it's probably a lot more straightforward to just choose a reasonable batch size that gets you a reasonable file size for most of your keys, then use filecrush or something similar to deal with the hdfs small
