Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread Cody Koeninger
ometimes, it fails with StackOverFlow > Error. > > Why does the Streaming job not restart from checkpoint directory when the > job failed earlier with Kafka Brokers getting messed up? We have the > checkpoint directory in our hdfs. > > On Mon, Nov 9, 2015 at 12:34 PM, Cody Koening

Re: Checkpointing an InputDStream from Kafka

2015-11-06 Thread Cody Koeninger
Have you looked at the driver and executor logs? Without being able to see what's in the "do stuff with the dstream" section of code... I'd suggest starting with a simpler job, e.g that does nothing but print each message, and verify whether it checkpoints On Fri, Nov 6, 2015 at 3:59 AM, Kathi

Re: Regarding The Kafka Offset Management Issue In Direct Stream Approach.

2015-11-06 Thread Cody Koeninger
Questions about Spark-kafka integration are better directed to the Spark user mailing list. I'm not 100% sure what you're asking. The spark createDirectStream api will not store any offsets internally, unless you enable checkpointing. On Sun, Nov 1, 2015 at 10:26 PM, Charan Ganga Phani

Re: Very slow performance on very small record counts

2015-11-03 Thread Cody Koeninger
; > > > For completeness’ sake: I am using the direct stream API. > > > > *From:* Cody Koeninger [mailto:c...@koeninger.org] > *Sent:* Saturday, October 31, 2015 2:00 PM > *To:* YOUNG, MATTHEW, T (Intel Corp) <matthew.t.yo...@intel.com> > *Subject:* Re: Very slow performan

Re: Exception while reading from kafka stream

2015-11-02 Thread Cody Koeninger
somedomain":8020; > [ERROR] [11/02/2015 12:59:08.102] [Executor task launch worker-0] > [akka.tcp://sparkDriver@10.125.4.200:34251/user/CoarseGrainedScheduler] > swallowing exception during message send > (akka.remote.RemoteTransportExceptionNoStackTrace) > > > > *Th

Re: Exception while reading from kafka stream

2015-10-30 Thread Cody Koeninger
Just put them all in one stream and switch processing based on the topic On Fri, Oct 30, 2015 at 6:29 AM, Ramkumar V wrote: > i want to join all those logs in some manner. That's what i'm trying to do. > > *Thanks*, > > > > On

Re: Need more tasks in KafkaDirectStream

2015-10-29 Thread Cody Koeninger
Consuming from kafka is inherently limited to using a number of consumer nodes less than or equal to the number of kafka partitions. If you think about it, you're going to be paying some network cost to repartition that data from a consumer to different processing nodes, regardless of what Spark

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread Cody Koeninger
If you're writing to s3, want to avoid small files, and don't actually need 3 minute latency... you may want to consider just running a regular spark job (using KafkaUtils.createRDD) at scheduled intervals rather than a streaming job. On Thu, Oct 29, 2015 at 8:16 AM, Sabarish Sasidharan <

Re: correct and fast way to stop streaming application

2015-10-27 Thread Cody Koeninger
If you want to make sure that your offsets are increasing without gaps... one way to do that is to enforce that invariant when you're saving to your database. That would probably mean using a real database instead of zookeeper though. On Tue, Oct 27, 2015 at 4:13 AM, Krot Viacheslav

Re: Regarding the Kafka offset management issue in Direct Stream Approach.

2015-10-26 Thread Cody Koeninger
Questions about spark's kafka integration should probably be directed to the spark user mailing list, not this one. I don't monitor kafka mailing lists as closely, for instance. For the direct stream, Spark doesn't keep any state regarding offsets, unless you enable checkpointing. Have you read

[jira] [Commented] (SPARK-11195) Exception thrown on executor throws ClassNotFound on driver

2015-10-23 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971783#comment-14971783 ] Cody Koeninger commented on SPARK-11195: I'm seeing this on 1.5.1 as well > Exception thr

[jira] [Commented] (SPARK-11211) Kafka - offsetOutOfRange forces to largest

2015-10-23 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971834#comment-14971834 ] Cody Koeninger commented on SPARK-11211: In an attempt to replicate, I used the following

Re: Sporadic error after moving from kafka receiver to kafka direct stream

2015-10-22 Thread Cody Koeninger
That sounds like a networking issue to me. Stuff to try - make sure every executor node can talk to every kafka broker on relevant ports - look at firewalls / network config. Even if you can make the initial connection, something may be happening after a while (we've seen ... "interesting"...

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-21 Thread Cody Koeninger
kasireddy < swethakasire...@gmail.com> wrote: > Hi Cody, > > What other options do I have other than monitoring and restarting the job? > Can the job recover automatically? > > Thanks, > Sweth > > On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger <c...@koeninger.org>

Re: Kafka Streaming and Filtering > 3000 partitons

2015-10-21 Thread Cody Koeninger
The rdd partitions are 1:1 with kafka topicpartitions, so you can use offsets ranges to figure out which topic a given rdd partition is for and proceed accordingly. See the kafka integration guide in the spark streaming docs for more details, or https://github.com/koeninger/kafka-exactly-once As

Re: Kafka Streaming and Filtering > 3000 partitons

2015-10-21 Thread Cody Koeninger
osr.fromOffset > > val end = osr.untilOffset > > > > // Now we know the topic name, we can filter something > > // Or could we have referenced the topic name from > > // offsetRanges(TaskContext.get.partitionId) earlier > > // before we enter

Re: Problems w/YARN Spark Streaming app reading from Kafka

2015-10-16 Thread Cody Koeninger
What do you mean by "the current documentation states it isn’t used"? http://spark.apache.org/docs/latest/configuration.html still lists the value and its meaning. As far as the issue you're seeing, are you measuring records by looking at logs, the spark ui, or actual downstream sinks of data?

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Cody Koeninger
Assumptions about locality in spark are not very reliable, regardless of what consumer you use. Even if you have locality preferences, and locality wait turned up really high, you still have to account for losing executors. On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Cody Koeninger
; with the direct stream approach? > > -greetz, Gerard. > > > On Wed, Oct 14, 2015 at 4:31 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> Assumptions about locality in spark are not very reliable, regardless of >> what consumer you use. Even if you

[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-11 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952299#comment-14952299 ] Cody Koeninger commented on SPARK-11045: "in Receiver based model, the number of parti

[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-11 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952373#comment-14952373 ] Cody Koeninger commented on SPARK-11045: What you're saying doesn't address my point regarding

[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-11 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952444#comment-14952444 ] Cody Koeninger commented on SPARK-11045: Direct stream partitions are 1:1 with kafka

[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951877#comment-14951877 ] Cody Koeninger commented on SPARK-11045: The comments regarding parallelism are not accurate

Re: Kafka streaming "at least once" semantics

2015-10-09 Thread Cody Koeninger
rocessing it more than once. Whether this happens via > checkpointing, WAL, or kafka offsets is irrelevant, as long as we don't > drop > data. :) > > A couple of things we've tried: > - Using the kafka direct stream API (via Cody Koeninger > < > https://github.com/koen

Re: does KafkaCluster can be public ?

2015-10-08 Thread Cody Koeninger
If anyone is interested in keeping tabs on it, the jira for this is https://issues.apache.org/jira/browse/SPARK-10963 On Wed, Oct 7, 2015 at 3:16 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote: > Thanks guys ! > > On Wed, Oct 7, 2015 at 1:41 AM, Cody Koeninger <c...@ko

Re: Streaming DirectKafka assertion errors

2015-10-08 Thread Cody Koeninger
It sounds like you moved the job from one environment to another? This may sound silly, but make sure (eg using lsof) the brokers the job is connecting to are actually the ones you expect. As far as the checkpoint goes, the log output should indicate whether the job is restoring from checkpoint.

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Cody Koeninger
d > totalizedMetrics42975002 > > What we observe is that the largest difference comes from the > materialization of the RDD. This pattern repeats cyclically one on, one off. > > Any ideas where to further look? > > kr, Gerard. > > > On Wed, Oct 7, 2015 at 1:33 AM, Tathaga

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-06 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945091#comment-14945091 ] Cody Koeninger commented on SPARK-5569: --- The gist I originally posted, linked at the top

Re: does KafkaCluster can be public ?

2015-10-06 Thread Cody Koeninger
I personally think KafkaCluster (or the equivalent) should be made public. When I'm deploying spark I just sed out the private[spark] and rebuild. There's a general reluctance to make things public due to backwards compatibility, but if enough people ask for it... ? On Tue, Oct 6, 2015 at 6:51

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Cody Koeninger
Can you say anything more about what the job is doing? First thing I'd do is try to get some metrics on the time taken by your code on the executors (e.g. when processing the iterator) to see if it's consistent between the two situations. On Tue, Oct 6, 2015 at 11:45 AM, Gerard Maas

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Cody Koeninger
; > Slow TaskFast Tasklocal computation347.6281.53spark computation6930263metric > collection70138wall clock process7000401total records processed42975002 > > (time in ms) > > kr, Gerard. > > > On Tue, Oct 6, 2015 at 8:01 PM, Cody Koeninger <c...@koeninger.org> wrote: > >&

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Cody Koeninger
'm not saying you >> should). >> >> Sent from my iPhone >> >> On 06 Oct 2015, at 22:17, Cody Koeninger <c...@koeninger.org> wrote: >> >> I'm not clear on what you're measuring. Can you post relevant code >> snippets including the measurement cod

Re: does KafkaCluster can be public ?

2015-10-06 Thread Cody Koeninger
s to the caller. I end up using getLatestLeaderOffsets to >> figure out how to initialize initial offsets. >> >> On Tue, Oct 6, 2015 at 3:24 PM, Cody Koeninger <c...@koeninger.org> >> wrote: >> > I personally think KafkaCluster (or the equivalent) should b

[jira] [Created] (SPARK-10963) Make KafkaCluster api public

2015-10-06 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-10963: -- Summary: Make KafkaCluster api public Key: SPARK-10963 URL: https://issues.apache.org/jira/browse/SPARK-10963 Project: Spark Issue Type: Improvement

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-02 Thread Cody Koeninger
abled=true >> >> >> This should automatically delay the next batch until the current one is >> processed, or at least create that balance over a few batches/periods >> between the consume/process rate vs ingestion rate. >> >> >> Nicu >> >&g

Re: Checkpointing is super slow

2015-10-02 Thread Cody Koeninger
Why are you sure it's checkpointing speed? Have you compared it against checkpointing to hdfs, s3, or local disk? On Fri, Oct 2, 2015 at 1:17 AM, Sourabh Chandak wrote: > Hi, > > I have a receiverless kafka streaming job which was started yesterday > evening and was

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger
r a spark job failure) > > > ----- Mail original - > De: "Cody Koeninger" <c...@koeninger.org> > À: "Nicolas Biau" <nib...@free.fr> > Cc: "user" <user@spark.apache.org> > Envoyé: Vendredi 2 Octobre 2015 17:43:41 > Objet: Re: Spa

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger
If you're using the receiver based implementation, and want more parallelism, you have to create multiple streams and union them together. Or use the direct stream. On Fri, Oct 2, 2015 at 10:40 AM, wrote: > Hello, > I have a job receiving data from kafka (4 partitions) and

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger
On Fri, Oct 2, 2015 at 11:36 AM, <nib...@free.fr> wrote: > Sorry, I just said that I NEED to manage offsets, so in case of Kafka > Direct Stream , how can I handle this ? > Update Zookeeper manually ? why not but any other solutions ? > > - Mail original - >

Re: Kafka Direct Stream

2015-10-01 Thread Cody Koeninger
You can get the topic for a given partition from the offset range. You can either filter using that; or just have a single rdd and match on topic when doing mapPartitions or foreachPartition (which I think is a better idea)

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Cody Koeninger
Did you check you kafka broker logs to see what was going on during that time? The direct stream will handle normal leader loss / rebalance by retrying tasks. But the exception you got indicates that something with kafka was wrong, such that offsets were being re-used. ie. your job already

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Cody Koeninger
That depends on your job, your cluster resources, the number of seconds per batch... You'll need to do some empirical work to figure out how many messages per batch a given executor can handle. Divide that by the number of seconds per batch. On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak

[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming

2015-09-30 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936866#comment-14936866 ] Cody Koeninger commented on SPARK-9472: --- You can get around this by passing in the hadoop context

Re: [streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Cody Koeninger
Offset out of range means the message in question is no longer available on Kafka. What's your kafka log retention set to, and how does that compare to your processing time? On Wed, Sep 30, 2015 at 4:26 AM, Alexey Ponkin wrote: > Hi > > I have simple spark-streaming job(8

[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming

2015-09-30 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937483#comment-14937483 ] Cody Koeninger commented on SPARK-9472: --- I'd suggest patching your distribution then, I doubt

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Cody Koeninger
Try writing and reading to the topics in question using the kafka command line tools, to eliminate your code as a variable. That number of partitions is probably more than sufficient: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Re: Spark Streaming many subscriptions vs many jobs

2015-09-29 Thread Cody Koeninger
There isn't an easy way of ensuring delivery semantics for producing to kafka (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish). If there's only one logical consumer of the intermediate state, I wouldn't write it back to kafka, i'd just keep it in a single spark

Re: Spark-Kafka Connector issue

2015-09-29 Thread Cody Koeninger
> > Sent from Outlook <http://taps.io/outlookmobile> > > _____ > From: Cody Koeninger <c...@koeninger.org> > Sent: Tuesday, September 29, 2015 12:33 am > Subject: Re: Spark-Kafka Connector issue > To: Ratika Prasad <rpra...@couponsinc.

Re: Adding / Removing worker nodes for Spark Streaming

2015-09-28 Thread Cody Koeninger
ep 28, 2015 at 11:43 AM, Augustus Hong <augus...@branchmetrics.io > > wrote: > >> Got it, thank you! >> >> >> On Mon, Sep 28, 2015 at 11:37 AM, Cody Koeninger <c...@koeninger.org> >> wrote: >> >>> Losing worker nodes without stopping is de

Re: Spark-Kafka Connector issue

2015-09-28 Thread Cody Koeninger
This is a user list question not a dev list question. Looks like your driver is having trouble communicating to the kafka brokers. Make sure the broker host and port is available from the driver host (using nc or telnet); make sure that you're providing the _broker_ host and port to

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Cody Koeninger
ou may have a bug that prevents the state to be saved and you >>can’t restart the app w/o upgrade >> >> Less than ideal, yes :) >> >> -adrian >> >> From: Radu Brumariu >> Date: Friday, September 25, 2015 at 1:31 AM >> To: Cody Koeninger >

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
; job to be killed because of one failing customer,and affect others in the > same job. Hope that makes sense. > > thnx > > On Fri, Sep 25, 2015 at 12:52 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> Your success case will work fine, it is a 1-1 mapping as you sai

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
nd to a task > inside the task execution code, in cases where the intermediate operations > do not change partitions, shuffle etc. > > -neelesh > > On Fri, Sep 25, 2015 at 11:14 AM, Cody Koeninger <c...@koeninger.org> > wrote: > >> >> http://spark.apache.o

Re: kafka direct streaming with checkpointing

2015-09-25 Thread Cody Koeninger
rying to say, is that this particular situation is part of the > general checkpointing use case, not an edge case. > I would like to understand why shouldn't the checkpointing mechanism, > already existent in Spark, handle this situation too ? > > On Fri, Sep 25, 2015 at 12:20 PM, Cody Ko

Re: Kafka & Spark Streaming

2015-09-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers also has an example of how to close over the offset ranges so they are available on executors. On Fri, Sep 25, 2015 at 12:50 PM, Neelesh wrote: > Hi, >We are

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-25 Thread Cody Koeninger
$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > Thanks, > Sourabh > > On Thu, Sep 24, 2015 at 2:04 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> That looks like the OOM is in the driver, when getting pa

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-25 Thread Cody Koeninger
$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > Thanks, > Sourabh > > On Thu, Sep 24, 2015 at 2:04 PM, Cody Koeninger <c...@koeninger.org> > wrote: > >> That looks like the OOM is in the driver, when getting pa

Re: kafka direct streaming with checkpointing

2015-09-24 Thread Cody Koeninger
No, you cant use checkpointing across code changes. Either store offsets yourself, or start up your new app code and let it catch up before killing the old one. On Thu, Sep 24, 2015 at 8:40 AM, Radu Brumariu wrote: > Hi, > in my application I use Kafka direct streaming and I

Re: Reading avro data using KafkaUtils.createDirectStream

2015-09-24 Thread Cody Koeninger
Your third and fourth type parameters need to be subclasses of kafka.serializer.Decoder On Thu, Sep 24, 2015 at 10:30 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm trying to use KafkaUtils.createDirectStream to read avro messages from > Kafka but something is off with my

Re: kafka direct streaming with checkpointing

2015-09-24 Thread Cody Koeninger
using Kafka. > Is there a ticket to add this sort of semantics to checkpointing ? Does it > even make sense to add it there ? > > Thanks, > Radu > > > On Thursday, September 24, 2015, Cody Koeninger <c...@koeninger.org> > wrote: > >> No, you cant use checkpointing

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Cody Koeninger
That looks like the OOM is in the driver, when getting partition metadata to create the direct stream. In that case, executor memory allocation doesn't matter. Allocate more driver memory, or put a profiler on it to see what's taking up heap. On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak

Re: ERROR BoundedByteBufferReceive: OOME with size 352518400

2015-09-24 Thread Cody Koeninger
That looks like the OOM is in the driver, when getting partition metadata to create the direct stream. In that case, executor memory allocation doesn't matter. Allocate more driver memory, or put a profiler on it to see what's taking up heap. On Thu, Sep 24, 2015 at 3:51 PM, Sourabh Chandak

Re: Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-23 Thread Cody Koeninger
TD can correct me on this, but I believe checkpointing is done after a set of jobs is submitted, not after they are completed. If you fail while processing the jobs, starting over from that checkpoint should put you in the correct state. In any case, are you actually observing a loss of messages

[jira] [Comment Edited] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903231#comment-14903231 ] Cody Koeninger edited comment on SPARK-10732 at 9/22/15 7:02 PM: - Yeah

[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903231#comment-14903231 ] Cody Koeninger commented on SPARK-10732: Yeah, even if that gets implemented it will likely

[jira] [Commented] (SPARK-10734) DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest offset, however using the batch time would be more desireable.

2015-09-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903106#comment-14903106 ] Cody Koeninger commented on SPARK-10734: as I explained in SPARK-10732 , kafka's getOffsetsBefore

[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903110#comment-14903110 ] Cody Koeninger commented on SPARK-10732: As I already said, kafka's implementation

[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902648#comment-14902648 ] Cody Koeninger commented on SPARK-10732: It doesn't make sense to decouple this from

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-22 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902692#comment-14902692 ] Cody Koeninger commented on SPARK-5569: --- I'm still not clear on what you're doing

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901106#comment-14901106 ] Cody Koeninger commented on SPARK-5569: --- The direct stream doesn't save offset ranges, it saves

Re: passing SparkContext as parameter

2015-09-21 Thread Cody Koeninger
That isn't accurate, I think you're confused about foreach. Look at http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Mon, Sep 21, 2015 at 7:36 AM, Romi Kuntsman wrote: > foreach is something that runs on the

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901364#comment-14901364 ] Cody Koeninger commented on SPARK-5569: --- You can typically use sbt's merge strategy to deal

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Cody Koeninger
The direct stream already uses the kafka leader for a given partition as the preferred location. I don't run kafka on the same nodes as spark, and I don't know anyone who does, so that situation isn't particularly well tested. On Mon, Sep 21, 2015 at 1:15 PM, Ashish Soni

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-09-21 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901435#comment-14901435 ] Cody Koeninger commented on SPARK-5569: --- Yeah, spark-streaming can be marked as provided. Try

Re: Spark Streaming application code change and stateful transformations

2015-09-17 Thread Cody Koeninger
ialRdd with the values preloaded from DB > 2. By cleaning the checkpoint in between upgrades, data is loaded > only once > > Hope this helps, > -adrian > > From: Ofir Kerker > Date: Wednesday, September 16, 2015 at 6:12 PM > To: Cody Koeninger > Cc: "user@s

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-17 Thread Cody Koeninger
Is there a particular reason you're calling checkpoint on the stream in addition to the streaming context? On Thu, Sep 17, 2015 at 2:36 PM, Petr Novak wrote: > Hi all, > it throws FileBasedWriteAheadLogReader: Error reading next item, EOF > reached > java.io.EOFException >

Re: Spark streaming to database exception handling

2015-09-17 Thread Cody Koeninger
If you fail the task (throw an exception) it will be retried On Thu, Sep 17, 2015 at 4:56 PM, david w wrote: > I am using spark stream to receive data from kafka, and then write result > rdd > to external database inside foreachPartition(). All thing works fine, my > question

[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-09-16 Thread Cody Koeninger (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790843#comment-14790843 ] Cody Koeninger commented on SPARK-10320: I don't think there's much benefit to multiple dstreams

Re: unoin streams not working for streams > 3

2015-09-15 Thread Cody Koeninger
I assume you're using the receiver based stream (createStream) rather than createDirectStream? Receivers each get scheduled as if they occupy a core, so you need at least one more core than number of receivers if you want to get any work done. Try using the direct stream if you can't combine

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Cody Koeninger
; what kind of parameters are you guys loading from the configuration file, > when using checkpoints. > > I really appreciate all the help on this. > > Many thanks, > > Ricardo > > > > > > > > > On Fri, Sep 11, 2015 at 11:09 AM, Cody Koeninger <

Re: Spark Streaming application code change and stateful transformations

2015-09-14 Thread Cody Koeninger
Solution 2 sounds better to me. You aren't always going to have graceful shutdowns. On Mon, Sep 14, 2015 at 1:49 PM, Ofir Kerker wrote: > Hi, > My Spark Streaming application consumes messages (events) from Kafka every > 10 seconds using the direct stream approach and

Re: New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Cody Koeninger
No, in general you can't make new RDDs in code running on the executors. It looks like your properties file is a constant, why not process it at the beginning of the job and broadcast the result? On Fri, Sep 11, 2015 at 2:09 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: >

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-11 Thread Cody Koeninger
iewsHourlyCount.filter(_._2 >= > *minPageview*) > permalinkAudienceStream.map(a => s"${a._1}\t${a._2}") > .repartition(1) > .saveAsTextFiles(DESTINATION_FILE, "txt") > > } > > I really

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Cody Koeninger
Post the actual stacktrace you're getting On Thu, Sep 10, 2015 at 12:21 AM, Shushant Arora wrote: > Executors in spark streaming 1.3 fetch messages from kafka in batches and > what happens when executor takes longer time to complete a fetch batch > > say in > > >

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Cody Koeninger
The kafka direct stream meets those requirements. You don't need checkpointing for exactly-once. Indeed, unless your output operations are idempotent, you can't get exactly-once if you're relying on checkpointing. Instead, you need to store the offsets atomically in the same transaction as your

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Cody Koeninger
that something is screwed up and you need to reprocess. On Thu, Sep 10, 2015 at 9:17 AM, Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote: > Thanks guys for your answers. I put my answers in text, below. > > Cheers, > Krzysztof Zarzycki > > 2015-09-10 15:39 GMT+02:00 Cody Koeni

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Cody Koeninger
icname,177], [topicname,105], [topicname,171], [topicname,141], > [topicname,285], [topicname,27], [topicname,168], [topicname,267], > [topicname,213], [topicname,153], [topicname,138], [topicname,255], > [topicname,222], [topicname,243], [topicname,261], [topicname,90], > [topicname,11

Re: spark streaming 1.3 with kafka connection timeout

2015-09-10 Thread Cody Koeninger
ws Exception { > v1.foreachPartition(transformations); > return null; > } > }); > > > In KafkaStreamTransformations : > > > @Override > public void call(Iterator<byte[][]> t) throws Exception { > try{ > while(t.ha

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Cody Koeninger
m, so can not comment on performance benefits >> of one over other , but whatever performance benchmark we have done, >> dibbhatt/kafka-spark-consumer stands out.. >> >> Regards, >> Dibyendu >> >> On Thu, Sep 10, 2015 at 8:08 PM, Cody Koeninger <c...@koe

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Cody Koeninger
ectStream instead of union of three but it is > still growing exponential with window method. > > -- > Яндекс.Почта — надёжная почта > http://mail.yandex.ru/neo2/collect/?exp=1=1 > > > 08.09.2015, 23:53, "Cody Koeninger" <c...@koeninger.org>: > > Yeah,

Re: Java vs. Scala for Spark

2015-09-09 Thread Cody Koeninger
Java 8 lambdas are broken to the point of near-uselessness (because of checked exceptions and inability to close over non-final references). I wouldn't use them as a deciding factor in language choice. Any competent developer should be able to write reasonable java-in-scala after a week and

Re: Spark streaming -> cassandra : Fault Tolerance

2015-09-09 Thread Cody Koeninger
It's been a while since I've looked at the cassandra connector, so I can't give you specific advice on it. But in general, if a spark task fails (uncaught exception), it will be retried automatically. In the case of the kafka direct stream rdd, it will have exactly the same messages as the first

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
{ > case "topicA" => ... > case "topicB" => ... > case _ => > } > ) > } > > > > > 08.09.2015, 19:21, "Cody Koeninger" <c...@koeninger.org>: > > That doesn't really

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Cody Koeninger
Have you tried deleting or moving the contents of the checkpoint directory and restarting the job? On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg wrote: > Sorry, more relevant code below: > > SparkConf sparkConf = createSparkConf(appName, kahunaEnv); >

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Cody Koeninger
Yes, local directories will be sufficient On Sat, Sep 5, 2015 at 10:44 AM, N B wrote: > Hi TD, > > Thanks! > > So our application does turn on checkpoints but we do not recover upon > application restart (we just blow the checkpoint directory away first and > re-create the

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Cody Koeninger
quite > strange given that the jobs used to fire every 1 second, we've switched to > 10, now trying to switch to 20 and batch duration millis is not changing. > > Does anything stand out in the code perhaps? > > On Tue, Sep 8, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
artitionBy the same partitioner > ) instead of UnionRDD. > > Nonetheless I did not see any hints about such a bahaviour in doc. > Is it a bug or absolutely normal behaviour? > > > > > > 08.09.2015, 17:03, "Cody Koeninger" <c...@koeninger.org>

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
need to cast them(using some > business rules) to pairs and unite. > > -- > Яндекс.Почта — надёжная почта > http://mail.yandex.ru/neo2/collect/?exp=1=1 > > > 08.09.2015, 19:11, "Cody Koeninger" <c...@koeninger.org>: > > I'm not 100% sure what's go

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Can you provide more info (what version of spark, code example)? On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin wrote: > Hi, > > I have an application with 2 streams, which are joined together. > Stream1 - is simple DStream(relativly small size batch chunks) > Stream2 - is a

Re: Different Kafka createDirectStream implementations

2015-09-08 Thread Cody Koeninger
If you're providing starting offsets explicitly, then auto offset reset isn't relevant. On Tue, Sep 8, 2015 at 11:44 AM, Dan Dutrow wrote: > The two methods of createDirectStream appear to have different > implementations, the second checks the offset.reset flags and does

<    5   6   7   8   9   10   11   12   13   14   >