Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Mario Ds Briggs
Look at org.apache.spark.streaming.scheduler.JobGenerator: it has a RecurringTimer (timer) that will simply post 'JobGenerate' events to an EventLoop at each batchInterval. This EventLoop's thread then picks up these events and uses the streamingContext.graph to generate a Job (InputDstream's
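The timer-plus-event-loop flow described above can be modeled in a few lines. This is an illustrative sketch in plain Java, not Spark's actual JobGenerator (class and method names below are invented): a scheduled task posts one event per batch interval onto a queue, and a single consumer drains the events in order, standing in for job generation.

```java
import java.util.concurrent.*;

// Simplified model (not Spark source) of the JobGenerator pattern:
// a recurring timer posts a batch-time event onto a queue at every
// batch interval, and one event-loop thread drains them in order.
public class EventLoopSketch {
    // Runs the timer + event loop until n events have been handled;
    // returns how many events were processed (== n on success).
    static int runBatches(int n, long intervalMs) {
        BlockingQueue<Long> events = new LinkedBlockingQueue<>(); // event-loop mailbox
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // The "RecurringTimer": post the current time every interval.
        timer.scheduleAtFixedRate(
            () -> events.offer(System.currentTimeMillis()),
            0, intervalMs, TimeUnit.MILLISECONDS);
        int processed = 0;
        try {
            while (processed < n) {
                events.take(); // the "EventLoop" thread picks up one event...
                processed++;   // ...and would generate a Job for that batch time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("handled " + runBatches(3, 10) + " generate-job events");
    }
}
```

The key property, as in the real JobGenerator, is that event generation (the timer) and event handling (the loop) are decoupled by a queue, so a slow handler delays jobs rather than dropping them.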

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Hi Cody, let me try once again to explain with an example. In Spark's BatchInfo class, "scheduling delay" is defined as *def schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)*. I am dumping the BatchInfo object in my LatencyListener, which extends StreamingListener. batchTime
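As a sketch of that definition, the delay is simply the gap between job-set submission and the start of processing, and is empty while the batch is still waiting. The class below is an illustrative plain-Java mirror of the Scala formula, not Spark's actual BatchInfo:

```java
import java.util.OptionalLong;

// Illustrative re-implementation (not Spark source) of BatchInfo's
// schedulingDelay: the gap between when a batch's job set was submitted
// to the scheduler and when processing actually started.
public class BatchInfoSketch {
    final long submissionTime;              // ms epoch: job set handed to scheduler
    final OptionalLong processingStartTime; // ms epoch: first job started, if any

    BatchInfoSketch(long submissionTime, OptionalLong processingStartTime) {
        this.submissionTime = submissionTime;
        this.processingStartTime = processingStartTime;
    }

    // Mirrors: def schedulingDelay = processingStartTime.map(_ - submissionTime)
    OptionalLong schedulingDelay() {
        return processingStartTime.isPresent()
            ? OptionalLong.of(processingStartTime.getAsLong() - submissionTime)
            : OptionalLong.empty();
    }

    public static void main(String[] args) {
        BatchInfoSketch b = new BatchInfoSketch(1_000, OptionalLong.of(1_250));
        System.out.println("schedulingDelay = " + b.schedulingDelay().getAsLong() + " ms");
    }
}
```

Note that neither value refers to the batch's nominal batchTime; the delay only measures waiting inside the scheduler.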

[RESULT] [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
This vote passes with nine +1s (five binding) and one binding +0! Thanks to everyone who tested/voted. I'll start work on publishing the release today. +1: Mark Hamstra* Moshe Eshel Egor Pahomov Reynold Xin* Yin Huai* Andrew Or* Burak Yavuz Kousuke Saruta Michael Armbrust* 0: Sean Owen* -1:

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
+1 - Ported all our internal jobs to run on 1.6.1 with no regressions. On Wed, Mar 9, 2016 at 7:04 AM, Kousuke Saruta wrote: > +1 (non-binding) > > > On 2016/03/09 4:28, Burak Yavuz wrote: > > +1 > > On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or

Re: Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Sean Owen
Oh yeah I already added it after your earlier message, have a look. On Wed, Mar 9, 2016 at 7:45 PM, Mohammed Guller wrote: > My book on Spark was recently published. I would like to request it to be > added to the Books section on Spark's website. > > > > Here are the

Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Mohammed Guller
My book on Spark was recently published. I would like to request that it be added to the Books section on Spark's website. Here are the details about the book. Title: Big Data Analytics with Spark Author: Mohammed Guller Link:

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
Yeah, to be clear, I'm talking about having only one constructor for a direct stream, that will give you a stream of ConsumerRecord. Different needs for topic subscription, starting offsets, etc could be handled by calling appropriate methods after construction but before starting the stream.
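The shape being proposed could look roughly like this. Everything below is hypothetical — DirectStreamSketch and its method names are invented for illustration, not an actual Spark or Kafka API — but it shows the one-constructor idea: subscription and starting offsets are set via methods that are only legal between construction and start().

```java
import java.util.*;

// Hypothetical sketch of a direct stream with a single constructor,
// configured by method calls before the stream is started.
public class DirectStreamSketch {
    private final List<String> topics = new ArrayList<>();
    private final Map<Integer, Long> startingOffsets = new HashMap<>();
    private boolean started = false;

    DirectStreamSketch subscribe(String... ts) {
        requireNotStarted();
        topics.addAll(Arrays.asList(ts));
        return this;
    }

    DirectStreamSketch startFromOffset(int partition, long offset) {
        requireNotStarted();
        startingOffsets.put(partition, offset);
        return this;
    }

    void start() {
        requireNotStarted();
        started = true; // from here on, configuration is frozen
    }

    private void requireNotStarted() {
        if (started) throw new IllegalStateException("configure before start()");
    }

    boolean isStarted() { return started; }
}
```

One constructor with post-construction configuration keeps the API surface small while still covering the different subscription and offset needs the thread mentions.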

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Alan Braithwaite
I'd probably prefer to keep it the way it is, unless it's becoming more like the function without the messageHandler argument. Right now I have code like this, but I wish it were more similar looking: if (parsed.partitions.isEmpty()) { JavaPairInputDStream

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
I'm really not sure what you're asking. On Wed, Mar 9, 2016 at 12:43 PM, Sachin Aggarwal wrote: > where are we capturing this delay? > I am aware of scheduling delay which is defined as processing > time-submission time not the batch create time > > On Wed, Mar 9,

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
where are we capturing this delay? I am aware of scheduling delay, which is defined as processing start time minus submission time, not the batch create time. On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger wrote: > Spark streaming by default will not start processing a batch until the >

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
Spark streaming by default will not start processing a batch until the current batch is finished. So if your processing time is larger than your batch time, delays will build up. On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal wrote: > Hi All, > > we have batchTime
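That build-up is easy to see with a back-of-the-envelope model. Assuming, purely for illustration, that every batch takes the same processing time and batches are submitted exactly one interval apart, the n-th batch's scheduling delay grows linearly once processing time exceeds the interval:

```java
// Toy model (made-up numbers, not Spark code) of delay build-up when
// only one batch is processed at a time.
public class DelayBuildupSketch {
    // Scheduling delay (ms) of the n-th batch (0-based), when every batch
    // takes processingMs and batches are submitted every intervalMs.
    static long schedulingDelay(int n, long intervalMs, long processingMs) {
        long submission = n * intervalMs;   // batch n is submitted here
        long prevFinish = n * processingMs; // the previous n batches finish here
        return Math.max(0, prevFinish - submission);
    }

    public static void main(String[] args) {
        // 1s batches that take 1.5s each: delay grows by 0.5s per batch.
        for (int n = 0; n < 4; n++) {
            System.out.println("batch " + n + " delay = "
                + schedulingDelay(n, 1000, 1500) + " ms");
        }
    }
}
```

With a 1000 ms interval and 1500 ms processing time the delay climbs 500 ms per batch without bound, whereas any processing time at or below the interval keeps the delay at zero.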

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Kousuke Saruta
+1 (non-binding) On 2016/03/09 4:28, Burak Yavuz wrote: +1 On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: +1 2016-03-08 10:59 GMT-08:00 Yin Huai: +1

RE: Spark Scheduler creating Straggler Node

2016-03-09 Thread Ioannis.Deligiannis
It would be nice to have a code-configurable fall-back plan for such cases. Any generalized solution can cause problems elsewhere. Simply replicating hot cached blocks would be complicated to maintain and could cause OOME. In the case I described on the JIRA, the hot partition will be changing


Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-09 Thread Hyukjin Kwon
This discussion has moved to the JIRA; please refer to the JIRA if you are interested in this. On 9 Mar 2016 6:31 p.m., "Sean Owen" wrote: > From your JIRA, it seems like you're referring to the "part-*" files. > These files are effectively an internal representation, and I

Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-09 Thread Sean Owen
From your JIRA, it seems like you're referring to the "part-*" files. These files are effectively an internal representation, and I would not expect them to have such an extension. For example, you're not really guaranteed that the way the data breaks up leaves each file a valid JSON doc. On