Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-09 Thread Sean Owen
From your JIRA, it seems like you're referring to the "part-*" files. These files are effectively an internal representation, and I would not expect them to have such an extension. For example, you're not really guaranteed that the way the data breaks up leaves each file a valid JSON doc. On Wed,

Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-09 Thread Hyukjin Kwon
This discussion is moving to the JIRA. Please refer to the JIRA if you are interested in this. On 9 Mar 2016 6:31 p.m., "Sean Owen" wrote: > From your JIRA, it seems like you're referring to the "part-*" files. > These files are effectively an internal representation, and I would > not expect them

RE: Spark Scheduler creating Straggler Node

2016-03-09 Thread Ioannis.Deligiannis
It would be nice to have a code-configurable fall-back plan for such cases. Any generalized solution can cause problems elsewhere. Simply replicating hot cached blocks would be complicated to maintain and could cause OOME. In the case I described on the JIRA, the hot partition will be changing

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Kousuke Saruta
+1 (non-binding) On 2016/03/09 4:28, Burak Yavuz wrote: +1 On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: +1 2016-03-08 10:59 GMT-08:00 Yin Huai (yh...@databricks.com): +1 On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin

submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Hi All, we have batchTime and submissionTime. @param batchTime Time of the batch @param submissionTime Clock time of when jobs of this batch was submitted to the streaming scheduler queue 1) we are seeing difference between batchTime and submissionTime for small batches (300 ms) even in minute

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
Spark streaming by default will not start processing a batch until the current batch is finished. So if your processing time is larger than your batch time, delays will build up. On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal wrote: > Hi All, > > we have batchTime and submissionTime. > > @para
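The delay buildup Cody describes can be illustrated with a toy simulation (not Spark code; a Python sketch under the stated assumption that each batch must finish before the next one starts): batches are created every 300 ms, but each takes 400 ms to process, so the scheduling delay grows by roughly 100 ms per batch.

```python
# Toy simulation of back-pressure-free micro-batching: batches are created
# every batch_interval_ms, but each takes processing_time_ms to process, and
# a batch cannot start until the previous one has finished.
batch_interval_ms = 300
processing_time_ms = 400

def scheduling_delays(num_batches):
    delays = []
    ready_at = 0  # time at which the scheduler is free for the next batch
    for i in range(num_batches):
        batch_time = i * batch_interval_ms    # when the batch is created
        start = max(batch_time, ready_at)     # wait for the prior batch
        delays.append(start - batch_time)     # scheduling delay of this batch
        ready_at = start + processing_time_ms
    return delays

print(scheduling_delays(5))  # -> [0, 100, 200, 300, 400]
```

With processing time larger than the batch interval, the delay never recovers; it grows linearly, which matches the behavior described above.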

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Where are we capturing this delay? I am aware of scheduling delay, which is defined as processing time minus submission time, not the batch create time. On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger wrote: > Spark streaming by default will not start processing a batch until the > current batch is finis

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
I'm really not sure what you're asking. On Wed, Mar 9, 2016 at 12:43 PM, Sachin Aggarwal wrote: > where are we capturing this delay? > I am aware of scheduling delay which is defined as processing > time-submission time not the batch create time > > On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Alan Braithwaite
I'd probably prefer to keep it the way it is, unless it's becoming more like the function without the messageHandler argument. Right now I have code like this, but I wish it were more similar looking: if (parsed.partitions.isEmpty()) { JavaPairInputDStream kvstream = KafkaUtils

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
Yeah, to be clear, I'm talking about having only one constructor for a direct stream, that will give you a stream of ConsumerRecord. Different needs for topic subscription, starting offsets, etc could be handled by calling appropriate methods after construction but before starting the stream. On

Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Mohammed Guller
My book on Spark was recently published. I would like to request it to be added to the Books section on Spark's website. Here are the details about the book. Title: Big Data Analytics with Spark Author: Mohammed Guller Link: www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/

Re: Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Sean Owen
Oh yeah I already added it after your earlier message, have a look. On Wed, Mar 9, 2016 at 7:45 PM, Mohammed Guller wrote: > My book on Spark was recently published. I would like to request it to be > added to the Books section on Spark's website. > > > > Here are the details about the book. > >

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
+1 - Ported all our internal jobs to run on 1.6.1 with no regressions. On Wed, Mar 9, 2016 at 7:04 AM, Kousuke Saruta wrote: > +1 (non-binding) > > > On 2016/03/09 4:28, Burak Yavuz wrote: > > +1 > > On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: > >> +1 >> >> 2016-03-08 10:59 GMT-08:00 Yin

[RESULT] [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
This vote passes with nine +1s (five binding) and one binding +0! Thanks to everyone who tested/voted. I'll start work on publishing the release today. +1: Mark Hamstra* Moshe Eshel Egor Pahomov Reynold Xin* Yin Huai* Andrew Or* Burak Yavuz Kousuke Saruta Michael Armbrust* 0: Sean Owen* -1: (

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Hi Cody, let me try once again to explain with an example. In Spark's BatchInfo class, "scheduling delay" is defined as *def schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)* I am dumping the BatchInfo object in my LatencyListener, which extends StreamingListener. batchTime
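The quoted Scala definition can be restated in a minimal Python analogue (a sketch, not Spark's implementation; times here are assumed to be epoch milliseconds). The point of the example is that batchTime, when the batch was created, does not appear in the formula at all, which is what the question above is getting at.

```python
# Python analogue of BatchInfo.schedulingDelay from the Scala snippet above:
# processingStartTime.map(_ - submissionTime), i.e. None until processing
# has actually started, otherwise the gap between submission and start.
def scheduling_delay(processing_start_time, submission_time):
    if processing_start_time is None:
        return None  # Option[Long] = None: batch not yet started
    return processing_start_time - submission_time

# The batch-creation time (batchTime) is deliberately absent: any gap
# between batchTime and submissionTime is not counted in this metric.
print(scheduling_delay(1457540000400, 1457540000300))  # -> 100
print(scheduling_delay(None, 1457540000300))           # -> None
```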

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Mario Ds Briggs
Look at org.apache.spark.streaming.scheduler.JobGenerator: it has a RecurringTimer (timer) that will simply post 'JobGenerate' events to an EventLoop at the batchInterval time. This EventLoop's thread then picks up these events and uses the streamingContext.graph to generate a Job (InputDstream's
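The timer-plus-event-loop pattern described above can be sketched as follows (a toy Python model, not Spark code; the names `recurring_timer` and `event_loop` are illustrative stand-ins for RecurringTimer and EventLoop, and real Spark fires the timer on a wall clock rather than enumerating batch times up front):

```python
# Toy model of the JobGenerator pattern: a timer posts "GenerateJobs" events
# for each batch time onto a queue, and a single event-loop thread drains
# them in order, generating one job per batch time.
import queue
import threading

def recurring_timer(event_queue, interval_ms, num_ticks):
    # Stand-in for RecurringTimer: one event per batch interval.
    for i in range(num_ticks):
        event_queue.put(("GenerateJobs", i * interval_ms))
    event_queue.put(None)  # sentinel: stop the event loop

def event_loop(event_queue, generated):
    # Stand-in for the EventLoop thread processing posted events in order.
    while True:
        event = event_queue.get()
        if event is None:
            break
        _kind, batch_time = event
        generated.append(batch_time)  # stand-in for graph.generateJobs(time)

events = queue.Queue()
generated = []
worker = threading.Thread(target=event_loop, args=(events, generated))
worker.start()
recurring_timer(events, 300, 3)
worker.join()
print(generated)  # -> [0, 300, 600]
```

Because a single thread drains the queue, job generation for batch N+1 cannot overtake batch N, which is consistent with the in-order behavior discussed earlier in this thread.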