Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-09 Thread Sean Owen
From your JIRA, it seems like you're referring to the "part-*" files. These files are effectively an internal representation, and I would not expect them to have such an extension. For example, you're not really guaranteed that the way the data breaks up leaves each file a valid JSON doc. On Wed,

Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-09 Thread Hyukjin Kwon
This discussion is moving to the JIRA. Please refer to the JIRA if you are interested in this. On 9 Mar 2016 6:31 p.m., "Sean Owen" wrote: > From your JIRA, it seems like you're referring to the "part-*" files. > These files are effectively an internal representation, and I would > not expect them

RE: Spark Scheduler creating Straggler Node

2016-03-09 Thread Ioannis.Deligiannis
It would be nice to have a code-configurable fall-back plan for such cases. Any generalized solution can cause problems elsewhere. Simply replicating hot cached blocks would be complicated to maintain and could cause OOME. In the case I described on the JIRA, the hot partition will be changing

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Kousuke Saruta
+1 (non-binding) On 2016/03/09 4:28, Burak Yavuz wrote: +1 On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: +1 2016-03-08 10:59 GMT-08:00 Yin Huai (yh...@databricks.com): +1 On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin

submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Hi All, we have batchTime and submissionTime. @param batchTime Time of the batch @param submissionTime Clock time of when jobs of this batch was submitted to the streaming scheduler queue 1) we are seeing difference between batchTime and submissionTime for small batches (300 ms) even in minute

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
Spark streaming by default will not start processing a batch until the current batch is finished. So if your processing time is larger than your batch time, delays will build up. On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal wrote: > Hi All, > > we have batchTime and submissionTime. > > @para
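The delay buildup Cody describes can be illustrated with a toy simulation (not Spark code; a Python sketch under the stated assumption that each batch must finish before the next one starts): batches are created every 300 ms, but each takes 400 ms to process, so the scheduling delay grows by roughly 100 ms per batch.

```python
# Toy simulation of back-pressure-free micro-batching: batches are created
# every batch_interval_ms, but each takes processing_time_ms to process, and
# a batch cannot start until the previous one has finished.
batch_interval_ms = 300
processing_time_ms = 400

def scheduling_delays(num_batches):
    delays = []
    ready_at = 0  # time at which the scheduler is free for the next batch
    for i in range(num_batches):
        batch_time = i * batch_interval_ms    # when the batch is created
        start = max(batch_time, ready_at)     # wait for the prior batch
        delays.append(start - batch_time)     # scheduling delay of this batch
        ready_at = start + processing_time_ms
    return delays

print(scheduling_delays(5))  # -> [0, 100, 200, 300, 400]
```

With processing time larger than the batch interval, the delay never recovers; it grows linearly, which matches the behavior described above.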

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Where are we capturing this delay? I am aware of scheduling delay, which is defined as processing time minus submission time, not the batch create time. On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger wrote: > Spark streaming by default will not start processing a batch until the > current batch is finis

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
I'm really not sure what you're asking. On Wed, Mar 9, 2016 at 12:43 PM, Sachin Aggarwal wrote: > where are we capturing this delay? > I am aware of scheduling delay which is defined as processing > time-submission time not the batch create time > > On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Alan Braithwaite
I'd probably prefer to keep it the way it is, unless it's becoming more like the function without the messageHandler argument. Right now I have code like this, but I wish it were more similar looking: if (parsed.partitions.isEmpty()) { JavaPairInputDStream kvstream = KafkaUtils

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
Yeah, to be clear, I'm talking about having only one constructor for a direct stream, that will give you a stream of ConsumerRecord. Different needs for topic subscription, starting offsets, etc could be handled by calling appropriate methods after construction but before starting the stream. On

Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Mohammed Guller
My book on Spark was recently published. I would like to request it to be added to the Books section on Spark's website. Here are the details about the book. Title: Big Data Analytics with Spark Author: Mohammed Guller Link: www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/

Re: Request to add a new book to the Books section on Spark's website

2016-03-09 Thread Sean Owen
Oh yeah I already added it after your earlier message, have a look. On Wed, Mar 9, 2016 at 7:45 PM, Mohammed Guller wrote: > My book on Spark was recently published. I would like to request it to be > added to the Books section on Spark's website. > > > > Here are the details about the book. > >

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
+1 - Ported all our internal jobs to run on 1.6.1 with no regressions. On Wed, Mar 9, 2016 at 7:04 AM, Kousuke Saruta wrote: > +1 (non-binding) > > > On 2016/03/09 4:28, Burak Yavuz wrote: > > +1 > > On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or wrote: > >> +1 >> >> 2016-03-08 10:59 GMT-08:00 Yin

[RESULT] [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
This vote passes with nine +1s (five binding) and one binding +0! Thanks to everyone who tested/voted. I'll start work on publishing the release today. +1: Mark Hamstra* Moshe Eshel Egor Pahomov Reynold Xin* Yin Huai* Andrew Or* Burak Yavuz Kousuke Saruta Michael Armbrust* 0: Sean Owen* -1: (

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Sachin Aggarwal
Hi Cody, let me try once again to explain with an example. In Spark's BatchInfo class, "scheduling delay" is defined as *def schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)* I am dumping the BatchInfo object in my LatencyListener, which extends StreamingListener. batchTime
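The quoted Scala definition can be restated in a minimal Python analogue (a sketch, not Spark's implementation; times here are assumed to be epoch milliseconds). The point of the example is that batchTime, when the batch was created, does not appear in the formula at all, which is what the question above is getting at.

```python
# Python analogue of BatchInfo.schedulingDelay from the Scala snippet above:
# processingStartTime.map(_ - submissionTime), i.e. None until processing
# has actually started, otherwise the gap between submission and start.
def scheduling_delay(processing_start_time, submission_time):
    if processing_start_time is None:
        return None  # Option[Long] = None: batch not yet started
    return processing_start_time - submission_time

# The batch-creation time (batchTime) is deliberately absent: any gap
# between batchTime and submissionTime is not counted in this metric.
print(scheduling_delay(1457540000400, 1457540000300))  # -> 100
print(scheduling_delay(None, 1457540000300))           # -> None
```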

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Mario Ds Briggs
Look at org.apache.spark.streaming.scheduler.JobGenerator: it has a RecurringTimer (timer) that will simply post 'JobGenerate' events to an EventLoop at the batchInterval time. This EventLoop's thread then picks up these events and uses the streamingContext.graph to generate a Job (InputDstream's
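The timer-plus-event-loop pattern described above can be sketched as follows (a toy Python model, not Spark code; the names `recurring_timer` and `event_loop` are illustrative stand-ins for RecurringTimer and EventLoop, and real Spark fires the timer on a wall clock rather than enumerating batch times up front):

```python
# Toy model of the JobGenerator pattern: a timer posts "GenerateJobs" events
# for each batch time onto a queue, and a single event-loop thread drains
# them in order, generating one job per batch time.
import queue
import threading

def recurring_timer(event_queue, interval_ms, num_ticks):
    # Stand-in for RecurringTimer: one event per batch interval.
    for i in range(num_ticks):
        event_queue.put(("GenerateJobs", i * interval_ms))
    event_queue.put(None)  # sentinel: stop the event loop

def event_loop(event_queue, generated):
    # Stand-in for the EventLoop thread processing posted events in order.
    while True:
        event = event_queue.get()
        if event is None:
            break
        _kind, batch_time = event
        generated.append(batch_time)  # stand-in for graph.generateJobs(time)

events = queue.Queue()
generated = []
worker = threading.Thread(target=event_loop, args=(events, generated))
worker.start()
recurring_timer(events, 300, 3)
worker.join()
print(generated)  # -> [0, 300, 600]
```

Because a single thread drains the queue, job generation for batch N+1 cannot overtake batch N, which is consistent with the in-order behavior discussed earlier in this thread.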