Re: StructuredStreaming status

2016-10-20 Thread Amit Sela
On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia wrote: > Yeah, as Shivaram pointed out, there have been research projects that > looked at it. Also, Structured Streaming was explicitly designed to not > make microbatching part of the API or part of the output behavior

RE: StructuredStreaming status

2016-10-20 Thread assaf.mendelson
My thoughts were of handling just the “current” state of the sliding window (i.e. the “last” window). The idea is that at least in cases which I encountered, the sliding window is used to “forget” irrelevant information and therefore when a step goes out of date for the “current” window it
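The idea described here (keeping state only for the "current" sliding window and forgetting entries that fall out of it) can be sketched as a toy model in plain Python. This is a hypothetical illustration of the concept, not Structured Streaming's actual state store; the class and method names are invented for this sketch.

```python
from collections import deque

class SlidingWindowState:
    """Toy model: keep only events inside the latest window; evict the rest."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.events = deque()  # (timestamp, value) pairs, ordered by timestamp

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now):
        # Events older than the current window are "forgotten" immediately,
        # so no intermediate per-step state needs to be retained.
        while self.events and self.events[0][0] <= now - self.window_size:
            self.events.popleft()

    def current(self):
        return list(self.events)
```

With a window of 10 time units, adding events at timestamps 1, 5, and 12 leaves only the events at 5 and 12, since the event at 1 has fallen out of the current window.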

Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
> > let’s say we would have implemented distinct count by saving a map with > the key being the distinct value and the value being the last time we saw > this value. This would mean that we wouldn’t really need to save all the > steps in the middle and copy the data, we could only save the last
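The distinct-count proposal quoted above (a map keyed by the distinct value, whose entry stores only the last time the value was seen) can be sketched as follows. This is a plain-Python illustration of the data structure being discussed, with invented names, not an actual Spark implementation.

```python
class DistinctCountState:
    """Toy model of the proposal: one map entry per distinct value,
    storing only the timestamp at which the value was last seen."""

    def __init__(self, ttl):
        self.ttl = ttl          # how long a value stays "distinct"
        self.last_seen = {}     # value -> last-seen timestamp

    def update(self, value, timestamp):
        self.last_seen[value] = timestamp
        # Expire values not seen within the TTL window; no intermediate
        # steps need to be saved or copied, only this single map.
        self.last_seen = {v: t for v, t in self.last_seen.items()
                          if t > timestamp - self.ttl}

    def distinct_count(self):
        return len(self.last_seen)
```

Because only the last-seen time per value is kept, the state size is bounded by the number of distinct values rather than by the number of window steps.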

Get size of intermediate results

2016-10-20 Thread Andreas Hechenberger
Hey awesome Spark-Dev's :) I am new to Spark and I read a lot, but now I am stuck :( so please be kind if I ask silly questions. I want to analyze some algorithms and strategies in Spark, and for one experiment I want to know the size of the intermediate results between iterations/jobs. Some of

[PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
FYI - Xiangrui submitted an amazing pull request to fix a long-standing issue with a lot of the nondeterministic expressions (rand, randn, monotonically_increasing_id): https://github.com/apache/spark/pull/15567 Prior to this PR, we were using TaskContext.partitionId as the partition index in
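The reason the partition index matters for expressions like `rand` can be illustrated with a small sketch: if per-partition random values are seeded by a stable logical partition index, a retried task regenerates exactly the same values, which is what makes the expression reproducible. This is a plain-Python analogy with an invented helper, not Spark's implementation.

```python
import random

def partition_rand(partition_index, n, seed=42):
    """Deterministic per-partition random numbers: derive the RNG seed from
    the logical partition index, so re-running ("retrying") a partition
    reproduces identical values."""
    rng = random.Random(seed * 1000003 + partition_index)
    return [rng.random() for _ in range(n)]

# A "retry" of partition 3 reproduces the same values; a seed derived
# from an unstable task-level ID would not have this property.
assert partition_rand(3, 5) == partition_rand(3, 5)
assert partition_rand(3, 5) != partition_rand(4, 5)
```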

Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Cody Koeninger
Access to the partition ID is necessary for basically every single one of my jobs, and there isn't a foreachPartitionWithIndex equivalent. You can kind of work around it with an empty foreach after the map, but it's really awkward to explain to people. On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin
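The shape of the workaround being described (an index-aware mapPartitions followed by an empty foreach, versus a direct index-aware foreach) can be sketched with a toy model where an "RDD" is just a list of partitions. The function names mimic the Spark API under discussion but are invented plain-Python stand-ins.

```python
def map_partitions_with_index(partitions, f):
    """Toy model of RDD.mapPartitionsWithIndex: f(index, iterator) -> iterable."""
    return [list(f(i, iter(p))) for i, p in enumerate(partitions)]

def foreach_partition_with_index(partitions, f):
    """The missing convenience: run f(index, iterator) purely for side
    effects. Equivalent to mapPartitionsWithIndex plus an empty foreach,
    but without building and discarding a result."""
    for i, p in enumerate(partitions):
        f(i, iter(p))

seen = []
foreach_partition_with_index(
    [[1, 2], [3]],
    lambda i, it: seen.append((i, list(it))),  # e.g. write partition i somewhere
)
```

The direct form makes the intent (side effects keyed by partition index) explicit, which is the readability point being argued in the thread.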

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-20 Thread Fred Reiss
Great idea! If the developer docs are on GitHub, then new contributors who find errors or omissions can update the docs as an introduction to the PR process. Fred On Wed, Oct 19, 2016 at 5:46 PM, Reynold Xin wrote: > For the contributing guide I think it makes more sense

Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
> > On a personal note, I'm quite surprised that this is all the progress in > Structured Streaming over the last three months since 2.0 was released. I > was under the impression that this was one of the biggest things that the > Spark community actively works on, but that is clearly not the

Re: Get size of intermediate results

2016-10-20 Thread Egor Pahomov
I needed the same for debugging and I just added a "count" action in debug mode for every step I was interested in. It's very time-consuming, but I don't debug very often. 2016-10-20 2:17 GMT-07:00 Andreas Hechenberger : > Hey awesome Spark-Dev's :) > > i am new to spark
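The debugging technique described here (forcing each intermediate step with an extra counting action, gated behind a debug flag) can be sketched in plain Python, with lazy generators standing in for unevaluated Spark stages. The helper name and structure are invented for this sketch.

```python
DEBUG = True

def debug_count(name, data, sizes):
    """In debug mode, materialize an intermediate result and record its
    size, mimicking an extra 'count' action inserted between stages.
    Outside debug mode the data passes through untouched."""
    if DEBUG:
        data = list(data)       # force evaluation, like a Spark action
        sizes[name] = len(data)
    return data

sizes = {}
step1 = debug_count("after_filter", (x for x in range(10) if x % 2 == 0), sizes)
step2 = debug_count("after_map", (x * x for x in step1), sizes)
```

As in the email, the cost is that every instrumented step is fully evaluated an extra time conceptually, which is why this is only worth doing in a debug mode.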

Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
Seems like a good new API to add? On Thu, Oct 20, 2016 at 11:14 AM, Cody Koeninger wrote: > Access to the partition ID is necessary for basically every single one > of my jobs, and there isn't a foreachPartitionWithIndex equivalent. > You can kind of work around it with

Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Cody Koeninger
Yep, I had submitted a PR that included it way back in the original direct stream for Kafka, but it got nixed in favor of TaskContext.partitionId ;) The concern then was about too many xWithBlah APIs on RDD. If we do want to deprecate TaskContext.partitionId and add foreachPartitionWithIndex, I