On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia wrote:
> Yeah, as Shivaram pointed out, there have been research projects that
> looked at it. Also, Structured Streaming was explicitly designed to not
> make microbatching part of the API or part of the output behavior
My thoughts were about handling just the "current" state of the sliding window
(i.e., the "last" window). The idea is that, at least in the cases I
encountered, the sliding window is used to "forget" irrelevant information, and
therefore when a step goes out of date for the "current" window it
>
> let's say we had implemented distinct count by saving a map with
> the key being the distinct value and the value being the last time we saw
> this value. This would mean that we wouldn't really need to save all the
> steps in the middle and copy the data; we could save only the last
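The map-based distinct count described in the quote above can be sketched in plain Python (this is just the data-structure idea from the thread, not Spark code; the class and method names are illustrative):

```python
class WindowedDistinctCount:
    """Distinct count over a sliding window, keeping only the last time
    each value was seen instead of every intermediate step."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.last_seen = {}  # value -> event time of its last occurrence

    def add(self, value, event_time):
        # Overwriting the timestamp "forgets" all earlier occurrences.
        self.last_seen[value] = event_time

    def count(self, current_time):
        # Evict values whose last occurrence fell out of the window.
        cutoff = current_time - self.window_size
        self.last_seen = {v: t for v, t in self.last_seen.items() if t >= cutoff}
        return len(self.last_seen)


w = WindowedDistinctCount(window_size=10)
w.add("a", 1)
w.add("b", 2)
w.add("a", 8)  # refreshes "a", so it survives longer than "b"
```

Only one timestamp per distinct value is stored, which is the saving the quoted email is pointing at: no per-step copies of the map are needed for the "current" window.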
Hey awesome Spark devs :)
I am new to Spark and have read a lot, but now I am stuck :( so please be
kind if I ask silly questions.
I want to analyze some algorithms and strategies in Spark, and for one
experiment I want to know the size of the intermediate results between
iterations/jobs. Some of
FYI - Xiangrui submitted an amazing pull request to fix a long-standing
issue with a lot of the nondeterministic expressions (rand, randn,
monotonically_increasing_id): https://github.com/apache/spark/pull/15567
Prior to this PR, we were using TaskContext.partitionId as the partition
index in
Access to the partition ID is necessary for basically every single one
of my jobs, and there isn't a foreachPartitionWithIndex equivalent.
You can kind of work around it with an empty foreach after the map, but
it's really awkward to explain to people.
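The workaround being described (run the side effect inside a map-with-index that yields nothing, then force evaluation with an empty action) can be modeled in plain Python. This is a toy stand-in for the RDD API, not Spark code; the function names mirror the Spark methods but everything here is hypothetical:

```python
def map_partitions_with_index(partitions, f):
    # Lazily apply f(index, iterator) to each partition,
    # like RDD.mapPartitionsWithIndex.
    for i, part in enumerate(partitions):
        yield from f(i, iter(part))


def foreach_partition_with_index(partitions, action):
    """The missing method, built from the map variant."""
    def wrapper(i, it):
        action(i, it)      # side effect only
        return iter(())    # yield nothing: the "empty foreach" trick
    # Force evaluation; in Spark this would be an action such as
    # .foreach(lambda _: None) or .count().
    for _ in map_partitions_with_index(partitions, wrapper):
        pass


seen = []
foreach_partition_with_index(
    [[1, 2], [3]],
    lambda i, it: seen.append((i, list(it))),
)
```

The awkwardness the email mentions is visible here: the map variant is lazy, so a do-nothing action has to be tacked on purely to make the side effects run.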
On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin wrote:
Great idea! If the developer docs are in GitHub, then new contributors who
find errors or omissions can update the docs as an introduction to the PR
process.
Fred
On Wed, Oct 19, 2016 at 5:46 PM, Reynold Xin wrote:
> For the contributing guide I think it makes more sense
>
> On a personal note, I'm quite surprised that this is all the progress in
> Structured Streaming over the last three months since 2.0 was released. I
> was under the impression that this was one of the biggest things that the
> Spark community actively works on, but that is clearly not the
I needed the same for debugging, and I just added a "count" action in debug
mode for every step I was interested in. It's very time-consuming, but I
don't debug very often.
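The debug pattern described above (force each intermediate step and record its size) looks roughly like this in plain Python. The helper name and the dict-based bookkeeping are made up for illustration; in Spark the forcing step would be an action such as `rdd.count()`:

```python
DEBUG = True

def traced(name, items, sizes):
    """Force a lazy step and, in debug mode, record how many
    elements it produced."""
    items = list(items)            # forcing the step, like calling .count()
    if DEBUG:
        sizes[name] = len(items)   # size of this intermediate result
    return items


sizes = {}
step1 = traced("filtered", (x for x in range(10) if x % 2 == 0), sizes)
step2 = traced("squared", (x * x for x in step1), sizes)
```

As the email notes, this is expensive: every traced step is fully materialized just to learn its size, which is why it only makes sense behind a debug flag.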
2016-10-20 2:17 GMT-07:00 Andreas Hechenberger :
> Hey awesome Spark devs :)
>
> I am new to Spark
Seems like a good new API to add?
On Thu, Oct 20, 2016 at 11:14 AM, Cody Koeninger wrote:
> Access to the partition ID is necessary for basically every single one
> of my jobs, and there isn't a foreachPartitionWithIndex equivalent.
> You can kind of work around it with
Yep, I had submitted a PR that included it way back in the original
direct stream for Kafka, but it got nixed in favor of
TaskContext.partitionId ;) The concern then was about having too many
xWithBlah APIs on RDD.
If we do want to deprecate TaskContext.partitionId and add
foreachPartitionWithIndex, I