Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Cody Koeninger
Yep, I had submitted a PR that included it way back in the original direct stream for Kafka, but it got nixed in favor of TaskContext.partitionId ;) The concern then was about too many xWithBlah APIs on RDD. If we do want to deprecate TaskContext.partitionId and add foreachPartitionWithIndex, I

Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
Seems like a good new API to add? On Thu, Oct 20, 2016 at 11:14 AM, Cody Koeninger wrote: > Access to the partition ID is necessary for basically every single one > of my jobs, and there isn't a foreachPartitionWithIndex equivalent. > You can kind of work around it with

Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Cody Koeninger
Access to the partition ID is necessary for basically every single one of my jobs, and there isn't a foreachPartitionWithIndex equivalent. You can kind of work around it with an empty foreach after the map, but it's really awkward to explain to people. On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin
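
For readers unfamiliar with the workaround mentioned above, here is a minimal sketch of what "an empty foreach after the map" might look like. It assumes "the map" means mapPartitionsWithIndex, which the message does not state explicitly. The helper name foreachPartitionWithIndex is hypothetical (it is the API being asked for, not something Spark provides), and the local master, app name, and sample data are only there to make the example self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ForeachPartitionWithIndexSketch {

  // Hypothetical helper, NOT a Spark API: runs `f` once per partition for its
  // side effects, passing the partition index that mapPartitionsWithIndex supplies.
  def foreachPartitionWithIndex[T](rdd: RDD[T])(f: (Int, Iterator[T]) => Unit): Unit = {
    rdd.mapPartitionsWithIndex[Unit] { (index, iter) =>
      f(index, iter)   // the real work happens here, as a side effect
      Iterator.empty   // nothing is passed downstream
    }.foreach(_ => ()) // empty action whose only job is to force the stage to run
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf()
      .setAppName("foreachPartitionWithIndex-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 8, numSlices = 4)
      foreachPartitionWithIndex(rdd) { (index, iter) =>
        println(s"partition $index: ${iter.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The awkward part is the trailing foreach: mapPartitionsWithIndex is a lazy transformation, so some action has to follow it even though there is nothing left to consume.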

[PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Reynold Xin
FYI - Xiangrui submitted an amazing pull request to fix a long-standing issue with a lot of the nondeterministic expressions (rand, randn, monotonically_increasing_id): https://github.com/apache/spark/pull/15567 Prior to this PR, we were using TaskContext.partitionId as the partition index in