Re: grouping by a partitioned key

2015-08-12 Thread Philip Weaver
Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the keyed directory structure that you referenced in that link. On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat hemant9...@gmail.com wrote: As far as I know, Spark SQL cannot process data on a per-partition basis.
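
[Editor's note: the messages in this digest are truncated, so for context, here is a minimal sketch of the kind of write being discussed, using the Spark 1.4-era DataFrame API in a spark-shell session; the column name "key" and the paths are assumptions, not from the original thread.]

    // sc is the SparkContext provided by spark-shell
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/input")

    // partitionBy produces one subdirectory per distinct value of "key":
    //   /data/output/key=a/part-...
    //   /data/output/key=b/part-...
    df.write.partitionBy("key").parquet("/data/output")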

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
Inline.. On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov fathers...@list.ru wrote: Hemant, William, please see inlined. On 12 Aug 2015, at 18:18, Philip Weaver philip.wea...@gmail.com wrote: Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the keyed directory structure

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
As far as I know, Spark SQL cannot process data on a per-partition basis. DataFrame.foreachPartition is the way. I haven't tried it, but the following looks like a not-so-sophisticated way of making Spark SQL partition-aware.
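
[Editor's note: the code that followed was truncated from this digest. As a hedged stand-in, here is a minimal sketch of per-partition processing with DataFrame.foreachPartition, reusing the df from the sketch above; what each closure does with its rows is made up for illustration.]

    import org.apache.spark.sql.Row

    df.foreachPartition { rows: Iterator[Row] =>
      // This closure runs once per partition on an executor, so any
      // per-partition setup (e.g. opening a connection) goes here.
      var count = 0L
      rows.foreach { _ => count += 1 }
      println(s"rows in this partition: $count")
    }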

grouping by a partitioned key

2015-08-11 Thread Philip Weaver
If I have an RDD that happens to already be partitioned by a key, how efficient can I expect a groupBy operation to be? I would expect that Spark shouldn't have to move data around between nodes, and will simply have a small amount of work just checking the partitions to discover that it doesn't need to.
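
[Editor's note: a minimal sketch of the case being asked about, with made-up data and partition count. When a pair RDD is pre-partitioned and groupByKey is given the same partitioner, Spark detects the matching partitioner and skips the shuffle, assembling each group within its existing partition.]

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioner = new HashPartitioner(8)

    // Shuffles once, up front, so that all rows for a key share a partition.
    val partitioned = pairs.partitionBy(partitioner).cache()

    // Same partitioner as the RDD's own: no further data movement is needed.
    val grouped = partitioned.groupByKey(partitioner)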

Re: grouping by a partitioned key

2015-08-11 Thread Eugene Morozov
Philip, if all the data for each key is inside just one partition, then Spark will figure that out. Can you guarantee that's the case? What is it you're trying to achieve? There might be another way to do it where you can be 100% sure what's happening. You can print toDebugString (for an RDD) or explain (for a DataFrame) to check.
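
[Editor's note: a minimal sketch of the inspection Eugene suggests, reusing names from the sketches above. toDebugString prints the RDD lineage, where a ShuffledRDD indicates a shuffle; explain prints a DataFrame's physical plan, where an Exchange operator indicates a shuffle.]

    println(grouped.toDebugString)      // look for ShuffledRDD stages

    df.groupBy("key").count().explain() // look for Exchange operators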

Re: grouping by a partitioned key

2015-08-11 Thread Philip Weaver
Thanks. In my particular case, I am calculating a distinct count on a key that is unique to each partition, so I want to calculate the distinct count within each partition, and then sum those. This approach will avoid moving the sets of that key around between nodes, which would be very expensive.
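
[Editor's note: a minimal sketch of the approach Philip describes, reusing the partitioned RDD from the earlier sketch and assuming, as he states, that every key value lives in exactly one partition. Each partition builds its distinct-key set locally, emits only its count, and the counts are summed; no sets cross the network.]

    val distinctKeyCount = partitioned
      .mapPartitions { rows =>
        // Distinct within this partition only; emit a single local count.
        Iterator(rows.map(_._1).toSet.size.toLong)
      }
      .reduce(_ + _)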