Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the
keyed directory structure that you referenced in that link.
On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat hemant9...@gmail.com
wrote:
As far as I know, Spark SQL cannot process data on a per-partition-basis.
Inline..
On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov fathers...@list.ru wrote:
Hemant, William, pls see inlined.
On 12 Aug 2015, at 18:18, Philip Weaver philip.wea...@gmail.com wrote:
Yes, I am partitioning using DataFrameWriter.partitionBy, which produces
the keyed directory structure
As far as I know, Spark SQL cannot process data on a per-partition-basis.
DataFrame.foreachPartition is the way.
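To make the foreachPartition contract concrete: the user function is called once per partition and receives an iterator over only that partition's rows. Here is a minimal plain-Python model of that behavior (the partition lists and the collecting function are hypothetical stand-ins, not Spark API):

```python
# Model of DataFrame.foreachPartition: each inner list stands for one
# partition, and the user function is invoked once per partition with an
# iterator over that partition's rows only.
partitions = [[1, 2, 3], [4, 5]]  # two hypothetical partitions

partition_sums = []

def per_partition(rows):
    # Runs once per partition; sees only this partition's rows.
    partition_sums.append(sum(rows))

for part in partitions:
    per_partition(iter(part))

# partition_sums now holds one local result per partition.
```

In real Spark the function runs on the executors, so it must not rely on driver-side state the way this single-process sketch does.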
I haven't tried it, but the following looks like a not-so-sophisticated way of
making Spark SQL partition-aware.
If I have an RDD that happens to already be partitioned by a key, how
efficient can I expect a groupBy operation to be? I would expect that Spark
shouldn't have to move data around between nodes, and simply will have a
small amount of work just checking the partitions to discover that it
doesn't need to move anything.
Philip,
If all the data for a key is inside just one partition, then Spark will figure that
out. Can you guarantee that’s the case?
What are you trying to achieve? There might be another way to do it where you
can be 100% sure of what’s happening.
You can print debugString (for an RDD) or call explain (for a DataFrame) to see whether a shuffle is planned.
Thanks.
In my particular case, I am calculating a distinct count on a key that is
unique to each partition, so I want to calculate the distinct count within
each partition, and then sum those. This approach will avoid moving the
sets of that key around between nodes, which would be very expensive.
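That two-step scheme (count distinct keys locally per partition, then sum the local counts) can be modeled without Spark at all. The following plain-Python sketch is only an illustration of the logic; the partition lists and key layout are hypothetical, and it assumes, as in the case above, that every key lives in exactly one partition:

```python
# Model: each inner list stands for one Spark partition, and the data is
# already partitioned by key, so no key appears in more than one partition.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],  # keys "a" and "b" only here
    [("c", 4), ("c", 5)],            # key "c" only here
]

def distinct_keys_in_partition(rows):
    """Per-partition work (mapPartitions analog): count distinct keys locally."""
    return len({key for key, _ in rows})

# Sum the per-partition counts; no key sets move between "partitions".
total_distinct = sum(distinct_keys_in_partition(p) for p in partitions)
```

Because the local counts are plain integers, only one number per partition needs to reach the driver, instead of shuffling the full sets of keys between nodes.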