Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the
keyed directory structure that you referenced at that link.
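
For reference, this is roughly how we write the data ("key" and the output
path are placeholders for our actual column and location):

    // Each distinct value of "key" becomes a directory like .../key=<value>/
    df.write
      .partitionBy("key")
      .parquet("/data/table")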

On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat <hemant9...@gmail.com>
wrote:

> As far as I know, Spark SQL cannot process data on a per-partition basis;
> DataFrame.foreachPartition is the way.
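>
> For example, just to show what I mean (df here is a placeholder DataFrame):
>
>     df.foreachPartition { rows =>
>       // rows is an Iterator[Row] for a single partition
>       println(s"rows in this partition: ${rows.size}")
>     }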
>
> I haven't tried it, but the following looks like a not-so-sophisticated way
> of making Spark SQL partition-aware:
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
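>
> For example, with a directory layout like this (paths and the column name
> "key" are only illustrative):
>
>     /data/table/key=1/part-00000.parquet
>     /data/table/key=2/part-00000.parquet
>
> reading the root path should discover "key" as a partition column:
>
>     val df = sqlContext.read.parquet("/data/table")
>     df.printSchema()  // schema should include the partition column "key"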
>
>
> On Wed, Aug 12, 2015 at 5:00 AM, Philip Weaver <philip.wea...@gmail.com>
> wrote:
>
>> Thanks.
>>
>> In my particular case, I am calculating a distinct count on a key that is
>> unique to each partition, so I want to calculate the distinct count within
>> each partition, and then sum those. This approach will avoid moving the
>> sets of that key around between nodes, which would be very expensive.
>>
>> Currently, to accomplish this we are manually reading in the parquet
>> files (not through Spark SQL), using a bitset to calculate the unique count
>> within each partition, and accumulating that sum. Doing this through Spark
>> SQL would be nice, but the naive "SELECT count(distinct ...)" approach
>> takes 60 times as long :). The approach I mentioned above might be an
>> acceptable hybrid solution.
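>>
>> Roughly the kind of thing I'd like to express (untested sketch; "key" stands
>> in for our real column):
>>
>>     // Count distinct keys within each partition, then sum the per-partition
>>     // counts. Correct only because a given key never spans two partitions.
>>     val total = df.select("key").rdd
>>       .mapPartitions { rows =>
>>         val seen = scala.collection.mutable.HashSet[Any]()
>>         rows.foreach(row => seen += row(0))
>>         Iterator.single(seen.size.toLong)
>>       }
>>       .reduce(_ + _)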
>>
>> - Philip
>>
>>
>> On Tue, Aug 11, 2015 at 3:27 PM, Eugene Morozov <fathers...@list.ru>
>> wrote:
>>
>>> Philip,
>>>
>>> If all the data for a key is inside a single partition, then Spark will
>>> figure that out. Can you guarantee that’s the case?
>>> What are you trying to achieve? There might be another way to do it, where
>>> you can be 100% sure of what’s happening.
>>>
>>> You can print the RDD's toDebugString, or call explain on a DataFrame, to
>>> see what’s happening under the hood.
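>>>
>>> For example (rdd, df and "key" are just placeholders):
>>>
>>>     rdd.toDebugString                      // shows lineage and shuffle stages
>>>     df.groupBy("key").count().explain()    // look for Exchange (shuffle) nodes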
>>>
>>>
>>> On 12 Aug 2015, at 01:19, Philip Weaver <philip.wea...@gmail.com> wrote:
>>>
>>> If I have an RDD that happens to already be partitioned by a key, how
>>> efficient can I expect a groupBy operation to be? I would expect that Spark
>>> shouldn't have to move data between nodes, and should only need a small
>>> amount of work to check the partitions and discover that nothing needs to
>>> move.
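>>>
>>> For the RDD case, I mean something like this (illustrative only; "pairs" is
>>> some RDD of key/value pairs):
>>>
>>>     import org.apache.spark.HashPartitioner
>>>     val partitioned = pairs.partitionBy(new HashPartitioner(100)).cache()
>>>     // groupByKey should reuse the existing partitioner and avoid a shuffle
>>>     val grouped = partitioned.groupByKey()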
>>>
>>> Now, what if we're talking about a parquet database created using
>>> DataFrameWriter.partitionBy(...)? Will Spark SQL be smart when I group by a
>>> key that the data is already partitioned by?
>>>
>>> - Philip
>>>
>>>
>>> Eugene Morozov
>>> fathers...@list.ru
>>>
>>>
>>>
>>>
>>>
>>
>
