Inline.

On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov <fathers...@list.ru> wrote:
> Hemant, William, please see inline.
>
> On 12 Aug 2015, at 18:18, Philip Weaver <philip.wea...@gmail.com> wrote:
>
> Yes, I am partitioning using DataFrameWriter.partitionBy, which produces
> the keyed directory structure that you referenced in that link.
>
>
> Have you tried using the DataFrame API instead of SQL? I mean something
> like dataFrame.select(key).agg(count).distinct().agg(sum).
> Could you print the explain output for this and for the SQL you tried? I’m
> just curious about the difference.
>
>
> On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>
>> As far as I know, Spark SQL cannot process data on a per-partition basis.
>> DataFrame.foreachPartition is the way.
>>
>
> What do you mean by “cannot process on a per-partition basis”? A DataFrame
> is an RDD on steroids.
>
I meant that Spark SQL cannot process the data of a single partition the way
you can with foreachPartition.
>
>
>> I haven't tried it, but the following looks like a not-so-sophisticated
>> way of making Spark SQL partition-aware:
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>>
>>
>> On Wed, Aug 12, 2015 at 5:00 AM, Philip Weaver <philip.wea...@gmail.com>
>> wrote:
>>
>>> Thanks.
>>>
>>> In my particular case, I am calculating a distinct count on a key that
>>> is unique to each partition, so I want to calculate the distinct count
>>> within each partition and then sum those counts. This approach avoids
>>> moving the sets of that key around between nodes, which would be very
>>> expensive.
>>>
>>> Currently, to accomplish this we manually read in the parquet files (not
>>> through Spark SQL), use a bitset to calculate the unique count within
>>> each partition, and accumulate the sum. Doing this through Spark SQL
>>> would be nice, but the naive "SELECT count(distinct ...)" approach takes
>>> 60 times as long :). The approach I mentioned above might be an
>>> acceptable hybrid solution.
>>>
>>> - Philip
>>>
>>>
>>> On Tue, Aug 11, 2015 at 3:27 PM, Eugene Morozov <fathers...@list.ru>
>>> wrote:
>>>
>>>> Philip,
>>>>
>>>> If all the data for a key are inside just one partition, then Spark
>>>> will figure that out. Can you guarantee that’s the case?
>>>> What is it you are trying to achieve? There might be another way to do
>>>> it, where you can be 100% sure what’s happening.
>>>>
>>>> You can print toDebugString (for an RDD) or explain (for a DataFrame)
>>>> to see what’s happening under the hood.
>>>>
>>>>
>>>> On 12 Aug 2015, at 01:19, Philip Weaver <philip.wea...@gmail.com>
>>>> wrote:
>>>>
>>>> If I have an RDD that happens to already be partitioned by a key, how
>>>> efficient can I expect a groupBy operation to be? I would expect that
>>>> Spark shouldn't have to move data around between nodes, and will simply
>>>> have a small amount of work checking the partitions to discover that it
>>>> doesn't need to move anything around.
>>>>
>>>> Now, what if we're talking about a parquet database created by using
>>>> DataFrameWriter.partitionBy(...)? Will Spark SQL be smart when I group
>>>> by a key that I'm already partitioned by?
>>>>
>>>> - Philip
>>>>
>>>>
>>>> Eugene Morozov
>>>> fathers...@list.ru
>>>>
>
> Eugene Morozov
> fathers...@list.ru
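For reference, a minimal sketch of the DataFrame-API approach Eugene suggests,
written against the Spark 1.x Scala API. The path and the column names ("key"
as the partitionBy column, "id" as the column being distinct-counted) are
illustrative assumptions, not details taken from the thread:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    // Assumes an existing SparkContext `sc` (e.g. in spark-shell), and that
    // the parquet data was written with DataFrameWriter.partitionBy("key"),
    // so partition discovery adds "key" back as a column on read.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/events")

    // Distinct count of "id" within each value of the partition column,
    // then the sum of those per-key counts.
    val perKey = df.groupBy("key").agg(countDistinct("id").as("distinct_ids"))
    val total  = perKey.agg(sum("distinct_ids"))

    // Print the physical plans to compare this with the plain SQL version,
    // as Eugene suggests above.
    total.explain()
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(distinct id) FROM events").explain()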
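And a sketch of the hybrid approach Philip describes: computing the distinct
count inside each partition and summing the partial counts. This is only
correct under his stated invariant that each value of the counted key lives in
exactly one partition. A mutable HashSet stands in for his bitset, and the
Long-typed "id" column (reusing `df` from the sketch above) is again an
assumption:

    import scala.collection.mutable

    // No shuffle: each partition is counted locally, and only one Long
    // per partition travels back to the driver via reduce.
    val totalDistinct: Long = df.select("id").rdd
      .mapPartitions { rows =>
        val seen = mutable.HashSet.empty[Long]  // stand-in for the bitset
        rows.foreach(row => seen += row.getLong(0))
        Iterator.single(seen.size.toLong)       // one partial count per partition
      }
      .reduce(_ + _)                            // sum the per-partition counts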