Inline.

On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov <fathers...@list.ru> wrote:
> Hemant, William, please see inline.
>
> On 12 Aug 2015, at 18:18, Philip Weaver <philip.wea...@gmail.com> wrote:
>
> Yes, I am partitioning using DataFrameWriter.partitionBy, which produces
> the keyed directory structure that you referenced in that link.
>
>
> Have you tried using the DataFrame API instead of SQL? I mean something
> like dataFrame.select(key).agg(count).distinct().agg(sum).
> Could you print the explain output for this and for the SQL you tried? I’m
> just curious about the difference.
>
>
> On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>
>> As far as I know, Spark SQL cannot process data on a per-partition basis.
>> DataFrame.foreachPartition is the way.
>>
>
> What do you mean by “cannot process on a per-partition basis”? A DataFrame
> is an RDD on steroids.
>
I meant that Spark SQL cannot process the data of a single partition the way
you can with foreachPartition.
>
>
>> I haven't tried it, but the following looks like a not-so-sophisticated
>> way of making Spark SQL partition-aware:
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>>
>>
>> On Wed, Aug 12, 2015 at 5:00 AM, Philip Weaver <philip.wea...@gmail.com>
>> wrote:
>>
>>> Thanks.
>>>
>>> In my particular case, I am calculating a distinct count on a key that
>>> is unique to each partition, so I want to calculate the distinct count
>>> within each partition and then sum those counts. This approach avoids
>>> moving the sets of that key around between nodes, which would be very
>>> expensive.
>>>
>>> Currently, to accomplish this we manually read in the parquet files (not
>>> through Spark SQL), use a bitset to calculate the unique count within
>>> each partition, and accumulate the sum. Doing this through Spark SQL
>>> would be nice, but the naive "SELECT count(distinct ...)" approach takes
>>> 60 times as long :). The approach I mentioned above might be an
>>> acceptable hybrid solution.
>>>
>>> - Philip
>>>
>>>
>>> On Tue, Aug 11, 2015 at 3:27 PM, Eugene Morozov <fathers...@list.ru>
>>> wrote:
>>>
>>>> Philip,
>>>>
>>>> If all the data for a key are inside just one partition, then Spark
>>>> will figure that out. Can you guarantee that’s the case?
>>>> What is it you are trying to achieve? There might be another way to do
>>>> it, where you can be 100% sure what’s happening.
>>>>
>>>> You can print toDebugString (for an RDD) or explain (for a DataFrame)
>>>> to see what’s happening under the hood.
>>>>
>>>>
>>>> On 12 Aug 2015, at 01:19, Philip Weaver <philip.wea...@gmail.com>
>>>> wrote:
>>>>
>>>> If I have an RDD that happens to already be partitioned by a key, how
>>>> efficient can I expect a groupBy operation to be? I would expect that
>>>> Spark shouldn't have to move data around between nodes, and will simply
>>>> have a small amount of work checking the partitions to discover that it
>>>> doesn't need to move anything around.
>>>>
>>>> Now, what if we're talking about a parquet database created by using
>>>> DataFrameWriter.partitionBy(...)? Will Spark SQL be smart when I group
>>>> by a key that I'm already partitioned by?
>>>>
>>>> - Philip
>>>>
>>>>
>>>> Eugene Morozov
>>>> fathers...@list.ru
>>>>
>
> Eugene Morozov
> fathers...@list.ru
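For reference, a minimal sketch of the DataFrame-API approach Eugene suggests,
written against the Spark 1.x Scala API. The path and the column names ("key"
as the partitionBy column, "id" as the column being distinct-counted) are
illustrative assumptions, not details taken from the thread:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    // Assumes an existing SparkContext `sc` (e.g. in spark-shell), and that
    // the parquet data was written with DataFrameWriter.partitionBy("key"),
    // so partition discovery adds "key" back as a column on read.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/events")

    // Distinct count of "id" within each value of the partition column,
    // then the sum of those per-key counts.
    val perKey = df.groupBy("key").agg(countDistinct("id").as("distinct_ids"))
    val total  = perKey.agg(sum("distinct_ids"))

    // Print the physical plans to compare this with the plain SQL version,
    // as Eugene suggests above.
    total.explain()
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(distinct id) FROM events").explain()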
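And a sketch of the hybrid approach Philip describes: computing the distinct
count inside each partition and summing the partial counts. This is only
correct under his stated invariant that each value of the counted key lives in
exactly one partition. A mutable HashSet stands in for his bitset, and the
Long-typed "id" column (reusing `df` from the sketch above) is again an
assumption:

    import scala.collection.mutable

    // No shuffle: each partition is counted locally, and only one Long
    // per partition travels back to the driver via reduce.
    val totalDistinct: Long = df.select("id").rdd
      .mapPartitions { rows =>
        val seen = mutable.HashSet.empty[Long]  // stand-in for the bitset
        rows.foreach(row => seen += row.getLong(0))
        Iterator.single(seen.size.toLong)       // one partial count per partition
      }
      .reduce(_ + _)                            // sum the per-partition counts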