Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the keyed directory structure referenced in that link.
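
To be concrete, here is a minimal sketch of what our write path looks like (the column name and paths are made up for illustration; this is the DataFrameWriter API available since Spark 1.4):

    import org.apache.spark.sql.SaveMode

    // sqlContext is the usual SQLContext from the shell or application.
    val df = sqlContext.read.parquet("/data/input")

    // Writes one directory per key value, e.g. /data/events/key=a/part-*.parquet,
    // which is exactly the layout that partition discovery picks up.
    df.write
      .mode(SaveMode.Overwrite)
      .partitionBy("key")
      .parquet("/data/events")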
On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat <hemant9...@gmail.com> wrote:

> As far as I know, Spark SQL cannot process data on a per-partition basis;
> DataFrame.foreachPartition is the way.
>
> I haven't tried it, but the following looks like a not-so-sophisticated
> way of making Spark SQL partition-aware:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>
> On Wed, Aug 12, 2015 at 5:00 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
>
>> Thanks.
>>
>> In my particular case, I am calculating a distinct count on a key that
>> is unique to each partition, so I want to calculate the distinct count
>> within each partition and then sum those counts. This approach avoids
>> moving the sets of key values between nodes, which would be very
>> expensive.
>>
>> Currently, to accomplish this we manually read in the parquet files (not
>> through Spark SQL), use a bitset to calculate the unique count within
>> each partition, and accumulate the sum. Doing this through Spark SQL
>> would be nice, but the naive "SELECT count(distinct ...)" approach takes
>> 60 times as long :). The approach I mentioned above might be an
>> acceptable hybrid solution.
>>
>> - Philip
>>
>> On Tue, Aug 11, 2015 at 3:27 PM, Eugene Morozov <fathers...@list.ru> wrote:
>>
>>> Philip,
>>>
>>> If all the data for a key is inside just one partition, then Spark will
>>> figure that out. Can you guarantee that's the case? What is it you're
>>> trying to achieve? There might be another way to do it, one where you
>>> can be 100% sure what's happening.
>>>
>>> You can print debugString (or explain, for DataFrames) to see what's
>>> happening under the hood.
>>>
>>> On 12 Aug 2015, at 01:19, Philip Weaver <philip.wea...@gmail.com> wrote:
>>>
>>> If I have an RDD that happens to already be partitioned by a key, how
>>> efficient can I expect a groupBy operation to be? I would expect that
>>> Spark shouldn't have to move data around between nodes, and will simply
>>> have a small amount of work checking the partitions to discover that it
>>> doesn't need to move anything.
>>>
>>> Now, what if we're talking about a parquet database created by using
>>> DataFrameWriter.partitionBy(...)? Will Spark SQL be smart when I group
>>> by a key that I'm already partitioned by?
>>>
>>> - Philip
>>>
>>> Eugene Morozov
>>> fathers...@list.ru
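
P.S. In case it helps anyone following the thread, here is roughly what our hybrid approach looks like as a sketch. The "id" column and the path are illustrative, and I'm using a plain HashSet here where our real code uses a bitset:

    import scala.collection.mutable

    val df = sqlContext.read.parquet("/data/events")

    // Count distinct ids within each partition, then sum the per-partition
    // counts. Because the key is unique to each partition, the ids themselves
    // never need to be shuffled between nodes.
    val total = df.select("id").rdd
      .mapPartitions { rows =>
        val seen = mutable.HashSet.empty[Long]
        rows.foreach(row => seen += row.getLong(0))
        Iterator.single(seen.size.toLong)
      }
      .reduce(_ + _)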