Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Michael Davies
Hi Cheng, Are you saying that by setting up the lineage schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema) then Spark SQL will know that an SQL “group by” on Customer Code will not have to shuffle? But the prepared will have already shuffled so we p
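For readers following the thread, here is a minimal plain-Scala sketch (no Spark dependency) of what the keyBy/partitionBy lineage above does to row placement. The `Row` type, the sample data, and the `partitionByCustomer` helper are hypothetical and exist only for illustration; `partitionOf` mirrors the non-negative modulo of `hashCode` that Spark's HashPartitioner applies. It shows only that rows sharing a Customer Code land in a single partition. It does not show whether Spark SQL's planner recognises that placement, which is the question being asked.

```scala
// Plain-Scala sketch, not Spark code: simulates hash partitioning by a key column.
object HashPartitionSketch {
  // Hypothetical row shape for illustration: (orderId, customerCode)
  type Row = (String, String)

  // Non-negative modulo of the key's hashCode, the rule HashPartitioner uses.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Rough analogue of keyBy(_._2).partitionBy(new HashPartitioner(n)).values:
  // every row is assigned to the partition chosen by its customer code.
  def partitionByCustomer(rows: Seq[Row], n: Int): Map[Int, Seq[Row]] =
    rows.groupBy(r => partitionOf(r._2, n))

  def main(args: Array[String]): Unit = {
    // Assumed sample data
    val rows = Seq(("o1", "C001"), ("o2", "C002"), ("o3", "C001"), ("o4", "C003"))
    val parts = partitionByCustomer(rows, 4)

    // Count how many partitions hold each customer code; a per-partition
    // "group by" is complete only if every code lives in exactly one partition.
    rows.map(_._2).distinct.foreach { code =>
      val holders = parts.count { case (_, rs) => rs.exists(_._2 == code) }
      println(s"$code is held by $holders partition(s)")
    }
  }
}
```

Because each code maps to exactly one partition, a per-partition aggregation on Customer Code needs no further data movement, but the partitionBy step itself still performs one shuffle up front.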

Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
n, would you like to discuss?

Thanks
Mick

> On 30 Dec 2014, at 17:40, Michael Davies wrote:
>
> Hi Michael,
>
> I’ve looked through the example and the test cases and I think I understand
> what we need to do - so I’ll give it a go.
>
> I think what I’d like to try

Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
Hi Michael, I’ve looked through the example and the test cases and I think I understand what we need to do - so I’ll give it a go. I think what I’d like to try to do is allow files to be added at any time, so perhaps I can cache partition info, and also what may be useful for us would be to d