Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Michael Davies
Hi Cheng, 

Are you saying that by setting up the lineage

schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)

Spark SQL will then know that an SQL “group by” on CustomerCode will not have to shuffle?

But the prepared SchemaRDD will already have been shuffled, so we pay an upfront cost 
to benefit future groupings (assuming we cache it, I suppose).
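
For concreteness, something along these lines is what I mean - a minimal sketch only (Spark 1.2 SchemaRDD API; the sqlContext, the registered "Orders" table and CustomerCode being at ordinal 1 are assumptions):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair RDD functions for partitionBy

    val orders = sqlContext.sql("SELECT * FROM Orders")
    val schema = orders.schema
    val n = 200  // should match spark.sql.shuffle.partitions

    // Pay the shuffle cost once, up front, by repartitioning on CustomerCode.
    val prepared = sqlContext.applySchema(
      orders.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values,
      schema)

    // Cache the partitioned data so later aggregations can reuse it.
    prepared.cache()
    prepared.registerTempTable("PreparedOrders")

    val totals = sqlContext.sql(
      "SELECT CustomerCode, SUM(Cost) FROM PreparedOrders GROUP BY CustomerCode")

So the partitioning and caching are paid for once; whether the later GROUP BY can then avoid the exchange is really my question.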

Mick

> On 20 Jan 2015, at 20:44, Cheng Lian  wrote:
> 
> First of all, even if the underlying dataset is partitioned as expected, a 
> shuffle can’t be avoided, because Spark SQL knows nothing about the 
> underlying data distribution. However, this does reduce network IO.
> 
> You can prepare your data like this (say CustomerCode is a string field with 
> ordinal 1):
> 
> val schemaRdd = sql(...)
> val schema = schemaRdd.schema
> val prepared = schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)
> n should be equal to spark.sql.shuffle.partitions.
> 
> Cheng
> 
> On 1/19/15 7:44 AM, Mick Davies wrote:
> 
> 
> 
>> Is it possible to use a HashPartitioner or something similar to distribute a
>> SchemaRDD's data by the hash of a particular column or set of columns?
>> 
>> Having done this, I would then hope that GROUP BY could avoid a shuffle.
>> 
>> E.g. set up a HashPartitioner on the CustomerCode field so that
>> 
>> SELECT CustomerCode, SUM(Cost)
>> FROM Orders
>> GROUP BY CustomerCode
>> 
>> would not need to shuffle.
>> 
>> Cheers 
>> Mick



Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
Hi Michael, 

I have got the directory-based column support working, at least in a trial. I 
have put the trial code here: DirIndexParquet.scala 
<https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala>. 
It has involved me copying quite a lot of newParquet.

There are some tests here 
<https://github.com/MickDavies/spark-parquet-dirindex/tree/master/src/test/scala/org/apache/spark/sql/parquet> 
that illustrate its use.

I’d be keen to help in any way with the datasources API changes that you 
mention - would you like to discuss?
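
To make the discussion concrete, here is a very rough sketch of how directory-derived columns might surface through the existing 1.2 sources API. This is not the DirIndexParquet code - the <root>/<date>/<region> layout, the column names and the elided Parquet read are all assumptions:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources.{EqualTo, Filter, PrunedFilteredScan}

    class DirIndexedRelation(root: String)(@transient val sqlContext: SQLContext)
      extends PrunedFilteredScan {

      // Two columns derived purely from the directory structure.
      override val schema = StructType(Seq(
        StructField("date", StringType, nullable = false),
        StructField("region", StringType, nullable = false)))

      // requiredColumns is ignored in this sketch; a real relation would honour it.
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)

        // Keep only the leaf directories that satisfy the equality filters
        // pushed down on the directory-derived columns.
        val surviving = fs.globStatus(new Path(root, "*/*")).map(_.getPath).filter { p =>
          val (date, region) = (p.getParent.getName, p.getName)
          filters.forall {
            case EqualTo("date", v)   => v == date
            case EqualTo("region", v) => v == region
            case _                    => true // anything else is evaluated by Spark SQL later
          }
        }

        // A real implementation would read the Parquet files under each
        // surviving directory and append the derived values to every row;
        // that part is elided here.
        sqlContext.sparkContext.parallelize(
          surviving.map(p => Row(p.getParent.getName, p.getName)).toSeq)
      }
    }

The interesting part of the 1.3 changes would presumably be making exactly this kind of directory pruning and schema handling available without copying newParquet.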

Thanks

Mick



> On 30 Dec 2014, at 17:40, Michael Davies  wrote:
> 
> Hi Michael, 
> 
> I’ve looked through the example and the test cases and I think I understand 
> what we need to do - so I’ll give it a go. 
> 
> I think what I’d like to try to do is allow files to be added at any time, so 
> perhaps I can cache partition info. It may also be useful for us to derive 
> the schema from the set of all files - hopefully this is achievable too.
> 
> Thanks
> 
> Mick
> 
> 
>> On 30 Dec 2014, at 04:49, Michael Armbrust <mich...@databricks.com> wrote:
>> 
>> You can't do this now without writing a bunch of custom logic (see here for 
>> an example: 
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
>> 
>> I would like to make this easier as part of improvements to the datasources 
>> API that we are planning for Spark 1.3.
>> 
>> On Mon, Dec 29, 2014 at 2:19 AM, Mickalas <michael.belldav...@gmail.com> wrote:
>> I see that there is already a request to add wildcard support to the
>> SQLContext.parquetFile function:
>> https://issues.apache.org/jira/browse/SPARK-3928.
>> 
>> What seems like a useful thing for our use case is to associate the
>> directory structure with certain columns in the table, but it does not seem
>> like this is supported.
>> 
>> For example, we want to create Parquet files on a daily basis associated with
>> geographic regions, and so will create a set of files under directories such
>> as:
>> 
>> * 2014-12-29/Americas
>> * 2014-12-29/Asia
>> * 2014-12-30/Americas
>> * ...
>> 
>> Where queries have predicates that match the column values determinable from
>> the directory structure, it would be good to extract data only from matching
>> files.
>> 
>> Does anyone know if something like this is supported, or whether this is a
>> reasonable thing to request?
>> 
>> Mick
>> 



Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
Hi Michael, 

I’ve looked through the example and the test cases and I think I understand 
what we need to do - so I’ll give it a go. 

I think what I’d like to try to do is allow files to be added at any time, so 
perhaps I can cache partition info. It may also be useful for us to derive the 
schema from the set of all files - hopefully this is achievable too.
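
In the meantime, the sort of manual workaround I am picturing looks roughly like this - a sketch only, assuming a <root>/<date>/<region> layout, an existing sqlContext (Spark 1.2), and illustrative names throughout:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val root = "/data/orders"  // assumed layout: <root>/<date>/<region>/*.parquet
    val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)

    // Cache the "partition info": one entry per leaf directory, keyed by the
    // column values derived from the path segments.
    val partitions: Map[(String, String), String] =
      fs.globStatus(new Path(root, "*/*")).map { status =>
        val p = status.getPath
        (p.getParent.getName, p.getName) -> p.toString  // (date, region) -> path
      }.toMap

    // Read only the directories whose derived values match the predicate,
    // e.g. region = 'Americas' on 2014-12-29, and union the matching files.
    val matching = partitions.collect {
      case ((date, region), path) if date == "2014-12-29" && region == "Americas" => path
    }
    val data = matching.map(dir => sqlContext.parquetFile(dir)).reduce(_ unionAll _)
    data.registerTempTable("orders_americas_20141229")

Note that in this sketch the derived date/region values are not appended as columns, and the schema just comes from whichever files are read - doing that properly is where the newParquet-style custom logic comes in.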

Thanks

Mick


> On 30 Dec 2014, at 04:49, Michael Armbrust  wrote:
> 
> You can't do this now without writing a bunch of custom logic (see here for 
> an example: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
> 
> I would like to make this easier as part of improvements to the datasources 
> API that we are planning for Spark 1.3.
> 
> On Mon, Dec 29, 2014 at 2:19 AM, Mickalas wrote:
> I see that there is already a request to add wildcard support to the
> SQLContext.parquetFile function:
> https://issues.apache.org/jira/browse/SPARK-3928.
> 
> What seems like a useful thing for our use case is to associate the
> directory structure with certain columns in the table, but it does not seem
> like this is supported.
> 
> For example, we want to create Parquet files on a daily basis associated with
> geographic regions, and so will create a set of files under directories such
> as:
> 
> * 2014-12-29/Americas
> * 2014-12-29/Asia
> * 2014-12-30/Americas
> * ...
> 
> Where queries have predicates that match the column values determinable from
> the directory structure, it would be good to extract data only from matching
> files.
> 
> Does anyone know if something like this is supported, or whether this is a
> reasonable thing to request?
> 
> Mick
> 