Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
Hi Michael, 

I have got the directory-based column support working, at least in a trial. I 
have put the trial code here: DirIndexParquet.scala 
https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala
It has involved copying quite a lot of newParquet.scala.

There are some tests here that illustrate its use: 
https://github.com/MickDavies/spark-parquet-dirindex/tree/master/src/test/scala/org/apache/spark/sql/parquet
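
A minimal, self-contained sketch of the idea follows. It is not the 
DirIndexParquet implementation itself; the date/region column names and the 
<base>/<date>/<region> path layout are assumptions for illustration. Column 
values are derived from path segments, and a predicate on those columns prunes 
whole files before any Parquet data is read.

// Sketch only, not the DirIndexParquet code. Column names and path layout
// are assumptions for illustration.
object DirIndexSketch {

  final case class DirColumns(date: String, region: String)

  // Derive column values from the path segments under the base directory,
  // e.g. /data/2014-12-29/Americas/part-00000.parquet -> (2014-12-29, Americas).
  def dirColumns(path: String, base: String): Option[DirColumns] = {
    val rel = path.stripPrefix(base).stripPrefix("/")
    rel.split("/") match {
      case Array(date, region, _*) => Some(DirColumns(date, region))
      case _                       => None
    }
  }

  // Keep only files whose directory-derived columns satisfy the predicate;
  // only these would ever be handed to the Parquet reader.
  def prune(paths: Seq[String], base: String)(pred: DirColumns => Boolean): Seq[String] =
    paths.filter(p => dirColumns(p, base).exists(pred))

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "/data/2014-12-29/Americas/part-00000.parquet",
      "/data/2014-12-29/Asia/part-00000.parquet",
      "/data/2014-12-30/Americas/part-00000.parquet")

    // Equivalent of: WHERE date = '2014-12-29' AND region = 'Americas'
    val selected = prune(files, "/data")(c => c.date == "2014-12-29" && c.region == "Americas")
    selected.foreach(println) // only the first file survives
  }
}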

I’d be keen to help in any way with the datasources API changes that you 
mention. Would you like to discuss?

Thanks

Mick



 On 30 Dec 2014, at 17:40, Michael Davies michael.belldav...@gmail.com wrote:
 
 Hi Michael, 
 
 I’ve looked through the example and the test cases and I think I understand 
 what we need to do - so I’ll give it a go. 
 
 I think what I’d like to try to do is allow files to be added at any time, so 
 perhaps I can cache partition info. It would also be useful for us to derive 
 the schema from the set of all files; hopefully that is achievable too.
 
 Thanks
 
 Mick
 
 
 On 30 Dec 2014, at 04:49, Michael Armbrust mich...@databricks.com wrote:
 
 You can't do this now without writing a bunch of custom logic (see here for 
 an example: 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
 
 I would like to make this easier as part of improvements to the datasources 
 API that we are planning for Spark 1.3.
 
 On Mon, Dec 29, 2014 at 2:19 AM, Mickalas michael.belldav...@gmail.com wrote:
 I see that there is already a request to add wildcard support to the
 SQLContext.parquetFile function
 https://issues.apache.org/jira/browse/SPARK-3928.
 
 What seems like a useful thing for our use case is to associate the
 directory structure with certain columns in the table, but it does not seem
 like this is supported.
 
 For example we want to create parquet files on a daily basis associated with
 geographic regions and so will create a set of files under directories such
 as:
 
 * 2014-12-29/Americas
 * 2014-12-29/Asia
 * 2014-12-30/Americas
 * ...
 
 Where queries have predicates that match the column values determinable from
 directory structure it would be good to only extract data from matching
 files.
 
 Does anyone know if something like this is supported, or whether this is a
 reasonable thing to request?
 
 Mick
 
 
 
 
 
 
 
 
 
 
 



Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
Hi Michael, 

I’ve looked through the example and the test cases and I think I understand 
what we need to do - so I’ll give it a go. 

I think what I’d like to try to do is allow files to be added at any time, so 
perhaps I can cache partition info. It would also be useful for us to derive the 
schema from the set of all files; hopefully that is achievable too.
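
A rough sketch of both ideas follows: partition info cached so that files can 
be added at any time, and a schema derived as the union of the fields seen 
across all files. The column names and the simplified string types are 
illustrative only; this is not the real implementation.

import scala.collection.mutable

object PartitionCacheSketch {

  final case class Field(name: String, dataType: String)

  // Partition values are discovered once per leaf directory and cached, so
  // files added later only require looking at directories not yet seen.
  private val partitionCache = mutable.Map.empty[String, Map[String, String]]

  def partitionValues(leafDir: String): Map[String, String] =
    partitionCache.getOrElseUpdate(leafDir, {
      // e.g. "/data/2014-12-29/Americas" -> date and region values
      leafDir.split("/").filter(_.nonEmpty).takeRight(2) match {
        case Array(date, region) => Map("date" -> date, "region" -> region)
        case _                   => Map.empty[String, String]
      }
    })

  // The table schema is derived from the set of all files: the union of the
  // fields seen, keeping the first type encountered for each field name.
  def mergedSchema(fileSchemas: Seq[Seq[Field]]): Seq[Field] =
    fileSchemas.flatten.foldLeft(Vector.empty[Field]) { (acc, f) =>
      if (acc.exists(_.name == f.name)) acc else acc :+ f
    }
}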

Thanks

Mick


 On 30 Dec 2014, at 04:49, Michael Armbrust mich...@databricks.com wrote:
 
 You can't do this now without writing a bunch of custom logic (see here for 
 an example: 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
 
 I would like to make this easier as part of improvements to the datasources 
 API that we are planning for Spark 1.3.
 
 On Mon, Dec 29, 2014 at 2:19 AM, Mickalas michael.belldav...@gmail.com wrote:
 I see that there is already a request to add wildcard support to the
 SQLContext.parquetFile function
 https://issues.apache.org/jira/browse/SPARK-3928.
 
 What seems like a useful thing for our use case is to associate the
 directory structure with certain columns in the table, but it does not seem
 like this is supported.
 
 For example we want to create parquet files on a daily basis associated with
 geographic regions and so will create a set of files under directories such
 as:
 
 * 2014-12-29/Americas
 * 2014-12-29/Asia
 * 2014-12-30/Americas
 * ...
 
 Where queries have predicates that match the column values determinable from
 directory structure it would be good to only extract data from matching
 files.
 
 Does anyone know if something like this is supported, or whether this is a
 reasonable thing to request?
 
 Mick
 
 
 
 
 
 
 
 
 
 



Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Mickalas
I see that there is already a request to add wildcard support to the
SQLContext.parquetFile function
https://issues.apache.org/jira/browse/SPARK-3928.

What seems like a useful thing for our use case is to associate the
directory structure with certain columns in the table, but it does not seem
like this is supported.

For example we want to create parquet files on a daily basis associated with
geographic regions and so will create a set of files under directories such
as:

* 2014-12-29/Americas
* 2014-12-29/Asia
* 2014-12-30/Americas
* ...

Where queries have predicates that match the column values determinable from
the directory structure, it would be good to extract data only from the
matching files.
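
To make the request concrete, this is roughly the usage we are after. It is 
only a sketch of what we would like to write, not something that works today; 
the table name, base path, and the date/region column names are made up.

import org.apache.spark.sql.SQLContext

// Sketch only: today the nested <date>/<region> directories are not turned
// into columns, so this does not work as written.
def regionQuery(sqlContext: SQLContext): Unit = {
  // Base directory laid out as /data/events/<date>/<region>/part-*.parquet
  val events = sqlContext.parquetFile("/data/events")
  events.registerTempTable("events")

  // The date and region values would come from the directory names, and the
  // predicate below should mean only files under 2014-12-29/Americas are read.
  val americas = sqlContext.sql(
    """SELECT *
      |FROM events
      |WHERE date = '2014-12-29' AND region = 'Americas'""".stripMargin)

  americas.collect().foreach(println)
}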

Does anyone know if something like this is supported, or whether this is a
reasonable thing to request?

Mick








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Michael Armbrust
You can't do this now without writing a bunch of custom logic (see here for
an example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)

I would like to make this easier as part of improvements to the datasources
API that we are planning for Spark 1.3.
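
A very rough sketch of the kind of custom logic involved, written against the 
shape of the external data sources API (org.apache.spark.sql.sources) rather 
than taken from newParquet.scala. The relation name, the date/region columns, 
and the path layout are assumptions, and the actual Parquet reading is left as 
a stub.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative only: prune whole directories using equality filters on the
// directory-derived columns before any Parquet file is opened.
case class DirIndexedParquetRelation(basePath: String)
                                    (@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  // Data columns would normally come from the Parquet footers; the last two
  // here are derived purely from the directory names.
  override def schema: StructType = StructType(Seq(
    StructField("value", StringType, nullable = true),
    StructField("date", StringType, nullable = true),
    StructField("region", StringType, nullable = true)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Equality filters on the directory-derived columns drop whole
    // directories before any data file is touched.
    val wanted = filters.collect { case EqualTo(attr, v) => attr -> v.toString }.toMap
    val survivingDirs = listLeafDirectories(basePath).filter { case (date, region, _) =>
      wanted.get("date").forall(_ == date) && wanted.get("region").forall(_ == region)
    }
    // Reading the surviving directories as Parquet is elided; the pruning
    // above is the part that currently has to be hand-rolled.
    println(s"Would scan: ${survivingDirs.map(_._3).mkString(", ")}")
    sqlContext.sparkContext.emptyRDD[Row]
  }

  // Stub: enumerate (date, region, path) triples for the leaf directories.
  private def listLeafDirectories(base: String): Seq[(String, String, String)] = Seq.empty
}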

On Mon, Dec 29, 2014 at 2:19 AM, Mickalas michael.belldav...@gmail.com
wrote:

 I see that there is already a request to add wildcard support to the
 SQLContext.parquetFile function
 https://issues.apache.org/jira/browse/SPARK-3928.

 What seems like a useful thing for our use case is to associate the
 directory structure with certain columns in the table, but it does not seem
 like this is supported.

 For example we want to create parquet files on a daily basis associated
 with
 geographic regions and so will create a set of files under directories such
 as:

 * 2014-12-29/Americas
 * 2014-12-29/Asia
 * 2014-12-30/Americas
 * ...

 Where queries have predicates that match the column values determinable
 from
 directory structure it would be good to only extract data from matching
 files.

 Does anyone know if something like this is supported, or whether this is a
 reasonable thing to request?

 Mick







