Hi Michael, 

I’ve looked through the example and the test cases, and I think I understand 
what we need to do, so I’ll give it a go. 

What I’d like to try is allowing files to be added at any time, so perhaps I 
can cache the partition info. It would also be useful for us to derive the 
schema from the set of all files; hopefully that is achievable as well.
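As a rough illustration of what I mean by deriving the schema from the set of all files, here is a minimal sketch (plain Python, not Spark; the function name and the column-to-type dictionaries are made up for illustration) that takes the union of each file's columns and fails on a type conflict:

```python
# Hypothetical sketch: derive a table schema as the union of per-file schemas.
# Each schema is modelled as a (column name -> type name) dict.

def merge_schemas(schemas):
    """Union the column->type mappings of every file, rejecting conflicts."""
    merged = {}
    for schema in schemas:
        for column, col_type in schema.items():
            if column in merged and merged[column] != col_type:
                raise ValueError("conflicting types for column %r" % column)
            merged[column] = col_type
    return merged

# Two files that each carry only part of the full set of columns.
file_a = {"region": "string", "revenue": "double"}
file_b = {"region": "string", "units": "int"}

print(merge_schemas([file_a, file_b]))
# -> {'region': 'string', 'revenue': 'double', 'units': 'int'}
```

A real implementation would of course read footers from the Parquet files themselves and resolve compatible types rather than rejecting them outright.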

Thanks

Mick


> On 30 Dec 2014, at 04:49, Michael Armbrust <mich...@databricks.com> wrote:
> 
> You can't do this now without writing a bunch of custom logic (see here for 
> an example: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
> 
> I would like to make this easier as part of improvements to the datasources 
> api that we are planning for Spark 1.3
> 
> On Mon, Dec 29, 2014 at 2:19 AM, Mickalas <michael.belldav...@gmail.com> wrote:
> I see that there is already a request to add wildcard support to the
> SQLContext.parquetFile function
> https://issues.apache.org/jira/browse/SPARK-3928.
> 
> What seems like a useful thing for our use case is to associate the
> directory structure with certain columns in the table, but it does not seem
> like this is supported.
> 
> For example we want to create parquet files on a daily basis associated with
> geographic regions and so will create a set of files under directories such
> as:
> 
> * 2014-12-29/Americas
> * 2014-12-29/Asia
> * 2014-12-30/Americas
> * ...
> 
> Where queries have predicates that match column values determinable from the
> directory structure, it would be good to extract data only from the matching
> files.
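
The kind of pruning described above can be sketched in a few lines (plain Python, not Spark; the function names and the date/region column mapping are assumptions for illustration): map each path's segments to column values, then keep only the files whose derived columns satisfy the query predicate.

```python
# Hypothetical sketch: treat "2014-12-29/Americas"-style paths as carrying
# (date, region) column values, and prune files against a predicate.

def partition_columns(path):
    """Derive column values from the two directory segments of a path."""
    date, region = path.split("/")
    return {"date": date, "region": region}

def prune(paths, predicate):
    """Keep only files whose directory-derived columns satisfy the predicate."""
    return [p for p in paths if predicate(partition_columns(p))]

paths = ["2014-12-29/Americas", "2014-12-29/Asia", "2014-12-30/Americas"]
print(prune(paths, lambda cols: cols["region"] == "Asia"))
# -> ['2014-12-29/Asia']
```

The point is that a query with a predicate on `region` would then never open the files under the non-matching directories at all.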
> 
> Does anyone know if something like this is supported, or whether this is a
> reasonable thing to request?
> 
> Mick
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
