Hi Michael, 

I have got the directory-based column support working, at least as a trial. The 
trial code is here - DirIndexParquet.scala 
<https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala> 
- it involved copying quite a lot of newParquet.

There are some tests here 
<https://github.com/MickDavies/spark-parquet-dirindex/tree/master/src/test/scala/org/apache/spark/sql/parquet> 
that illustrate its use.
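
To give a flavour of the intended usage, registration through the data sources 
DDL would look roughly like the sketch below. The option names (and whether the 
class is registrable this way at all) are made up for illustration - the tests 
linked above show the real usage.

// Hypothetical registration of the trial relation via the data sources DDL.
// The provider and option names here are illustrative only.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
sqlContext.sql("""
  CREATE TEMPORARY TABLE events
  USING org.apache.spark.sql.parquet.DirIndexParquet
  OPTIONS (path 'hdfs:///data/events', dirColumns 'date,region')
""")

// Predicates on the directory-derived columns should then prune whole directories:
sqlContext.sql("SELECT count(*) FROM events WHERE date = '2014-12-29'")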

I’d be keen to help in any way with the datasources API changes that you 
mention - would you like to discuss?

Thanks

Mick



> On 30 Dec 2014, at 17:40, Michael Davies <michael.belldav...@gmail.com> wrote:
> 
> Hi Michael, 
> 
> I’ve looked through the example and the test cases and I think I understand 
> what we need to do - so I’ll give it a go. 
> 
> I think what I’d like to try is to allow files to be added at any time, so 
> perhaps I can cache partition info. What may also be useful for us would be 
> to derive the schema from the set of all files - hopefully this is 
> achievable too.
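> 
> Roughly the schema derivation I have in mind is a simple merge across the 
> per-file schemas - an untested sketch (fields matched by name, type conflicts 
> ignored for now, and the package StructType lives in varies between Spark 
> versions):
> 
> import scala.collection.mutable.LinkedHashMap
> import org.apache.spark.sql.types.{StructField, StructType}
> 
> // Union of fields across all files' schemas: first definition of each
> // field name wins, encounter order preserved.
> def mergeSchemas(fileSchemas: Seq[StructType]): StructType = {
>   val fields = LinkedHashMap.empty[String, StructField]
>   for (schema <- fileSchemas; field <- schema.fields)
>     fields.getOrElseUpdate(field.name, field)
>   StructType(fields.values.toSeq)
> }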
> 
> Thanks
> 
> Mick
> 
> 
>> On 30 Dec 2014, at 04:49, Michael Armbrust <mich...@databricks.com> wrote:
>> 
>> You can't do this now without writing a bunch of custom logic (see here for 
>> an example: 
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
>> 
>> I would like to make this easier as part of improvements to the datasources 
>> API that we are planning for Spark 1.3.
>> 
>> On Mon, Dec 29, 2014 at 2:19 AM, Mickalas <michael.belldav...@gmail.com> wrote:
>> I see that there is already a request to add wildcard support to the
>> SQLContext.parquetFile function:
>> https://issues.apache.org/jira/browse/SPARK-3928.
>> 
>> What seems like a useful thing for our use case is to associate the
>> directory structure with certain columns in the table, but it does not seem
>> like this is supported.
>> 
>> For example we want to create parquet files on a daily basis associated with
>> geographic regions and so will create a set of files under directories such
>> as:
>> 
>> * 2014-12-29/Americas
>> * 2014-12-29/Asia
>> * 2014-12-30/Americas
>> * ...
>> 
>> Where queries have predicates that match the column values determinable from 
>> the directory structure, it would be good to only extract data from the 
>> matching files.
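>> 
>> What we can do by hand today is roughly the sketch below, but we would like 
>> the directory values to come back as table columns and the pruning to happen 
>> automatically (the paths and the hard-coded predicate are illustrative only):
>> 
>> import org.apache.spark.sql.SQLContext
>> 
>> val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
>> val base = "hdfs:///data"
>> 
>> // (date, region) pairs recovered from the directory layout, e.g. by
>> // listing `base` with the Hadoop FileSystem API
>> val partitions = Seq(
>>   ("2014-12-29", "Americas"),
>>   ("2014-12-29", "Asia"),
>>   ("2014-12-30", "Americas"))
>> 
>> // Apply the predicate date = '2014-12-29' to the directory names, so only
>> // matching directories are ever read, then union the pieces.
>> val matching = partitions.collect {
>>   case (date, region) if date == "2014-12-29" =>
>>     sqlContext.parquetFile(s"$base/$date/$region")
>> }.reduce(_ unionAll _)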
>> 
>> Does anyone know if something like this is supported, or whether this is a
>> reasonable thing to request?
>> 
>> Mick
>> 
> 
