Hi Michael,

I have got directory-based column support working, at least as a trial. I have put the trial code here: DirIndexParquet.scala <https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala>; it involved copying quite a lot of newParquet. There are some tests here that illustrate its use: parquet <https://github.com/MickDavies/spark-parquet-dirindex/tree/master/src/test/scala/org/apache/spark/sql/parquet>. I'd be keen to help in any way with the datasources API changes that you mention; would you like to discuss?
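To make the idea concrete, here is a minimal, self-contained sketch of the pruning logic, separate from the DirIndexParquet code itself; the names dirColumns and prune are illustrative, not the actual API. Each leaf directory is mapped to column values derived from its path components, and only directories whose values satisfy the query predicates are kept, so non-matching Parquet files are never opened:

    // Illustrative sketch only; dirColumns and prune are made-up names,
    // not the DirIndexParquet API.
    object DirPruningSketch {

      // e.g. relative path "2014-12-29/Americas" with column names
      // Seq("date", "region") becomes
      // Map("date" -> "2014-12-29", "region" -> "Americas")
      def dirColumns(relPath: String, names: Seq[String]): Map[String, String] =
        names.zip(relPath.split('/')).toMap

      // Keep only directories whose derived column values satisfy every
      // predicate, so files in the other directories are never opened.
      def prune(dirs: Seq[String],
                names: Seq[String],
                predicates: Map[String, String => Boolean]): Seq[String] =
        dirs.filter { dir =>
          val cols = dirColumns(dir, names)
          // A predicate on a column we cannot derive keeps the directory,
          // since we cannot safely prune it.
          predicates.forall { case (col, p) => cols.get(col).forall(p) }
        }

      def main(args: Array[String]): Unit = {
        val dirs = Seq("2014-12-29/Americas", "2014-12-29/Asia", "2014-12-30/Americas")
        val kept = prune(dirs, Seq("date", "region"),
          Map[String, String => Boolean]("region" -> (_ == "Americas")))
        println(kept) // List(2014-12-29/Americas, 2014-12-30/Americas)
      }
    }

Running main shows the Asia directory being pruned before any file IO, which is the behaviour the trial code aims for on real Parquet trees.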
Thanks

Mick

> On 30 Dec 2014, at 17:40, Michael Davies <michael.belldav...@gmail.com> wrote:
>
> Hi Michael,
>
> I've looked through the example and the test cases and I think I understand what we need to do, so I'll give it a go.
>
> What I'd like to try is to allow files to be added at any time, so perhaps I can cache partition info. It would also be useful for us to derive the schema from the set of all files; hopefully that is achievable too.
>
> Thanks
>
> Mick
>
>> On 30 Dec 2014, at 04:49, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> You can't do this now without writing a bunch of custom logic (see here for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala).
>>
>> I would like to make this easier as part of improvements to the datasources API that we are planning for Spark 1.3.
>>
>> On Mon, Dec 29, 2014 at 2:19 AM, Mickalas <michael.belldav...@gmail.com> wrote:
>> I see that there is already a request to add wildcard support to the SQLContext.parquetFile function: https://issues.apache.org/jira/browse/SPARK-3928.
>>
>> What would be useful for our use case is to associate the directory structure with certain columns in the table, but it does not seem like this is supported.
>>
>> For example, we want to create Parquet files on a daily basis, associated with geographic regions, and so will create a set of files under directories such as:
>>
>> * 2014-12-29/Americas
>> * 2014-12-29/Asia
>> * 2014-12-30/Americas
>> * ...
>>
>> Where queries have predicates that match the column values determinable from the directory structure, it would be good to extract data only from the matching files.
>>
>> Does anyone know if something like this is supported, or whether this is a reasonable thing to request?
>>
>> Mick
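P.S. For anyone picking this thread up from the archive, this is the shape of query we are working towards. It is a sketch only: it assumes the root of the tree can be loaded directly and the directory levels surfaced as date and region columns, which plain parquetFile does not do today; the path and the table name "events" are made up.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DirIndexGoal {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dir-index-goal"))
        val sqlContext = new SQLContext(sc)

        // Load the whole tree. Today parquetFile needs a concrete directory
        // of Parquet files, so this line is the aspiration, not current
        // behaviour.
        val events = sqlContext.parquetFile("hdfs:///data/events")
        events.registerTempTable("events")

        // With directory-derived columns, these predicates should prune the
        // scan down to the single directory 2014-12-29/Americas before any
        // file is read.
        sqlContext.sql(
          "SELECT * FROM events WHERE date = '2014-12-29' AND region = 'Americas'"
        ).collect().foreach(println)

        sc.stop()
      }
    }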