Re: Mapping directory structure to columns in SparkSQL
Hi Michael,

I have got the directory-based column support working, at least in a trial. I have put the trial code here: DirIndexParquet.scala (https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala); it has involved me copying quite a lot of newParquet. There are some tests here that illustrate use: https://github.com/MickDavies/spark-parquet-dirindex/tree/master/src/test/scala/org/apache/spark/sql/parquet

I'd be keen to help in any way with the datasources API changes that you mention; would you like to discuss?

Thanks
Mick

On 30 Dec 2014, at 17:40, Michael Davies <michael.belldav...@gmail.com> wrote:
> Hi Michael, I've looked through the example and the test cases and I think I understand what we need to do ...
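A rough sketch of the idea being trialled in DirIndexParquet (this is not the code from the repo, just an illustration; IndexedFile and pruneFiles are made-up names): once each file carries the column values recovered from its directory path, a predicate over those columns can discard whole files before any Parquet data is read.

// Each file is tagged with the column values parsed from its directory path.
final case class IndexedFile(path: String, dirValues: Map[String, String])

// Keep only the files whose directory-derived values satisfy the predicate.
def pruneFiles(files: Seq[IndexedFile],
               predicate: Map[String, String] => Boolean): Seq[IndexedFile] =
  files.filter(f => predicate(f.dirValues))

// e.g. keep only the Americas partition for one day:
// pruneFiles(files, v => v("date") == "2014-12-29" && v("region") == "Americas")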
Re: Mapping directory structure to columns in SparkSQL
Hi Michael,

I've looked through the example and the test cases, and I think I understand what we need to do, so I'll give it a go. What I'd like to try is to allow files to be added at any time, so perhaps I can cache partition info. It would also be useful for us to derive the schema from the set of all files; hopefully this is achievable too.

Thanks
Mick

On 30 Dec 2014, at 04:49, Michael Armbrust <mich...@databricks.com> wrote:
> You can't do this now without writing a bunch of custom logic ...
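A minimal sketch of the two ideas above, caching partition info and deriving a schema from all files, under assumed shapes (PartitionCache and unionSchema are illustrative names; representing a schema as (columnName, typeName) pairs is a simplification):

import scala.collection.mutable

// Cache the discovered partition directories; refresh() picks up files
// added after the initial listing, without re-listing on every query.
class PartitionCache(list: () => Seq[String]) {
  @volatile private var cached: Seq[String] = list()
  def partitions: Seq[String] = cached
  def refresh(): Unit = { cached = list() }
}

// Derive the table schema as the union of the per-file schemas; real code
// would also have to reconcile conflicting types for the same column name.
def unionSchema(fileSchemas: Seq[Seq[(String, String)]]): Seq[(String, String)] = {
  val seen = mutable.LinkedHashMap.empty[String, String]
  for (schema <- fileSchemas; (name, tpe) <- schema)
    seen.getOrElseUpdate(name, tpe)
  seen.toSeq
}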
Mapping directory structure to columns in SparkSQL
I see that there is already a request to add wildcard support to the SQLContext.parquetFile function (https://issues.apache.org/jira/browse/SPARK-3928). What would be useful for our use case is to associate the directory structure with certain columns in the table, but this does not seem to be supported.

For example, we want to create Parquet files on a daily basis, associated with geographic regions, and so will create a set of files under directories such as:

* 2014-12-29/Americas
* 2014-12-29/Asia
* 2014-12-30/Americas
* ...

Where queries have predicates that match the column values determinable from the directory structure, it would be good to extract data only from the matching files.

Does anyone know if something like this is supported, or whether this is a reasonable thing to request?

Mick
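One hand-rolled workaround for the layout above is to prune directories against the predicate values before handing the surviving paths to parquetFile. A sketch, assuming the two-level date/region layout shown (matchingPaths and the baseDir argument are illustrative, not an existing API):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only the <date>/<region> directories matching the requested values.
def matchingPaths(baseDir: String, date: String, region: String): Seq[String] = {
  val fs = FileSystem.get(new Configuration())
  for {
    dateDir   <- fs.listStatus(new Path(baseDir)).toSeq
    if dateDir.getPath.getName == date
    regionDir <- fs.listStatus(dateDir.getPath).toSeq
    if regionDir.getPath.getName == region
  } yield regionDir.getPath.toString
}

// Usage with the Spark 1.2-era API: load and union only the matching dirs.
// val rdd = matchingPaths("hdfs:///data", "2014-12-29", "Americas")
//             .map(sqlContext.parquetFile)
//             .reduce(_ unionAll _)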
Re: Mapping directory structure to columns in SparkSQL
You can't do this now without writing a bunch of custom logic (see here for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala). I would like to make this easier as part of improvements to the datasources API that we are planning for Spark 1.3.

On Mon, Dec 29, 2014 at 2:19 AM, Mickalas <michael.belldav...@gmail.com> wrote:
> I see that there is already a request to add wildcard support to the SQLContext.parquetFile function ...
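The custom logic referred to above boils down to parsing column values out of path segments and attaching them to each file. A minimal sketch for the layout in the question (DirColumns and columnsFromPath are illustrative names, not part of any Spark API):

// Map fixed path positions to column values; the column names ("date",
// "region") come from the example directory layout.
case class DirColumns(date: String, region: String)

def columnsFromPath(path: String): Option[DirColumns] = {
  // Expect paths of the form .../<date>/<region>/<file>.parquet
  path.split("/").reverse match {
    case Array(_, region, date, _*) => Some(DirColumns(date, region))
    case _                          => None
  }
}

// columnsFromPath("hdfs:///data/2014-12-29/Americas/part-00000.parquet")
//   returns Some(DirColumns("2014-12-29", "Americas"))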