[ https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Noam Asor updated SPARK-20622:
------------------------------
    Description:
h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out partitioned parquet in 'value only' named directories, i.e. {{hdfs:///some/base/path/2017/05/06}}, and not in 'key=value' named directories, i.e. {{hdfs:///some/base/path/year=2017/month=05/day=06}}, which prevents users from leveraging Spark SQL parquet partition discovery when reading the former back.
h4. What
This issue proposes a solution that will allow Spark SQL to discover parquet partitions for 'value only' named directories.
h4. How
By introducing a new Spark SQL read option, *partitionTemplate*.
*partitionTemplate* is in path form: it should include the base path followed by the missing 'key=' segments, serving as a template for transforming 'value only' named dirs into 'key=value' named dirs. In the example above this will look like:
{{hdfs:///some/base/path/year=/month=/day=/}}.
To simplify the solution, this option should be tied to the *basePath* option, meaning that *partitionTemplate* is valid only if *basePath* is also set.
In the end, for the above scenario, this will look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to do parquet partition discovery on the following directory tree:
{code}
some
  |--base
     |--path
        |--2016
           |--...
        |--2017
           |--01
           |--02
              |--...
              |--15
              |--...
           |--...
{code}
adding to the schema of the resulting DataFrame the columns year, month and day, with their respective values, as expected.

  was:
h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out partitioned parquet in 'value only' named directories, i.e. {{hdfs:///some/base/path/2017/05/06}}, and not in 'key=value' named directories, i.e.
{{hdfs:///some/base/path/year=2017/month=05/day=06}}, which prevents users from leveraging Spark SQL parquet partition discovery when reading the former back.
h4. What
This issue proposes a solution that will allow Spark SQL to discover parquet partitions for 'value only' named directories.
h4. how
By introducing a new Spark SQL read option, *partitionTemplate*.
*partitionTemplate* is in path form: it should include the base path followed by the missing 'key=' segments, serving as a template for transforming 'value only' named dirs into 'key=value' named dirs. In the example above this will look like:
{{hdfs:///some/base/path/year=/month=/day=/}}.
To simplify the solution, this option should be tied to the *basePath* option, meaning that *partitionTemplate* is valid only if *basePath* is also set.
In the end, for the above scenario, this will look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("basePath", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to do parquet partition discovery on the following directory tree:
{code}
some
  |--base
     |--path
        |--2016
           |--...
        |--2017
           |--01
           |--02
              |--...
              |--15
              |--...
           |--...
{code}
adding to the schema of the resulting DataFrame the columns year, month and day, with their respective values, as expected.


> Parquet partition discovery for non key=value named directories
> ---------------------------------------------------------------
>
>                 Key: SPARK-20622
>                 URL: https://issues.apache.org/jira/browse/SPARK-20622
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Noam Asor
>

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
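
The path transformation that the proposed *partitionTemplate* option describes can be sketched in plain Scala, independent of Spark. This is only an illustration under stated assumptions: *partitionTemplate* is a proposal in this issue and not an existing Spark option, and the names {{PartitionTemplate}}, {{keys}} and {{toKeyValuePath}} are hypothetical helpers invented for the example. The sketch assumes the template is the base path followed by one 'key=' segment per partition level.

```scala
// Hypothetical helper, not part of Spark: illustrates how a template such as
// <base>/year=/month=/day=/ could map 'value only' dirs to 'key=value' dirs.
object PartitionTemplate {

  /** Splits a path into its non-empty segments. */
  private def segments(path: String): List[String] =
    path.split("/").filter(_.nonEmpty).toList

  /** Partition keys named by the template segments that follow the base path,
    * e.g. "year=", "month=", "day=" yields List("year", "month", "day"). */
  def keys(basePath: String, template: String): List[String] = {
    require(template.startsWith(basePath), "template must start with basePath")
    segments(template.stripPrefix(basePath)).map { seg =>
      require(seg.endsWith("=") && seg.length > 1, s"bad template segment: $seg")
      seg.dropRight(1)
    }
  }

  /** Rewrites <base>/2017/05/06 as <base>/year=2017/month=05/day=06.
    * Segments below the partition levels (e.g. file names) are kept as-is. */
  def toKeyValuePath(basePath: String, template: String, path: String): String = {
    require(path.startsWith(basePath), "path must start with basePath")
    val ks = keys(basePath, template)
    val values = segments(path.stripPrefix(basePath))
    require(values.length >= ks.length, "path has fewer levels than the template")
    val (partitionValues, rest) = values.splitAt(ks.length)
    val renamed = ks.zip(partitionValues).map { case (k, v) => s"$k=$v" }
    (basePath.stripSuffix("/") :: renamed ::: rest).mkString("/")
  }
}
```

With {{basePath}} set to {{hdfs:///some/base/path}} and the template {{hdfs:///some/base/path/year=/month=/day=/}}, {{toKeyValuePath}} turns {{hdfs:///some/base/path/2017/05/06}} into {{hdfs:///some/base/path/year=2017/month=05/day=06}}, which is the layout that partition discovery already understands. How an actual implementation would hook into Spark's file listing is left open by this issue.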