[ https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noam Asor updated SPARK-20622:
------------------------------
    Description: 
h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out 
partitioned Parquet in 'value only' named directories, e.g. 
{{hdfs:///some/base/path/2017/05/06}}, rather than in 'key=value' named 
directories, e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This 
prevents users from leveraging Spark SQL Parquet partition discovery when 
reading the former back.
h4. What
This issue proposes a solution that allows Spark SQL to discover Parquet 
partitions in 'value only' named directories.
h4. How
Introduce a new Spark SQL read option, *partitionTemplate*.
*partitionTemplate* is a path consisting of the base path followed by the 
missing 'key=' segments; it serves as a template for mapping 'value only' named 
dirs to 'key=value' named dirs. For the example above it would look like: 
{{hdfs:///some/base/path/year=/month=/day=/}}.
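To illustrate the intended semantics, here is a minimal sketch in plain Python (not the proposed Spark implementation; the helper name and the zip-based mapping are assumptions about how the template would be interpreted). The template supplies the keys that the 'value only' path is missing:

```python
# Hypothetical illustration of the proposed partitionTemplate semantics,
# NOT Spark code: zip the 'key=' segments of the template with the path
# segments that follow the base path to recover the partition columns.
def apply_partition_template(template, path, base="hdfs:///some/base/path"):
    keys = [seg.rstrip("=") for seg in template[len(base):].strip("/").split("/")]
    values = path[len(base):].strip("/").split("/")
    return dict(zip(keys, values))

print(apply_partition_template(
    "hdfs:///some/base/path/year=/month=/day=/",
    "hdfs:///some/base/path/2017/05/06"))
# {'year': '2017', 'month': '05', 'day': '06'}
```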

To keep the solution simple, this option is tied to the *basePath* option: 
*partitionTemplate* is valid only if *basePath* is also set.
For the above scenario, the read would look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to do Parquet partition discovery on the following 
directory tree:
{code}
some
  |--base
       |--path
             |--2016
                  |--...
             |--2017
                   |--01
                   |--02
                       |--...
                       |--15
                       |--...
                   |--...
{code}
adding the columns {{year}}, {{month}}, and {{day}}, with their respective 
values, to the schema of the resulting DataFrame, as expected.
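Until such an option exists, one workaround is to mirror the 'value only' tree into a 'key=value' tree so that the existing partition discovery applies unchanged. The sketch below is plain Python under the assumption that the data is reachable as a local or mounted filesystem; {{mirror_as_key_value}} is a hypothetical helper, not part of any Spark or Hadoop API:

```python
# Workaround sketch: copy a 'value only' directory tree into a 'key=value'
# tree, renaming each partition level, so Spark's existing partition
# discovery can be pointed at the mirrored location.
import os
import shutil

def mirror_as_key_value(src_root, dst_root, keys):
    """Copy src_root into dst_root, renaming each of the first len(keys)
    directory levels from '<value>' to '<key>=<value>'."""
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        parts = [] if rel == "." else rel.split(os.sep)
        # Rename only the partition levels; deeper levels are copied as-is.
        mapped = [f"{k}={v}" for k, v in zip(keys, parts)] + parts[len(keys):]
        dst_dir = os.path.join(dst_root, *mapped)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            shutil.copy2(os.path.join(dirpath, name), os.path.join(dst_dir, name))
```

After mirroring, {{spark.read.parquet}} with *basePath* pointed at the mirrored root discovers the {{year}}, {{month}}, and {{day}} partitions as usual; the cost is the extra copy (a symlink- or rename-based variant would avoid it where the storage layer permits).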



> Parquet partition discovery for non key=value named directories
> ---------------------------------------------------------------
>
>                 Key: SPARK-20622
>                 URL: https://issues.apache.org/jira/browse/SPARK-20622
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Noam Asor
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
