[jira] [Created] (SPARK-20622) Parquet partition discovery for non key=value named directories

2017-05-06 Thread Noam Asor (JIRA)
Noam Asor created SPARK-20622:
-

 Summary: Parquet partition discovery for non key=value named 
directories
 Key: SPARK-20622
 URL: https://issues.apache.org/jira/browse/SPARK-20622
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Noam Asor


h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out 
partitioned Parquet into 'value only' named directories, e.g. 
{{hdfs:///some/base/path/2017/05/06}}, rather than 'key=value' named directories, 
e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This prevents users 
from leveraging Spark SQL Parquet partition discovery when reading the former 
back.
h4. What
This issue proposes a solution that allows Spark SQL to discover Parquet 
partitions in 'value only' named directories.
h4. How
By introducing a new Spark SQL read option, *partitionTemplate*.
*partitionTemplate* takes the form of a path: the base path followed by the 
missing 'key=' segments, serving as a template for mapping 'value only' named 
directories to 'key=value' named directories. For the example above this would 
look like: {{hdfs:///some/base/path/year=/month=/day=/}}.

To simplify the solution, this option should be tied to the *basePath* option, 
meaning that *partitionTemplate* is valid only if *basePath* is also set.
For the above scenario, this would look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("basePath", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to perform Parquet partition discovery on the 
following directory tree:
{code}
some
 |--base
     |--path
         |--2016
         |   |--...
         |--2017
             |--01
             |--02
             |--...
             |--15
             |--...
{code}
adding the columns year, month, and day, with their respective values, to the 
schema of the resulting DataFrame, as expected.
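
For illustration only, here is a minimal sketch of the intended template logic, 
assuming a hypothetical helper {{partitionsFor}} (this is not an existing Spark 
API): the trailing 'key=' segments of the template name the partition columns, 
and they are zipped with the trailing segments of each data path to recover the 
'key=value' pairs.
{code}
// Hypothetical sketch, not an existing Spark API: recover 'key=value'
// partition pairs from a 'value only' path using a partitionTemplate.
object PartitionTemplateSketch {
  def partitionsFor(template: String, path: String): List[(String, String)] = {
    // Template segments ending in "=" name the partition keys, in order.
    val keys = template.split("/").filter(_.endsWith("=")).map(_.dropRight(1)).toList
    // The last keys.length segments of the data path carry the values.
    val values = path.split("/").takeRight(keys.length).toList
    keys.zip(values)
  }

  def main(args: Array[String]): Unit = {
    val template = "hdfs:///some/base/path/year=/month=/day=/"
    val path = "hdfs:///some/base/path/2017/05/06"
    // Prints: List((year,2017), (month,05), (day,06))
    println(partitionsFor(template, path))
  }
}
{code}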





[jira] [Updated] (SPARK-20622) Parquet partition discovery for non key=value named directories

2017-05-10 Thread Noam Asor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Asor updated SPARK-20622:
--
Description: 
h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out 
partitioned Parquet into 'value only' named directories, e.g. 
{{hdfs:///some/base/path/2017/05/06}}, rather than 'key=value' named directories, 
e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This prevents users 
from leveraging Spark SQL Parquet partition discovery when reading the former 
back.
h4. What
This issue proposes a solution that allows Spark SQL to discover Parquet 
partitions in 'value only' named directories.
h4. How
By introducing a new Spark SQL read option, *partitionTemplate*.
*partitionTemplate* takes the form of a path: the base path followed by the 
missing 'key=' segments, serving as a template for mapping 'value only' named 
directories to 'key=value' named directories. For the example above this would 
look like: {{hdfs:///some/base/path/year=/month=/day=/}}.

To simplify the solution, this option should be tied to the *basePath* option, 
meaning that *partitionTemplate* is valid only if *basePath* is also set.
For the above scenario, this would look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to perform Parquet partition discovery on the 
following directory tree:
{code}
some
 |--base
     |--path
         |--2016
         |   |--...
         |--2017
             |--01
             |--02
             |--...
             |--15
             |--...
{code}
adding the columns year, month, and day, with their respective values, to the 
schema of the resulting DataFrame, as expected.

  was:
h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out 
partitioned Parquet into 'value only' named directories, e.g. 
{{hdfs:///some/base/path/2017/05/06}}, rather than 'key=value' named directories, 
e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This prevents users 
from leveraging Spark SQL Parquet partition discovery when reading the former 
back.
h4. What
This issue proposes a solution that allows Spark SQL to discover Parquet 
partitions in 'value only' named directories.
h4. how
By introducing a new Spark SQL read option, *partitionTemplate*.
*partitionTemplate* takes the form of a path: the base path followed by the 
missing 'key=' segments, serving as a template for mapping 'value only' named 
directories to 'key=value' named directories. For the example above this would 
look like: {{hdfs:///some/base/path/year=/month=/day=/}}.

To simplify the solution, this option should be tied to the *basePath* option, 
meaning that *partitionTemplate* is valid only if *basePath* is also set.
For the above scenario, this would look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("basePath", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to perform Parquet partition discovery on the 
following directory tree:
{code}
some
 |--base
     |--path
         |--2016
         |   |--...
         |--2017
             |--01
             |--02
             |--...
             |--15
             |--...
{code}
adding the columns year, month, and day, with their respective values, to the 
schema of the resulting DataFrame, as expected.


> Parquet partition discovery for non key=value named directories
> ---
>
> Key: SPARK-20622
> URL: https://issues.apache.org/jira/browse/SPARK-20622
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Noam Asor
>
> h4. Why
> There are cases where traditional M/R jobs and RDD-based Spark jobs write 
> out partitioned Parquet into 'value only' named directories, e.g. 
> {{hdfs:///some/base/path/2017/05/06}}, rather than 'key=value' named 
> directories, e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This 
> prevents users from leveraging Spark SQL Parquet partition discovery when 
> reading the former back.
> h4. What
> This issue proposes a solution that allows Spark SQL to discover Parquet 
> partitions in 'value only' named directories.
> h4. How
> By introducing a new Spark SQL read option, *partitionTemplate*.
> *partitionTemplate* takes the form of a path: the base path followed by the 
> missing 'key=' segments, serving as a template for mapping 'value only' 
> named directories to 'key=value' named directories. For the example above 
> this would look like: {{hdfs:///some/base/path/year=/month=/day=/}}.

[jira] [Updated] (SPARK-20622) Parquet partition discovery for non key=value named directories

2017-06-07 Thread Noam Asor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Asor updated SPARK-20622:
--
Priority: Minor  (was: Major)

> Parquet partition discovery for non key=value named directories
> ---
>
> Key: SPARK-20622
> URL: https://issues.apache.org/jira/browse/SPARK-20622
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Noam Asor
>Priority: Minor
>
> h4. Why
> There are cases where traditional M/R jobs and RDD-based Spark jobs write 
> out partitioned Parquet into 'value only' named directories, e.g. 
> {{hdfs:///some/base/path/2017/05/06}}, rather than 'key=value' named 
> directories, e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This 
> prevents users from leveraging Spark SQL Parquet partition discovery when 
> reading the former back.
> h4. What
> This issue proposes a solution that allows Spark SQL to discover Parquet 
> partitions in 'value only' named directories.
> h4. How
> By introducing a new Spark SQL read option, *partitionTemplate*.
> *partitionTemplate* takes the form of a path: the base path followed by the 
> missing 'key=' segments, serving as a template for mapping 'value only' 
> named directories to 'key=value' named directories. For the example above 
> this would look like: {{hdfs:///some/base/path/year=/month=/day=/}}.
> To simplify the solution, this option should be tied to the *basePath* 
> option, meaning that *partitionTemplate* is valid only if *basePath* is 
> also set.
> For the above scenario, this would look something like:
> {code}
> spark.read
>   .option("basePath", "hdfs:///some/base/path")
>   .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
>   .parquet(...)
> {code}
> which will allow Spark SQL to perform Parquet partition discovery on the 
> following directory tree:
> {code}
> some
>  |--base
>      |--path
>          |--2016
>          |   |--...
>          |--2017
>              |--01
>              |--02
>              |--...
>              |--15
>              |--...
> {code}
> adding the columns year, month, and day, with their respective values, to 
> the schema of the resulting DataFrame, as expected.





[jira] [Commented] (SPARK-20622) Parquet partition discovery for non key=value named directories

2017-06-07 Thread Noam Asor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040515#comment-16040515
 ] 

Noam Asor commented on SPARK-20622:
---

The provided pull request is not complete and is rather in a POC state.
If it proves useful enough to be looked at and considered as part of Spark, 
it should be polished first.

> Parquet partition discovery for non key=value named directories
> ---
>
> Key: SPARK-20622
> URL: https://issues.apache.org/jira/browse/SPARK-20622
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Noam Asor
>Priority: Minor
>
> h4. Why
> There are cases where traditional M/R jobs and RDD-based Spark jobs write 
> out partitioned Parquet into 'value only' named directories, e.g. 
> {{hdfs:///some/base/path/2017/05/06}}, rather than 'key=value' named 
> directories, e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}. This 
> prevents users from leveraging Spark SQL Parquet partition discovery when 
> reading the former back.
> h4. What
> This issue proposes a solution that allows Spark SQL to discover Parquet 
> partitions in 'value only' named directories.
> h4. How
> By introducing a new Spark SQL read option, *partitionTemplate*.
> *partitionTemplate* takes the form of a path: the base path followed by the 
> missing 'key=' segments, serving as a template for mapping 'value only' 
> named directories to 'key=value' named directories. For the example above 
> this would look like: {{hdfs:///some/base/path/year=/month=/day=/}}.
> To simplify the solution, this option should be tied to the *basePath* 
> option, meaning that *partitionTemplate* is valid only if *basePath* is 
> also set.
> For the above scenario, this would look something like:
> {code}
> spark.read
>   .option("basePath", "hdfs:///some/base/path")
>   .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
>   .parquet(...)
> {code}
> which will allow Spark SQL to perform Parquet partition discovery on the 
> following directory tree:
> {code}
> some
>  |--base
>      |--path
>          |--2016
>          |   |--...
>          |--2017
>              |--01
>              |--02
>              |--...
>              |--15
>              |--...
> {code}
> adding the columns year, month, and day, with their respective values, to 
> the schema of the resulting DataFrame, as expected.





[jira] [Created] (SPARK-14999) RDDs union checks if the ==

2016-04-29 Thread Noam Asor (JIRA)
Noam Asor created SPARK-14999:
-

 Summary: RDDs union checks if the ==
 Key: SPARK-14999
 URL: https://issues.apache.org/jira/browse/SPARK-14999
 Project: Spark
  Issue Type: Improvement
Reporter: Noam Asor








[jira] [Closed] (SPARK-14999) RDDs union checks if the ==

2016-04-29 Thread Noam Asor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Asor closed SPARK-14999.
-
Resolution: Invalid

> RDDs union checks if the ==
> ---
>
> Key: SPARK-14999
> URL: https://issues.apache.org/jira/browse/SPARK-14999
> Project: Spark
>  Issue Type: Improvement
>Reporter: Noam Asor
>






[jira] [Updated] (SPARK-14999) RDDs union

2016-04-29 Thread Noam Asor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Asor updated SPARK-14999:
--
Summary: RDDs union   (was: RDDs union checks if the ==)

> RDDs union 
> ---
>
> Key: SPARK-14999
> URL: https://issues.apache.org/jira/browse/SPARK-14999
> Project: Spark
>  Issue Type: Improvement
>Reporter: Noam Asor
>







[jira] [Commented] (SPARK-22247) Hive partition filter very slow

2017-10-12 Thread Noam Asor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201971#comment-16201971
 ] 

Noam Asor commented on SPARK-22247:
---

Maybe this issue is related to SPARK-17992.

> Hive partition filter very slow
> ---
>
> Key: SPARK-22247
> URL: https://issues.apache.org/jira/browse/SPARK-22247
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Patrick Duin
>Priority: Minor
>
> I found an issue where filtering partitions using a DataFrame results in very 
> bad performance.
> To reproduce:
> Create a Hive table with a lot of partitions and write a Spark query on that 
> table that filters based on the partition column.
> In my use case I've got a table with about 30k partitions.
> I filter the partitions using some Scala via spark-shell:
> {{table.filter("partition=x or partition=y")}}
> This results in a Hive Thrift API call {{#get_partitions('db', 'table', 
> -1)}}, which is very slow (minutes) and loads all metastore partitions into 
> memory.
> Doing a simpler filter:
> {{table.filter("partition=x")}}
> results in a Hive Thrift API call {{#get_partitions_by_filter('db', 'table', 
> 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches 
> partition x into memory.
> If possible, Spark should translate the filter into the more performant 
> Thrift call, or fall back to a more scalable solution where it filters out 
> partitions without having to load them all into memory first (for instance, 
> fetching the partitions in batches).
> I've posted my original question on 
> [SO|https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions]
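
As a stop-gap, here is a hedged workaround sketch for the behavior reported 
above (table and column names are placeholders, and it assumes, per the report, 
that simple equality filters push down via {{#get_partitions_by_filter}}): 
split the OR predicate into simple equality filters and union the results.
{code}
// Workaround sketch under the reported behavior: avoid the OR predicate
// (which reportedly falls back to the slow #get_partitions call) by issuing
// two simple equality filters that each push down, then unioning the results.
// "db.table" and "partition" are placeholder names.
import org.apache.spark.sql.{DataFrame, SparkSession}

def readTwoPartitions(spark: SparkSession): DataFrame = {
  val table = spark.table("db.table")
  val px = table.filter("partition = 'x'") // pushes down per the report
  val py = table.filter("partition = 'y'")
  px.union(py) // distinct partition values, so the union adds no duplicates
}
{code}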


