[jira] [Commented] (SPARK-25411) Implement range partition in Spark

2019-09-23 Thread Christopher Hoshino-Fish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936176#comment-16936176
 ] 

Christopher Hoshino-Fish commented on SPARK-25411:
--

I've also done this in the past to balance skewed partitions and get uniform 
query performance. It was really effective, and it would be great to see this 
combined with computed/derived partitions.

> Implement range partition in Spark
> --
>
> Key: SPARK-25411
> URL: https://issues.apache.org/jira/browse/SPARK-25411
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang, Gang
>Priority: Major
> Attachments: range partition design doc.pdf
>
>
> In our production environment, there are some partitioned fact tables, all of 
> which are quite huge. To accelerate join execution, we also need to make them 
> bucketed. Then comes the problem: if the bucket number is large enough, there 
> may be too many files (file count = bucket number * partition count), which 
> may put pressure on HDFS. And if the bucket number is small, Spark will 
> launch an equal number of tasks to read/write it.
>  
> So, can we implement a new partition type that supports range values, just 
> like range partitioning in Oracle/MySQL 
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
>  Say, we could partition by a date column and make every two months a 
> partition, or partition by an integer column and use an interval of 1 per 
> partition.
>  
> Ideally, a feature like range partitioning should be implemented in Hive. 
> However, it has always been hard to update the Hive version in a production 
> environment, and it is much more lightweight and flexible to implement it in 
> Spark.
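The bucketing idea in the description above can be sketched as a plain function: each value maps to the range partition covering its interval. This is an illustrative sketch only (the function names and the month-index approximation of "every two months" are my own, not part of the proposal or Spark's API):

```python
from datetime import date

def range_partition_id(value, base, interval):
    # Values in [base + k*interval, base + (k+1)*interval) land in partition k.
    return (value - base) // interval

def month_index(d: date) -> int:
    # Flatten a date to a running month count so intervals of N months
    # become intervals of N on an integer axis.
    return d.year * 12 + (d.month - 1)

def date_partition_id(d: date, epoch: date, months_per_partition: int = 2) -> int:
    # "Every two months as a partition": bucket the month index.
    return (month_index(d) - month_index(epoch)) // months_per_partition

# Integer column with an interval of 1: each value is its own partition.
print(range_partition_id(42, base=0, interval=1))                    # 42
# Date column, two-month buckets relative to 2018-01-01.
print(date_partition_id(date(2018, 5, 10), epoch=date(2018, 1, 1)))  # 2
```

With a scheme like this, the file count scales with the value range actually present rather than with bucket count times partition count.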



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28977) JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page

2019-09-05 Thread Christopher Hoshino-Fish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923741#comment-16923741
 ] 

Christopher Hoshino-Fish commented on SPARK-28977:
--

thanks Sean!

> JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
> -
>
> Key: SPARK-28977
> URL: https://issues.apache.org/jira/browse/SPARK-28977
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Christopher Hoshino-Fish
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> [https://spark.apache.org/docs/2.4.3/sql-data-sources-jdbc.html]
> Specifically in the partitionColumn section, this page says:
> "{{partitionColumn}} must be a numeric, date, or timestamp column from the 
> table in question."
>  
> But then in this doc: 
> [https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameReader]
> in def jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties): 
> DataFrame
> we have:
> columnName
> the name of a column of integral type that will be used for partitioning.
>  
> This appears to go back pretty far, to 1.6.3, but I'm not sure when this was 
> accurate.






[jira] [Created] (SPARK-28977) JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page

2019-09-04 Thread Christopher Hoshino-Fish (Jira)
Christopher Hoshino-Fish created SPARK-28977:


 Summary: JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source 
Page
 Key: SPARK-28977
 URL: https://issues.apache.org/jira/browse/SPARK-28977
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.3
Reporter: Christopher Hoshino-Fish
 Fix For: 2.4.3


[https://spark.apache.org/docs/2.4.3/sql-data-sources-jdbc.html]

Specifically in the partitionColumn section, this page says:

"{{partitionColumn}} must be a numeric, date, or timestamp column from the 
table in question."

 

But then in this doc: 
[https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameReader]

in def jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties): 
DataFrame

we have:

columnName

the name of a column of integral type that will be used for partitioning.

 

This appears to go back pretty far, to 1.6.3, but I'm not sure when this was 
accurate.
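For context on why the column type matters at all: a JDBC reader splits the table into numPartitions WHERE clauses over the partition column, which requires the column to support ordered comparison and stride arithmetic. The sketch below is a simplified illustration of that idea, not Spark's exact `JDBCRelation` algorithm; the function name and boundary handling are my own:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    # Split [lower, upper) into num_partitions strides and emit one
    # WHERE predicate per partition. First partition is unbounded below
    # (and catches NULLs), last is unbounded above, so no rows are lost.
    stride = (upper - lower) // num_partitions
    preds = []
    bound = lower
    for i in range(num_partitions):
        if i == 0:
            preds.append(f"{column} < {bound + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {bound}")
        else:
            preds.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return preds

for p in jdbc_partition_predicates("id", 0, 100, 4):
    print(p)
```

Both an integral column (the Long-typed `lowerBound`/`upperBound` overload) and a date/timestamp column (the options-based `partitionColumn` path) can satisfy this, which is exactly the gap between the two docs.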






[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Christopher Hoshino-Fish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431590#comment-16431590
 ] 

Christopher Hoshino-Fish commented on SPARK-23946:
--

[~hyukjin.kwon] thanks for the feedback!

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]






[jira] [Updated] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Christopher Hoshino-Fish (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Hoshino-Fish updated SPARK-23946:
-
Description: 
Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for the 
source code
 
[https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
 click on the Source link and it goes to:
 
[https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]

  was:
Currently the 2.3.0 scaladocs point towards Sameer's github for the source code
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$
click on the Source link and it goes to:
https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala


> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
> Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]






[jira] [Created] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Christopher Hoshino-Fish (JIRA)
Christopher Hoshino-Fish created SPARK-23946:


 Summary: 2.3.0 and Latest ScalaDocs are linked to the wrong source 
code
 Key: SPARK-23946
 URL: https://issues.apache.org/jira/browse/SPARK-23946
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.3.0
Reporter: Christopher Hoshino-Fish
 Fix For: 2.3.1


Currently the 2.3.0 scaladocs point towards Sameer's github for the source code
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$
click on the Source link and it goes to:
https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala


