[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936176#comment-16936176 ]

Christopher Hoshino-Fish commented on SPARK-25411:
--------------------------------------------------

I've also done this in the past to balance skewed partitions and get uniform query performance. It was really effective, and it would be great to see this combined with computed/derived partitions.

> Implement range partition in Spark
> ----------------------------------
>
>                 Key: SPARK-25411
>                 URL: https://issues.apache.org/jira/browse/SPARK-25411
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Wang, Gang
>            Priority: Major
>         Attachments: range partition design doc.pdf
>
>
> In our production environment, there are some partitioned fact tables, all of which are quite large. To accelerate join execution, we need to make them bucketed as well. Then comes the problem: if the bucket number is large, there may be too many files (file count = bucket number * partition count), which may put pressure on HDFS. And if the bucket number is small, Spark will launch an equal number of tasks to read/write the table.
>
> So, can we implement a new partition type that supports range values, just like range partitioning in Oracle/MySQL ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])? For example, we could partition by a date column and make every two months a partition, or partition by an integer column and make each interval of 1 a partition.
>
> Ideally, a feature like range partitioning should be implemented in Hive. However, it has always been hard to upgrade the Hive version in a prod environment, and it would be much more lightweight and flexible to implement it in Spark.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
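To make the numbers in the issue concrete, here is a minimal plain-Python sketch of the file-count formula quoted above (file count = bucket number * partition count) and of assigning a date to a two-month range. It is only an illustration, not Spark or Hive code: the function names, the 200-bucket figure, the two-year window, and the choice to start ranges on odd months are all assumptions made for the example.

```python
from datetime import date
from typing import Tuple


def file_count(bucket_number: int, partition_count: int) -> int:
    """File count formula from the issue: one file per bucket per partition."""
    return bucket_number * partition_count


def two_month_range(d: date) -> Tuple[date, date]:
    """Return the [start, end) bounds of the two-month range containing d.

    Assumption for this sketch: ranges start on odd months
    (Jan-Feb, Mar-Apr, ..., Nov-Dec).
    """
    start_month = d.month - ((d.month - 1) % 2)
    start = date(d.year, start_month, 1)
    if start_month + 2 > 12:
        end = date(d.year + 1, 1, 1)  # Nov-Dec range ends at the next New Year
    else:
        end = date(d.year, start_month + 2, 1)
    return start, end


# Hypothetical table: two years of data, 200 buckets.
daily = file_count(bucket_number=200, partition_count=730)   # one partition per day
ranged = file_count(bucket_number=200, partition_count=12)   # one partition per two months
```

With these assumed figures, daily partitions give 146,000 files while two-month range partitions give 2,400 for the same bucket number, which is the trade-off the reporter describes.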
[jira] [Commented] (SPARK-28977) JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
[ https://issues.apache.org/jira/browse/SPARK-28977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923741#comment-16923741 ]

Christopher Hoshino-Fish commented on SPARK-28977:
--------------------------------------------------

Thanks, Sean!

> JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
> -------------------------------------------------------------
>
>                 Key: SPARK-28977
>                 URL: https://issues.apache.org/jira/browse/SPARK-28977
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.4.3
>            Reporter: Christopher Hoshino-Fish
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.4.5, 3.0.0
>
>
> [https://spark.apache.org/docs/2.4.3/sql-data-sources-jdbc.html]
> Specifically in the partitionColumn section, this page says:
> "{{partitionColumn}} must be a numeric, date, or timestamp column from the table in question."
>
> But then in this doc:
> [https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameReader]
> in def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame
> we have:
> columnName
> the name of a column of integral type that will be used for partitioning.
>
> This appears to go back pretty far, to 1.6.3, but I'm not sure when this was accurate.
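For readers unfamiliar with why the column type matters here, the following plain-Python sketch shows how a column name, a lower bound, an upper bound, and a partition count can be turned into one WHERE clause per partition. It is a simplified illustration of stride-based splitting for an integral column, not Spark's actual implementation; `stride_predicates` and its exact clause wording are hypothetical.

```python
from typing import List


def stride_predicates(column: str, lower: int, upper: int, num_partitions: int) -> List[str]:
    """Split [lower, upper) into num_partitions WHERE clauses on `column`.

    The first and last clauses are left open-ended so that rows outside
    the bounds (and NULLs) are still read by some partition.
    """
    if num_partitions <= 1 or lower >= upper:
        return ["1=1"]  # a single partition reads everything
    # Integer stride, at least 1 so the boundaries keep increasing.
    stride = max((upper - lower) // num_partitions, 1)
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    predicates = [f"{column} < {bounds[0]} OR {column} IS NULL"]
    for lo, hi in zip(bounds, bounds[1:]):
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    predicates.append(f"{column} >= {bounds[-1]}")
    return predicates


# Hypothetical table with an integer `id` column, read in 4 partitions.
clauses = stride_predicates("id", lower=0, upper=100, num_partitions=4)
```

The sketch only handles integers, matching the `Long` bounds in the `jdbc` signature quoted above; supporting date or timestamp columns, as the data source page describes, would need the bounds to be converted to and from a numeric form.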
[jira] [Created] (SPARK-28977) JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
Christopher Hoshino-Fish created SPARK-28977:
---------------------------------------------

             Summary: JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
                 Key: SPARK-28977
                 URL: https://issues.apache.org/jira/browse/SPARK-28977
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 2.4.3
            Reporter: Christopher Hoshino-Fish
             Fix For: 2.4.3

[https://spark.apache.org/docs/2.4.3/sql-data-sources-jdbc.html]
Specifically in the partitionColumn section, this page says:
"{{partitionColumn}} must be a numeric, date, or timestamp column from the table in question."

But then in this doc:
[https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameReader]
in def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame
we have:
columnName
the name of a column of integral type that will be used for partitioning.

This appears to go back pretty far, to 1.6.3, but I'm not sure when this was accurate.
[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code
[ https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431590#comment-16431590 ]

Christopher Hoshino-Fish commented on SPARK-23946:
--------------------------------------------------

[~hyukjin.kwon] thanks for the feedback!

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --------------------------------------------------------------
>
>                 Key: SPARK-23946
>                 URL: https://issues.apache.org/jira/browse/SPARK-23946
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 2.3.0
>            Reporter: Christopher Hoshino-Fish
>            Priority: Major
>              Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for the source code
>
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
> click on the Source link and it goes to:
>
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]
[jira] [Updated] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code
[ https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Hoshino-Fish updated SPARK-23946:
---------------------------------------------
    Description:
Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for the source code

[https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
click on the Source link and it goes to:

[https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]

  was:
Currently the 2.3.0 scaladocs point towards Sameer's github for the source code

https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$
click on the Source link and it goes to:

https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --------------------------------------------------------------
>
>                 Key: SPARK-23946
>                 URL: https://issues.apache.org/jira/browse/SPARK-23946
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 2.3.0
>            Reporter: Christopher Hoshino-Fish
>            Priority: Major
>              Labels: doc-impacting, docs-missing
>             Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for the source code
>
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
> click on the Source link and it goes to:
>
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]
[jira] [Created] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code
Christopher Hoshino-Fish created SPARK-23946:
---------------------------------------------

             Summary: 2.3.0 and Latest ScalaDocs are linked to the wrong source code
                 Key: SPARK-23946
                 URL: https://issues.apache.org/jira/browse/SPARK-23946
             Project: Spark
          Issue Type: Bug
          Components: Documentation
    Affects Versions: 2.3.0
            Reporter: Christopher Hoshino-Fish
             Fix For: 2.3.1

Currently the 2.3.0 scaladocs point towards Sameer's github for the source code

https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$
click on the Source link and it goes to:

https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala