[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wang, Gang updated SPARK-25411:
-------------------------------
    Attachment: range partition design doc.pdf

> Implement range partition in Spark
> ----------------------------------
>
>                 Key: SPARK-25411
>                 URL: https://issues.apache.org/jira/browse/SPARK-25411
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Wang, Gang
>            Priority: Major
>         Attachments: range partition design doc.pdf
>
>
> In our production environment there are several partitioned fact tables, all
> of which are quite large. To accelerate join execution, we also need to make
> them bucketed. Then comes the problem: if the bucket number is large, there
> may be too many files (file count = bucket number * partition count), which
> may put pressure on HDFS (see the write sketch below). And if the bucket
> number is small, Spark will launch only that number of tasks to read/write
> the table.
>
> So, can we implement a new partition type that supports range values, just
> like range partitioning in Oracle/MySQL
> (https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html)?
> Say, we could partition by a date column and make every two months a
> partition, or partition by an integer column and make every interval of
> 10,000 a partition (hypothetical sketches follow below).
>
> Ideally, a feature like range partitioning would be implemented in Hive.
> However, it has always been hard to update the Hive version in a production
> environment, and it is much more lightweight and flexible to implement it in
> Spark.
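>
> For illustration, a partitioned and bucketed write today looks roughly like
> the Scala sketch below. The table name, column names, and counts are made up
> for the example; the point is that every (partition, bucket) pair
> materializes at least one file:
>
>     // Scala sketch (assumed names/counts), assuming an active SparkSession `spark`.
>     // Every (partition, bucket) pair gets at least one file on HDFS.
>     spark.table("fact_sales")                 // hypothetical fact table
>       .write
>       .partitionBy("sale_date")               // say ~1,000 distinct dates
>       .bucketBy(500, "customer_id")           // 500 buckets for join locality
>       .sortBy("customer_id")
>       .saveAsTable("fact_sales_bucketed")     // >= 1,000 * 500 = 500,000 files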
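>
> The syntax could mirror PARTITION BY RANGE from the linked MySQL manual. The
> DDL below is purely hypothetical (no such syntax exists in Spark 2.3); the
> INTERVAL clause just restates the two-month example above to make the
> proposal concrete:
>
>     // Scala sketch: hypothetical DDL, shown only to illustrate the proposal.
>     spark.sql("""
>       CREATE TABLE fact_sales_ranged (
>         customer_id BIGINT,
>         amount      DECIMAL(18, 2),
>         sale_date   DATE
>       )
>       USING parquet
>       PARTITIONED BY RANGE (sale_date) INTERVAL 2 MONTHS   -- hypothetical clause
>     """)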
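>
> For an integer column, the mapping from a value to its partition would just
> be an interval division. A minimal sketch, assuming an interval of 10,000;
> the helper name is our own and not any existing Spark API:
>
>     // Scala sketch: rangePartitionId is a hypothetical helper, not Spark API.
>     // Math.floorDiv keeps negative values in the correct partition.
>     def rangePartitionId(value: Long, interval: Long): Long =
>       Math.floorDiv(value, interval)
>
>     rangePartitionId(123456L, 10000L)   // partition 12, i.e. [120000, 130000)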