[jira] [Commented] (SPARK-17487) Configurable bucketing info extraction
[ https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564701#comment-15564701 ]

Reynold Xin commented on SPARK-17487:
-------------------------------------

Thanks - that makes sense!

> Configurable bucketing info extraction
> --------------------------------------
>
> Key: SPARK-17487
> URL: https://issues.apache.org/jira/browse/SPARK-17487
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Tejas Patil
> Priority: Minor
>
> Spark uses a specific way to name bucketed files which is different from
> Hive's bucketed file naming scheme. For making Spark support bucketing for
> Hive tables, there needs to be a pluggable way to switch across these naming
> schemes.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17487) Configurable bucketing info extraction
[ https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563756#comment-15563756 ]

Tejas Patil commented on SPARK-17487:
-------------------------------------

[~rxin]: Since Spark native tables and Hive tables follow different naming conventions for bucketed files, I want Spark to produce Hive-compatible file names when the output is a Hive bucketed table. Currently, the read and write code paths assume that everything follows Spark's native bucketing scheme, e.g. [1]. I think that after SPARK-16879 it is possible to recognise whether a given `CatalogTable` corresponds to a Hive table or not [0]. So the change here would be to propagate that information downstream to where:
- planning is done to assign splits for tasks to read. Each task should process a single bucket, and for a join the same bucket of the two tables should be read by a given task.
- the final output files are written out.

Similar handling would be needed for hashing. I will avoid auto-detecting the hash and bucketing scheme from the actual file name and instead rely on which `Catalog` a table comes from, as that seems more robust.

[0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L137
[1] : https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L31
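The idea of choosing the bucket-id extractor from table metadata rather than from the file name can be sketched roughly as follows. This is only an illustration, not the code in the pull request: both regular expressions are assumptions about the two naming schemes (a Spark-style name ending in `_<bucket id>` before the extension, and a Hive-style name such as `000003_0` that starts with the bucket id), and every function name and the `is_hive_table` flag are hypothetical stand-ins for whatever `CatalogTable` exposes.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Assumption: Spark-style names carry the bucket id after the last underscore,
# before an optional extension, e.g. "part-00000-abc_00003.snappy.parquet".
SPARK_BUCKET_RE = re.compile(r".*_(\d+)(?:\..*)?$")

# Assumption: Hive-style names start with the bucket id, e.g. "000003_0".
HIVE_BUCKET_RE = re.compile(r"^(\d+)_\d+(?:\..*)?$")


def spark_bucket_id(file_name: str) -> Optional[int]:
    """Bucket id under the assumed Spark-style scheme, or None if no match."""
    m = SPARK_BUCKET_RE.match(file_name)
    return int(m.group(1)) if m else None


def hive_bucket_id(file_name: str) -> Optional[int]:
    """Bucket id under the assumed Hive-style scheme, or None if no match."""
    m = HIVE_BUCKET_RE.match(file_name)
    return int(m.group(1)) if m else None


def bucket_id_for(file_name: str, is_hive_table: bool) -> Optional[int]:
    """Pick the extractor from table metadata, not from the file name."""
    extractor = hive_bucket_id if is_hive_table else spark_bucket_id
    return extractor(file_name)


def group_by_bucket(file_names: List[str], is_hive_table: bool) -> Dict[int, List[str]]:
    """Group files by bucket id so one task can read exactly one bucket."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for name in file_names:
        bucket = bucket_id_for(name, is_hive_table)
        if bucket is not None:  # skip files that do not follow the scheme
            groups[bucket].append(name)
    return dict(groups)
```

Under these assumed patterns, the Hive-style name `000003_0` also matches the Spark-style pattern, but as bucket 0 instead of bucket 3. That kind of overlap is one reason relying on the table's catalog information looks more robust than inspecting file names.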
[jira] [Commented] (SPARK-17487) Configurable bucketing info extraction
[ https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561489#comment-15561489 ]

Reynold Xin commented on SPARK-17487:
-------------------------------------

[~tejasp] Given that Spark uses different file names, it seems like we can just recognize the file names to tell whether we should use Hive's hash function or Spark's. Is that what this PR is trying to do?
[jira] [Commented] (SPARK-17487) Configurable bucketing info extraction
[ https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479042#comment-15479042 ]

Tejas Patil commented on SPARK-17487:
-------------------------------------

I have a WIP for this and am looking for early feedback on the approach: https://github.com/apache/spark/pull/15040
[jira] [Commented] (SPARK-17487) Configurable bucketing info extraction
[ https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479043#comment-15479043 ]

Apache Spark commented on SPARK-17487:
--------------------------------------

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15040