[ https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563756#comment-15563756 ]
Tejas Patil commented on SPARK-17487:
-------------------------------------

[~rxin] : Since Spark native tables and Hive tables follow different naming conventions for bucketed files, I want to make Spark produce Hive-compatible file names when the output is a Hive bucketed table. Currently, the read and write code paths assume that everything follows Spark's native bucketing scheme, e.g. [1]. I think after SPARK-16879, it is possible to recognise whether a given `CatalogTable` corresponds to a Hive table [0]. So, the change here would be to propagate that information downstream to where:
- planning is done to assign splits for tasks to read. Each task should process a single bucket, and for a join the same bucket of the two tables should be read by a given task.
- the final output files are being written out.

A similar change would be needed for the hashing function. I will avoid auto-detecting the hash and bucketing scheme from the actual file names and instead rely on which `Catalog` a table comes from, as that seems more robust. A rough sketch of what such a pluggable scheme could look like is appended at the end of this message.

[0] : https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L137
[1] : https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L31

> Configurable bucketing info extraction
> --------------------------------------
>
>                 Key: SPARK-17487
>                 URL: https://issues.apache.org/jira/browse/SPARK-17487
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Tejas Patil
>            Priority: Minor
>
> Spark uses a specific way to name bucketed files which is different from
> Hive's bucketed file naming scheme. For making Spark support bucketing for
> Hive tables, there needs to be a pluggable way to switch between these naming
> schemes.
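
To make the naming-scheme point concrete, here is a minimal sketch of what a pluggable bucket-id extractor could look like. The names (`BucketingScheme`, `SparkBucketingScheme`, `HiveBucketingScheme`) are hypothetical and do not exist in Spark; the Spark regex mirrors the one in BucketingUtils [1], and the Hive pattern assumes Hive's `000003_0`-style file names.

{code:scala}
// Hypothetical sketch: a pluggable way to extract the bucket id from a data
// file name. None of these names exist in Spark; they only illustrate the
// proposal.
trait BucketingScheme {
  /** Returns the bucket id encoded in the file name, if any. */
  def getBucketId(fileName: String): Option[Int]
}

object SparkBucketingScheme extends BucketingScheme {
  // Spark's native scheme: the bucket id is the run of digits after the last
  // '_', e.g. part-r-00000-2dd664f9-..._00003.gz.parquet -> bucket 3.
  private val pattern = """.*_(\d+)(?:\..*)?$""".r
  override def getBucketId(fileName: String): Option[Int] = fileName match {
    case pattern(id) => Some(id.toInt)
    case _ => None
  }
}

object HiveBucketingScheme extends BucketingScheme {
  // Hive's scheme: the file name starts with the zero-padded bucket id,
  // e.g. 000003_0 -> bucket 3.
  private val pattern = """^(\d+)_\d+.*$""".r
  override def getBucketId(fileName: String): Option[Int] = fileName match {
    case pattern(id) => Some(id.toInt)
    case _ => None
  }
}
{code}

As described above, the scheme would be selected from the table's catalog metadata (i.e. whether the `CatalogTable` comes from Hive) rather than by sniffing file names.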
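
On the hashing point: Spark and Hive also disagree on which hash function assigns rows to buckets, so fixing file names alone would not be enough for correctness. Below is a toy comparison for a single int key. It assumes Hive's `(hashCode & Integer.MAX_VALUE) % numBuckets` rule (where an int's Hive hashCode is the value itself), and uses Scala's `MurmurHash3` purely as a stand-in for Spark's internal Murmur3 (seed 42) row hash, which is not reproduced here.

{code:scala}
import scala.util.hashing.MurmurHash3

object BucketHashing {
  // Hive-style bucket id for an int key: Hive hashes an int to itself and
  // applies (hash & Integer.MAX_VALUE) % numBuckets.
  def hiveBucketId(key: Int, numBuckets: Int): Int =
    (key & Int.MaxValue) % numBuckets

  // Spark-style bucket id, illustrative only: Spark actually hashes the whole
  // row with its own Murmur3 implementation (seed 42); Scala's MurmurHash3 is
  // just a stand-in here.
  def sparkBucketId(key: Int, numBuckets: Int): Int = {
    val h = MurmurHash3.productHash(Tuple1(key))
    ((h % numBuckets) + numBuckets) % numBuckets // non-negative modulo
  }
}

// The same key can land in different buckets under the two schemes, which is
// why the hash function must also follow the table's catalog:
//   BucketHashing.hiveBucketId(12345, 8)  vs.  BucketHashing.sparkBucketId(12345, 8)
{code}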