[ https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-27498:
----------------------------------
    Description: 
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables: it refuses to write to them unless the user overrides this protection by setting hive.enforce.bucketing and hive.enforce.sorting to false.

However, this protection is bypassed when Spark writes to the table through its built-in Parquet code path (the default, since spark.sql.hive.convertMetastoreParquet is true unless overridden).

Here's an example.

In Hive, do this (it creates a source table and two bucketed tables: a text table where things work as expected, and a Parquet table where they don't):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things work as expected: Spark rejects the insert with an AnalysisException:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the Parquet table, however, the insert simply goes through with no error or warning:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
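To confirm which command handles the write (a sketch; the exact EXPLAIN output varies by version), ask Spark for the plan of the insert. For the Parquet table it should show InsertIntoHadoopFsRelationCommand rather than InsertIntoHiveTable:
{noformat}
scala> sql("explain insert into hivebucketparq1 select 1, 2, 3").show(false)
{noformat}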
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).
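
One way to confirm the metadata change (a sketch, assuming a spark-shell session with Hive support and the tables created above) is to dump the table description before and after the insert and compare the bucketing information:
{noformat}
scala> sql("describe formatted hivebucketparq1").show(100, truncate = false)
{noformat}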

If you set spark.sql.hive.convertMetastoreParquet=false, so the insert goes through the Hive SerDe write path instead of the built-in Parquet code path, things work as expected: the insert into hivebucketparq1 is rejected just like the insert into the text table.
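
For reference, a minimal sketch of applying that workaround from spark-shell (this assumes the setting can be changed at runtime via spark.conf.set; it can also be passed with --conf when launching the shell):
{noformat}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
{noformat}
With the conversion disabled, the insert should be rejected the same way the text-table insert was.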

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand (used by the built-in Parquet code path) does not.
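
For illustration only (this is not the actual Spark source; the helper name and the config defaults below are made up), the kind of guard the Hive write path applies, and that the file-source write path could apply as well, looks roughly like this:
{noformat}
// Illustrative sketch only -- not the real Spark implementation.
// refuseIfBucketed is a made-up helper; the config defaults are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

def refuseIfBucketed(spark: SparkSession, table: String): Unit = {
  // Hive's enforcement settings live in the Hadoop configuration.
  val hadoopConf = spark.sessionState.newHadoopConf()
  val enforce = hadoopConf.getBoolean("hive.enforce.bucketing", true) ||
    hadoopConf.getBoolean("hive.enforce.sorting", true)

  // Ask the session catalog (backed by the metastore) for the table's bucket spec.
  val bucketSpec = spark.sessionState.catalog
    .getTableMetadata(TableIdentifier(table)).bucketSpec

  if (bucketSpec.isDefined && enforce) {
    sys.error(s"Output Hive table `$table` is bucketed but this write path " +
      "does not produce Hive-compatible bucketed output.")
  }
}

// Example: would fail fast for hivebucketparq1 before any data is written.
refuseIfBucketed(spark, "hivebucketparq1")
{noformat}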

 

 

> Built-in parquet code path does not respect hive.enforce.bucketing
> ------------------------------------------------------------------
>
>                 Key: SPARK-27498
>                 URL: https://issues.apache.org/jira/browse/SPARK-27498
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>


