[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

Suchintak Patnaik (Jira) Tue, 24 Sep 2019 11:59:09 -0700


     [ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Suchintak Patnaik updated SPARK-29234:
--------------------------------------
    Description: 
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. Physically, however, the files are created in HDFS 
in the format specified by the user, e.g. ORC or Parquet.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

In Hive, DESCRIBE FORMATTED OrdersExample shows:

describe formatted ordersExample;
OK
# col_name              data_type               comment
col                     array<string>           from deserializer

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc not a SequenceFile
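
The mismatch described here (metadata says SequenceFile, data files are ORC) can be checked mechanically. What follows is a minimal Python sketch and not part of the original report. It parses the Storage Information lines of a DESCRIBE FORMATTED dump and compares the declared InputFormat with the suffix of a data file. The helper names (parse_storage_info, format_mismatch) and the SUFFIX_TO_FORMAT mapping are hypothetical, and the substring comparison is only a heuristic.

```python
import re

# Hypothetical mapping from data-file suffix to the format it implies.
SUFFIX_TO_FORMAT = {".orc": "orc", ".parquet": "parquet"}

# Storage section as it appears in the report (OutputFormat value is wrapped).
SAMPLE = """\
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
"""


def parse_storage_info(describe_output):
    """Collect 'Key: value' pairs; a value wrapped onto the next line
    is attached to the key that precedes it."""
    info = {}
    pending_key = None
    for raw in describe_output.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            pending_key = None
            continue
        match = re.match(r"^([A-Za-z ]+):\s*(.*)$", line)
        if match:
            key, value = match.group(1).strip(), match.group(2).strip()
            info[key] = value
            # Remember the key only if its value is still missing.
            pending_key = key if not value else None
        elif pending_key is not None:
            info[pending_key] = line
            pending_key = None
    return info


def format_mismatch(input_format, data_file):
    """Return the format implied by the file suffix when the declared
    InputFormat class name does not mention it; otherwise return None."""
    for suffix, fmt in SUFFIX_TO_FORMAT.items():
        if data_file.endswith(suffix):
            return None if fmt in input_format.lower() else fmt
    return None


info = parse_storage_info(SAMPLE)
data_file = "part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc"
print(info["InputFormat"], "vs file format:", format_mismatch(info["InputFormat"], data_file))
```

A non-None result means the files on disk do not match what the metastore entry declares, which is the condition behind the "not a SequenceFile" failure above.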

Reading the same table in Spark also gives an error.

df = spark.


  was:
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. Physically, however, the files are created in HDFS 
in the format specified by the user, e.g. ORC or Parquet.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

In Hive, DESCRIBE FORMATTED OrdersExample shows:

describe formatted ordersExample;
OK
# col_name              data_type               comment
col                     array<string>           from deserializer

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc not a SequenceFile



> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29234
>                 URL: https://issues.apache.org/jira/browse/SPARK-29234
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Major


