Hi Sadha, I have solved this problem. In my case it was caused by different compression codecs between Hive and Spark: Hive uses ZLIB as its default ORC compression codec, while Spark uses SNAPPY. Once I used the same codec on both sides, the final table files produced by Spark SQL and Hive SQL were almost the same size.
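If it helps, the codec can be pinned explicitly on both sides. A minimal sketch, assuming the config names as they exist in Spark 2.3+ and Hive 2.x (whether the Spark session setting takes effect for a Hive-serde table can also depend on settings such as spark.sql.hive.convertMetastoreOrc, so verify on your cluster):

```sql
-- Spark side: make Spark's ORC writer use ZLIB to match Hive's default.
-- The default value of spark.sql.orc.compression.codec is snappy.
SET spark.sql.orc.compression.codec=zlib;

-- Hive side: ZLIB is already the default, but it can be set explicitly:
SET hive.exec.orc.default.compress=ZLIB;
```

With both engines writing ZLIB, the INSERT OVERWRITE should produce comparable partition sizes.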
In addition, compressing the same data with the same ORC codec should not produce much deviation in final file size, so it may be worth checking which codec each engine is actually using.

| 13289341606 | 13289341...@163.com |

---- Replied Message ----
From: Sadha Chilukoori <sage.quoti...@gmail.com>
Date: 10/12/2022 08:27
To: Chartist <13289341...@163.com>
Cc: <user@spark.apache.org>
Subject: Re: Why the same INSERT OVERWRITE sql, final table file produced by spark sql is larger than hive sql?

I have faced the same problem, where both the Hive and Spark ORC files were using Snappy compression (Hive 2.1, Spark 2.4.8). I'm curious to learn what the root cause of this could be.

-S

On Tue, Oct 11, 2022, 2:18 AM Chartist <13289341...@163.com> wrote:

Hi all,

I encountered the problem described in the e-mail subject. The details follow:

SQL:

insert overwrite table mytable partition(pt='20220518')
select
  guid,
  user_new_id,
  sum_credit_score,
  sum_credit_score_change,
  platform_credit_score_change,
  bike_credit_score_change,
  evbike_credit_score_change,
  car_credit_score_change,
  slogan_type,
  bz_date
from mytable
where pt = '20220518';

mytable DDL:

CREATE TABLE `mytable`(
  `guid` string COMMENT 'xxx',
  `user_new_id` bigint COMMENT 'xxx',
  `sum_credit_score` bigint COMMENT 'xxx',
  `sum_credit_score_change` bigint COMMENT 'xxx',
  `platform_credit_score_change` bigint COMMENT 'xxx',
  `bike_credit_score_change` bigint COMMENT 'xxx',
  `evbike_credit_score_change` bigint COMMENT 'xxx',
  `car_credit_score_change` bigint COMMENT 'xxx',
  `slogan_type` bigint COMMENT 'xxx',
  `bz_date` string COMMENT 'xxx')
PARTITIONED BY (
  `pt` string COMMENT 'increment_partition')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
TBLPROPERTIES (
  'spark.sql.create.version'='2.2 or prior',
  'spark.sql.sources.schema.numPartCols'='1',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx',
  'spark.sql.sources.schema.partCol.0'='pt',
  'transient_lastDdlTime'='1653484849')

ENV:
hive version: 2.1.1
spark version: 2.4.4

hadoop fs -du -h result:
[hive sql]:  735.2 M  /user/hive/warehouse/mytable/pt=20220518
[spark sql]: 1.1 G    /user/hive/warehouse/mytable/pt=20220518

How could this happen? Could it be caused by a different ORC version? Any replies appreciated.
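One way to make the codec independent of per-engine session defaults is to record it on the table itself via the standard ORC table property 'orc.compress', which the ORC writers in both engines generally honor. A hedged sketch against the table from this thread (verify the behavior on your versions, since Spark's handling of Hive-serde ORC tables has varied across releases):

```sql
-- Pin the ORC compression codec on the table; 'orc.compress' accepts
-- values such as ZLIB, SNAPPY, and NONE.
ALTER TABLE mytable SET TBLPROPERTIES ('orc.compress'='ZLIB');

-- Confirm the property is recorded in the metastore:
SHOW TBLPROPERTIES mytable;
```

After that, rerunning the INSERT OVERWRITE from either engine should produce files with the same codec, making the size comparison meaningful.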