Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql？

Chartist Tue, 11 Oct 2022 02:18:19 -0700


Hi，All



I encountered  a problem as the e-mail subject described. And the followings 
are the details:


SQL:
insert overwrite table mytable partition(pt='20220518') 

select guid, user_new_id, sum_credit_score, sum_credit_score_change, 
platform_credit_score_change, bike_credit_score_change, 
evbike_credit_score_change, car_credit_score_change, slogan_type, bz_date
from mytable where pt = '20220518’;


mytable DDL:
CREATE TABLE `mytable`(
 `guid` string COMMENT 'xxx',
 `user_new_id` bigint COMMENT 'xxx',
 `sum_credit_score` bigint COMMENT 'xxx',
 `sum_credit_score_change` bigint COMMENT 'xxx',
 `platform_credit_score_change` bigint COMMENT 'xxx',
 `bike_credit_score_change` bigint COMMENT 'xxx',
 `evbike_credit_score_change` bigint COMMENT 'xxx',
 `car_credit_score_change` bigint COMMENT 'xxx',
 `slogan_type` bigint COMMENT 'xxx',
 `bz_date` string COMMENT 'xxx')
PARTITIONED BY (
 `pt` string COMMENT 'increment_partition')
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
TBLPROPERTIES (
 'spark.sql.create.version'='2.2 or prior',
 'spark.sql.sources.schema.numPartCols'='1',
 'spark.sql.sources.schema.numParts'='1',
 'spark.sql.sources.schema.part.0'=‘xxx SOME OMITTED CONTENT xxx',
 'spark.sql.sources.schema.partCol.0'='pt',
 'transient_lastDdlTime'='1653484849’)


ENV：
hive version 2.1.1
spark version 2.4.4


hadoop fs -du -h Result：
[hive sql]: 
735.2 M  /user/hive/warehouse/mytable/pt=20220518
[spark sql]: 
1.1 G  /user/hive/warehouse/mytable/pt=20220518


How could this happened? And if this is caused by the different version of orc? 
Any replies appreciated.


| |
13289341606
|
|
13289341...@163.com
|

Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql？

Reply via email to