Hi Sadha,
I have solved this problem. In my case it was caused by different ORC
compression codecs between Hive and Spark: Hive takes ZLIB as the default ORC
compression codec, while Spark takes SNAPPY. Once I used the same codec for
both, the final table files produced by Spark SQL and Hive SQL were almost
the same size.
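For reference, here is roughly how I would align the codec on both sides
(session-level settings; the table-property route is an alternative that
should be respected too, but please verify on your versions):

-- Hive side: ZLIB is already the default, set explicitly for clarity
SET hive.exec.orc.default.compress=ZLIB;

-- Spark side (2.3+): write ORC with ZLIB instead of the SNAPPY default
SET spark.sql.orc.compression.codec=zlib;

-- Or pin the codec on the table itself via the ORC table property
ALTER TABLE mytable SET TBLPROPERTIES ('orc.compress'='ZLIB');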
In addition, compressing the same ORC data with the same codec should not
produce much deviation in file size, so you may also want to check which
compression codec each engine is actually using.
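To confirm which codec an existing file was actually written with, the ORC
dump tool prints it in the file header (the path below is just your partition
directory plus one of its files; the file name is a placeholder):

# header output includes a line like "Compression: ZLIB" or "Compression: SNAPPY"
hive --orcfiledump /user/hive/warehouse/mytable/pt=20220518/<one-orc-file>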
13289341606
13289341...@163.com
---- Replied Message ----
From: Sadha Chilukoori
Date: 10/12/2022 08:27
To: Chartist <13289341...@163.com>
Subject: Re: Why the same INSERT OVERWRITE sql, final table file produced
by spark sql is larger than hive sql?
I have faced the same problem, where both Hive and Spark ORC were using
Snappy compression.
Hive 2.1
Spark 2.4.8
I'm curious to learn what could be the root cause of this.
-S
On Tue, Oct 11, 2022, 2:18 AM Chartist <13289341...@163.com> wrote:
Hi All,
I encountered the problem described in the e-mail subject. The details
follow:
SQL:
insert overwrite table mytable partition(pt='20220518')
select guid, user_new_id, sum_credit_score, sum_credit_score_change,
platform_credit_score_change, bike_credit_score_change,
evbike_credit_score_change, car_credit_score_change, slogan_type, bz_date
from mytable where pt = '20220518';
mytable DDL:
CREATE TABLE `mytable`(
`guid` string COMMENT 'xxx',
`user_new_id` bigint COMMENT 'xxx',
`sum_credit_score` bigint COMMENT 'xxx',
`sum_credit_score_change` bigint COMMENT 'xxx',
`platform_credit_score_change` bigint COMMENT 'xxx',
`bike_credit_score_change` bigint COMMENT 'xxx',
`evbike_credit_score_change` bigint COMMENT 'xxx',
`car_credit_score_change` bigint COMMENT 'xxx',
`slogan_type` bigint COMMENT 'xxx',
`bz_date` string COMMENT 'xxx')
PARTITIONED BY (
`pt` string COMMENT 'increment_partition')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
TBLPROPERTIES (
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx',
'spark.sql.sources.schema.partCol.0'='pt',
'transient_lastDdlTime'='1653484849')
ENV:
hive version 2.1.1
spark version 2.4.4
hadoop fs -du -h Result:
[hive sql]:
735.2 M /user/hive/warehouse/mytable/pt=20220518
[spark sql]:
1.1 G /user/hive/warehouse/mytable/pt=20220518
How could this happen? Could it be caused by different ORC versions between
Hive and Spark?
Any replies appreciated.
13289341606
13289341...@163.com