Re: Why does the same INSERT OVERWRITE SQL produce a larger final table file with Spark SQL than with Hive SQL?

2022-10-12 Thread Chartist


Hi Sadha,
I have solved this problem. In my case it was caused by the different
compression codecs used by Hive and Spark: Hive takes ZLIB as the default
ORC compression codec, while Spark takes SNAPPY. Once I used the same
codec on both sides, the final table files produced by Spark SQL and Hive
SQL were almost the same size.


In addition, compressing the same ORC data with the same codec should not
produce much deviation in file size, so you may want to check your
compression codec as well.
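
For reference, a minimal sketch of aligning the codec on both sides
(assuming an ORC table; 'orc.compress' is the standard ORC table property
and 'spark.sql.orc.compression.codec' is the Spark session setting; which
one takes effect depends on whether Spark writes through its native ORC
writer or the Hive serde, so setting both is the safe option):

-- Hive writes ORC with ZLIB by default; pin it explicitly on the table
ALTER TABLE mytable SET TBLPROPERTIES ('orc.compress'='ZLIB');

-- Spark 2.x defaults to snappy for ORC; switch it to match Hive
SET spark.sql.orc.compression.codec=zlib;

With the codecs aligned, rerunning the same INSERT OVERWRITE from both
engines should produce partitions of comparable size.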

13289341606
13289341...@163.com


---- Replied Message ----
From: Sadha Chilukoori
Date: 10/12/2022 08:27
To: Chartist <13289341...@163.com>
Subject: Re: Why does the same INSERT OVERWRITE SQL produce a larger final
table file with Spark SQL than with Hive SQL?
I have faced the same problem, where both the Hive and Spark ORC outputs
were using Snappy compression.


Hive 2.1
Spark 2.4.8


I'm curious to learn what could be the root cause of this.


-S
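
A quick way to confirm which codec each engine actually wrote is to dump
the ORC metadata of one output file per run (the file name below is
hypothetical; --orcfiledump is Hive's built-in ORC inspection tool):

hive --orcfiledump /user/hive/warehouse/mytable/pt=20220518/000000_0

The output includes a "Compression:" line, e.g. "Compression: SNAPPY".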





Why does the same INSERT OVERWRITE SQL produce a larger final table file with Spark SQL than with Hive SQL?

2022-10-11 Thread Chartist


Hi All,


I encountered a problem as described in the subject line. Here are the
details:


SQL:
insert overwrite table mytable partition(pt='20220518')
select guid, user_new_id, sum_credit_score, sum_credit_score_change,
platform_credit_score_change, bike_credit_score_change,
evbike_credit_score_change, car_credit_score_change, slogan_type, bz_date
from mytable where pt = '20220518';


mytable DDL:
CREATE TABLE `mytable`(
 `guid` string COMMENT 'xxx',
 `user_new_id` bigint COMMENT 'xxx',
 `sum_credit_score` bigint COMMENT 'xxx',
 `sum_credit_score_change` bigint COMMENT 'xxx',
 `platform_credit_score_change` bigint COMMENT 'xxx',
 `bike_credit_score_change` bigint COMMENT 'xxx',
 `evbike_credit_score_change` bigint COMMENT 'xxx',
 `car_credit_score_change` bigint COMMENT 'xxx',
 `slogan_type` bigint COMMENT 'xxx',
 `bz_date` string COMMENT 'xxx')
PARTITIONED BY (
 `pt` string COMMENT 'increment_partition')
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
TBLPROPERTIES (
 'spark.sql.create.version'='2.2 or prior',
 'spark.sql.sources.schema.numPartCols'='1',
 'spark.sql.sources.schema.numParts'='1',
 'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx',
 'spark.sql.sources.schema.partCol.0'='pt',
 'transient_lastDdlTime'='1653484849')
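
Worth noting: the DDL above sets no 'orc.compress' table property, so each
engine falls back on its own default codec, which is what the reply above
identifies as the cause. The effective Spark-side value can be checked
from a spark-sql session (snappy is the documented default in Spark 2.x):

SET spark.sql.orc.compression.codec;
-- prints the current value of the setting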


ENV:
Hive version 2.1.1
Spark version 2.4.4


hadoop fs -du -h Result:
[hive sql]: 
735.2 M  /user/hive/warehouse/mytable/pt=20220518
[spark sql]: 
1.1 G  /user/hive/warehouse/mytable/pt=20220518
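
A per-file listing of the partition directory also helps here: if the gap
is spread evenly across files, that points at a table-wide setting such as
the compression codec rather than a single skewed writer:

hadoop fs -ls -h /user/hive/warehouse/mytable/pt=20220518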


How could this happen? Could it be caused by a different ORC version?
Any replies appreciated.


13289341606
13289341...@163.com