Hi All:
   I have a question about the repartition API and Spark SQL table partitioning. I have a table whose partition key is day:
```
./bin/spark-sql -e "CREATE TABLE t_original_partitioned_spark (cust_id int, loss double) PARTITIONED BY (day STRING) LOCATION 'hdfs://localhost:9000/t_original_partitioned_spark'"
```
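As a sanity check on the table definition, I confirm that day is the partition column like this (just a sketch; the exact output layout varies by Spark version):
```
sqlContext.sql("DESCRIBE FORMATTED t_original_partitioned_spark").show(100, false)
// the "# Partition Information" section lists: day (string)
```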
I insert several rows, and now there are 2 partitions for the two days (2019-05-30 and 2019-05-20):
```
sqlContext.sql("insert into t_original_partitioned_spark values (30, 0.3, '2019-05-30')")
sqlContext.sql("insert into t_original_partitioned_spark values (20, 0.2, '2019-05-20')")
```
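After these two inserts, SHOW PARTITIONS confirms the two day partitions (a sketch of what I see; output formatting differs across versions):
```
sqlContext.sql("SHOW PARTITIONS default.t_original_partitioned_spark").show()
// day=2019-05-20
// day=2019-05-30
```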


Now I want to repartition the data into 1 partition, because in the actual case there may be too many partitions and I want to end up with fewer.


I call the repartition API and overwrite the table. I expected there to be 1 partition afterwards, but there are still two partitions when I query with "show partitions default.t_original_partitioned_spark":
```
val df = sqlContext.sql("select * from t_original_partitioned_spark")
val df1 = df.repartition(1)
df1.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("seq").insertInto("default.t_original_partitioned_spark")
```
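To be precise about which kind of partition I am counting, here is how I inspect both (a sketch: getNumPartitions reports the DataFrame/RDD partition count that repartition(1) controls, while SHOW PARTITIONS lists the Hive table partitions):
```
// DataFrame/RDD partitions after repartition(1) -- this prints 1
println(df1.rdd.getNumPartitions)

// Hive table partitions -- still day=2019-05-20 and day=2019-05-30
sqlContext.sql("SHOW PARTITIONS default.t_original_partitioned_spark").show()
```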


My question is: when I use both, is the actual number of partitions decided by the num in repartition($num), or by the Hive table's partition column values?
Best Regards
Kelly Zhang
