You can do this in two passes (not one):
A) Save your dataset to HDFS with the partitioning you have now.
B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size).
Then run a simple Spark job to read the data back and repartition it based on 'n'.
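
As a rough sketch of step B, the calculation could look like the following. The helper name partitions_for and the 128 MiB default are assumptions for illustration (128 MiB is a common dfs.blocksize default; check what your cluster actually uses). It rounds up so that a remainder still gets its own partition.

```python
# Assumption: 128 MiB block size, a common HDFS default (dfs.blocksize).
# Replace it with the value configured on your cluster.
DEFAULT_BLOCK_BYTES = 128 * 1024 * 1024


def partitions_for(dataset_bytes, block_bytes=DEFAULT_BLOCK_BYTES):
    """Return n = ceil(dataset size / block size), never less than 1.

    partitions_for is a hypothetical helper name, not a Spark API.
    """
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    # Integer ceiling division avoids float rounding on very large sizes.
    n = -(-dataset_bytes // block_bytes)
    return max(1, n)


# Example: a dataset that occupies 10 GiB on HDFS after the first pass.
n = partitions_for(10 * 1024**3)
print(n)  # 80
```

The dataset size here should be the on-disk size written in pass A (for example, as reported by hdfs dfs -du -s on the output directory), since compressed columnar output can be much smaller than the in-memory data. In the second pass you would read the data back and call repartition(n) on the DataFrame before writing it out.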

Hichame

From: felixcheun...@hotmail.com
Sent: January 19, 2019 2:06 PM
To: 28shivamsha...@gmail.com; user@spark.apache.org
Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size.


You can call coalesce to combine partitions.
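
A minimal sketch of how that might be wrapped, assuming a PySpark DataFrame is passed in. The function name with_partitions is hypothetical; it only uses df.rdd.getNumPartitions(), df.coalesce() and df.repartition(). coalesce can only reduce the partition count (and does so without a full shuffle), so the sketch falls back to repartition when more partitions are requested.

```python
def with_partitions(df, target):
    """Return df with `target` partitions, avoiding a full shuffle when possible.

    with_partitions is a hypothetical helper; `df` is assumed to be a
    PySpark DataFrame (or anything exposing the same three methods).
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    current = df.rdd.getNumPartitions()
    if target == current:
        return df  # nothing to do
    if target < current:
        # coalesce merges existing partitions without a full shuffle.
        return df.coalesce(target)
    # Increasing the count needs a shuffle, which repartition performs.
    return df.repartition(target)
```

With 400 shuffle partitions and a target of, say, 80, this would take the coalesce branch, so the write produces 80 files instead of 400. Note that coalesce can leave partitions uneven in size; repartition gives a more even split at the cost of a shuffle.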


________________________________
From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size.

Hi All,

I want to persist a DataFrame to HDFS. Basically, I am inserting data into a 
Hive table using Spark. Currently, when writing to the Hive table, I have 
set the total shuffle partitions to 400, so 400 files are created in total, 
which does not take the HDFS block size into account. How can I tell Spark to 
persist according to the HDFS block size?

We have something like this in Hive, which solves this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=2048000000;
set hive.merge.size.per.task=4096000000;

Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:-https://www.linkedin.com/in/28shivamsharma
