Hi Shivam,

In the end, a file only takes up its actual size, regardless of the block size. So if your file is just a few kilobytes, it will only occupy those few kilobytes. However, I've noticed that while a file is being written, a block is allocated and the NameNode considers the whole block size to be in use. I ran into this when writing an over-partitioned dataset! But as soon as the file was fully written, the NameNode picked up its true size and dropped the "default block size".
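If you want to check for yourself, here is a minimal sketch (assuming a spark-shell with a `spark` session; "/warehouse/my_table" is just a placeholder path) that compares the actual length of each file with the configured HDFS block size:

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Minimal sketch, assuming a spark-shell session named `spark`;
  // "/warehouse/my_table" is a placeholder path.
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val blockSize = fs.getDefaultBlockSize(new Path("/"))

  fs.listStatus(new Path("/warehouse/my_table"))
    .filter(_.isFile)
    .foreach { st =>
      // getLen is the real on-disk size, independent of the block size
      println(s"${st.getPath.getName}: length=${st.getLen} bytes, blockSize=$blockSize bytes")
    }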
Arnaud

On Mon, Jan 21, 2019 at 9:01 AM Shivam Sharma <28shivamsha...@gmail.com> wrote:

> Don't we have any property for it?
>
> One more quick question: if files created by Spark are smaller than the HDFS
> block size, will the rest of the block space become unavailable and remain
> unutilized, or will it be shared with other files?
>
> On Mon, Jan 21, 2019 at 1:30 PM Shivam Sharma <28shivamsha...@gmail.com> wrote:
>
>> Don't we have any property for it?
>>
>> One more quick question: if files created by Spark are smaller than the HDFS
>> block size, will the rest of the block space become unavailable and remain
>> unutilized, or will it be shared with other files?
>>
>> On Sun, Jan 20, 2019 at 12:47 AM Hichame El Khalfi <hich...@elkhalfi.com> wrote:
>>
>>> You can do this in 2 passes (not one):
>>> A) Save your dataset into HDFS as you have it.
>>> B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size).
>>> Then run a simple Spark job to read it back and repartition based on 'n'.
>>>
>>> Hichame
>>>
>>> *From:* felixcheun...@hotmail.com
>>> *Sent:* January 19, 2019 2:06 PM
>>> *To:* 28shivamsha...@gmail.com; user@spark.apache.org
>>> *Subject:* Re: Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> You can call coalesce to combine partitions.
>>>
>>> ------------------------------
>>> *From:* Shivam Sharma <28shivamsha...@gmail.com>
>>> *Sent:* Saturday, January 19, 2019 7:43 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> Hi All,
>>>
>>> I want to persist a dataframe to HDFS. Basically, I am inserting data
>>> into a HIVE table using Spark. Currently, at the time of writing to the HIVE
>>> table I have set total shuffle partitions = 400, so 400 files are being
>>> created without any regard for the HDFS block size. How can I tell Spark
>>> to persist according to HDFS blocks?
>>>
>>> We have something like this in HIVE which solves the problem:
>>>
>>> set hive.merge.sparkfiles=true;
>>> set hive.merge.smallfiles.avgsize=2048000000;
>>> set hive.merge.size.per.task=4096000000;
>>>
>>> Thanks
>>>
>>> --
>>> Shivam Sharma
>>> Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
>>> Mobile No- (+91) 8882114744
>>> Email:- 28shivamsha...@gmail.com
>>> LinkedIn:- https://www.linkedin.com/in/28shivamsharma
>>
>> --
>> Shivam Sharma
>> Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
>> Mobile No- (+91) 8882114744
>> Email:- 28shivamsha...@gmail.com
>> LinkedIn:- https://www.linkedin.com/in/28shivamsharma
>
> --
> Shivam Sharma
> Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
> Mobile No- (+91) 8882114744
> Email:- 28shivamsha...@gmail.com
> LinkedIn:- https://www.linkedin.com/in/28shivamsharma
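PS: for the original question, a rough sketch of the two-pass approach Hichame describes above (assuming a `spark` session, a dataframe `df`, and placeholder paths; not a definitive recipe):

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Pass 1: write the dataset as-is to a staging location (placeholder path).
  df.write.mode("overwrite").parquet("/tmp/my_table_stage")

  // Pass 2: measure the written size, derive a partition count from the
  // HDFS block size, then rewrite with that many partitions.
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val stagePath = new Path("/tmp/my_table_stage")
  val totalBytes = fs.getContentSummary(stagePath).getLength
  val blockSize = fs.getDefaultBlockSize(stagePath)
  val n = math.max(1, (totalBytes.toDouble / blockSize).ceil.toInt)

  spark.read.parquet("/tmp/my_table_stage")
    .repartition(n)
    .write.mode("overwrite").parquet("/warehouse/my_table")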