Thanks, Arnaud.

On Mon, Jan 21, 2019 at 2:07 PM Arnaud LARROQUE <alarro...@gmail.com> wrote:

> Hi Shivam,
>
> In the end, a file takes up only its own size, regardless of the block
> size. So if your file is just a few KB, it will occupy only those few KB.
> I have noticed, though, that while a file is being written, a block is
> allocated and the Namenode counts the whole block size as used. I ran
> into this when writing a dataset with far too many partitions! But as
> soon as the file is fully written, the Namenode knows its true size and
> drops the "default block size" accounting.
>
> Arnaud
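If useful, Arnaud's point can be checked directly with the Hadoop FileSystem
API. A minimal Scala sketch (the file path is a hypothetical placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs     = FileSystem.get(new Configuration())
    // hypothetical small file written by Spark
    val status = fs.getFileStatus(new Path("/tmp/small-output/part-00000.parquet"))

    // getLen is the file's real size; getBlockSize is only a per-block upper
    // bound, so a 2 KB file in a 128 MB block still consumes ~2 KB on disk.
    println(s"length     = ${status.getLen} bytes")
    println(s"block size = ${status.getBlockSize} bytes")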
> On Mon, Jan 21, 2019 at 9:01 AM Shivam Sharma <28shivamsha...@gmail.com> wrote:
>
>> Don't we have any property for it?
>>
>> One more quick question: if a file created by Spark is smaller than the
>> HDFS block size, does the rest of the block become unavailable and
>> remain unutilized, or is it shared with other files?
>>
>> On Sun, Jan 20, 2019 at 12:47 AM Hichame El Khalfi <hich...@elkhalfi.com> wrote:
>>
>>> You can do this in 2 passes (not one):
>>> A) Save your dataset into HDFS with what you have.
>>> B) Calculate the number of partitions, n = (size of your dataset) /
>>> (HDFS block size), then run a simple Spark job to read the data and
>>> repartition it based on n.
>>>
>>> Hichame
>>>
>>> From: felixcheun...@hotmail.com
>>> Sent: January 19, 2019 2:06 PM
>>> To: 28shivamsha...@gmail.com; user@spark.apache.org
>>> Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> You can call coalesce to combine partitions.
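Putting Hichame's two passes and Felix's coalesce together, a minimal sketch
(the paths, the Parquet format, and the block size lookup are assumptions,
not details from the thread):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("resize-to-hdfs-blocks").getOrCreate()

    // pass A already wrote the dataset here (hypothetical path and format)
    val src = "/data/mydataset"

    // n = dataset size / HDFS block size, rounded up, at least 1
    val fs         = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(new Path(src)).getLength
    val blockSize  = fs.getDefaultBlockSize(new Path(src))  // e.g. 134217728 (128 MB)
    val n          = math.max(1, math.ceil(totalBytes.toDouble / blockSize).toInt)

    // pass B: coalesce avoids a shuffle when reducing the partition count;
    // use repartition(n) instead if the resulting files come out badly skewed.
    spark.read.parquet(src)
      .coalesce(n)
      .write.mode("overwrite")
      .parquet("/data/mydataset_blocksized")  // hypothetical second location

Note that pass B writes to a new path: reading from and overwriting the same
path in one job fails in Spark.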
>>> ------------------------------
>>> From: Shivam Sharma <28shivamsha...@gmail.com>
>>> Sent: Saturday, January 19, 2019 7:43 AM
>>> To: user@spark.apache.org
>>> Subject: Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> Hi All,
>>>
>>> I want to persist a dataframe to HDFS. Basically, I am inserting data
>>> into a Hive table using Spark. Currently, when writing to the Hive
>>> table, I have set total shuffle partitions = 400, so 400 files are
>>> created, without any regard for the HDFS block size. How can I tell
>>> Spark to persist according to HDFS blocks?
>>>
>>> In Hive we have something like this, which solves the problem there:
>>>
>>>     set hive.merge.sparkfiles=true;
>>>     set hive.merge.smallfiles.avgsize=2048000000;
>>>     set hive.merge.size.per.task=4096000000;
>>>
>>> Thanks
>>>
>>> --
>>> Shivam Sharma
>>> Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
>>> Mobile No- (+91) 8882114744
>>> Email:- 28shivamsha...@gmail.com
>>> LinkedIn:- https://www.linkedin.com/in/28shivamsharma

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:- https://www.linkedin.com/in/28shivamsharma
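As far as I know, Spark 2.x has no single property equivalent to these
hive.merge.* settings, which is why the answers above fall back on
coalesce/repartition. A sketch of that workaround, with hypothetical table
names and partition count:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Instead of letting spark.sql.shuffle.partitions (400 in Shivam's case)
    // dictate the file count, size the output explicitly before the insert.
    val df = spark.table("staging_db.events")  // hypothetical source table
    df.repartition(8)                          // n, roughly total size / block size
      .write
      .insertInto("prod_db.events")            // hypothetical target Hive table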