Thanks, Arnaud.

On Mon, Jan 21, 2019 at 2:07 PM Arnaud LARROQUE <alarro...@gmail.com> wrote:

> Hi Shivam,
>
> In the end, a file takes up only its own size, regardless of the block
> size. So if your file is just a few KB, it will occupy only those few KB.
> I have noticed, though, that while a file is being written, a block is
> allocated and the Namenode counts the whole block size as used. I ran
> into this when writing a dataset with far too many partitions! But as
> soon as the file is fully written, the Namenode knows its true size and
> drops the "default block size" accounting.
>
> Arnaud
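If useful, Arnaud's point can be checked directly with the Hadoop FileSystem
API. A minimal Scala sketch (the file path is a hypothetical placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs     = FileSystem.get(new Configuration())
    // hypothetical small file written by Spark
    val status = fs.getFileStatus(new Path("/tmp/small-output/part-00000.parquet"))

    // getLen is the file's real size; getBlockSize is only a per-block upper
    // bound, so a 2 KB file in a 128 MB block still consumes ~2 KB on disk.
    println(s"length     = ${status.getLen} bytes")
    println(s"block size = ${status.getBlockSize} bytes")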
> On Mon, Jan 21, 2019 at 9:01 AM Shivam Sharma <28shivamsha...@gmail.com> wrote:
>
>> Don't we have any property for it?
>>
>> One more quick question: if a file created by Spark is smaller than the
>> HDFS block size, does the rest of the block become unavailable and
>> remain unutilized, or is it shared with other files?
>>
>> On Sun, Jan 20, 2019 at 12:47 AM Hichame El Khalfi <hich...@elkhalfi.com> wrote:
>>
>>> You can do this in 2 passes (not one):
>>> A) Save your dataset into HDFS with what you have.
>>> B) Calculate the number of partitions, n = (size of your dataset) /
>>> (HDFS block size), then run a simple Spark job to read the data and
>>> repartition it based on n.
>>>
>>> Hichame
>>>
>>> From: felixcheun...@hotmail.com
>>> Sent: January 19, 2019 2:06 PM
>>> To: 28shivamsha...@gmail.com; user@spark.apache.org
>>> Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> You can call coalesce to combine partitions.
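Putting Hichame's two passes and Felix's coalesce together, a minimal sketch
(the paths, the Parquet format, and the block size lookup are assumptions,
not details from the thread):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("resize-to-hdfs-blocks").getOrCreate()

    // pass A already wrote the dataset here (hypothetical path and format)
    val src = "/data/mydataset"

    // n = dataset size / HDFS block size, rounded up, at least 1
    val fs         = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(new Path(src)).getLength
    val blockSize  = fs.getDefaultBlockSize(new Path(src))  // e.g. 134217728 (128 MB)
    val n          = math.max(1, math.ceil(totalBytes.toDouble / blockSize).toInt)

    // pass B: coalesce avoids a shuffle when reducing the partition count;
    // use repartition(n) instead if the resulting files come out badly skewed.
    spark.read.parquet(src)
      .coalesce(n)
      .write.mode("overwrite")
      .parquet("/data/mydataset_blocksized")  // hypothetical second location

Note that pass B writes to a new path: reading from and overwriting the same
path in one job fails in Spark.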
>>> ------------------------------
>>> From: Shivam Sharma <28shivamsha...@gmail.com>
>>> Sent: Saturday, January 19, 2019 7:43 AM
>>> To: user@spark.apache.org
>>> Subject: Persist Dataframe to HDFS considering HDFS Block Size.
>>>
>>> Hi All,
>>>
>>> I want to persist a dataframe to HDFS. Basically, I am inserting data
>>> into a Hive table using Spark. Currently, when writing to the Hive
>>> table, I have set total shuffle partitions = 400, so 400 files are
>>> created, without any regard for the HDFS block size. How can I tell
>>> Spark to persist according to HDFS blocks?
>>>
>>> In Hive we have something like this, which solves the problem there:
>>>
>>>     set hive.merge.sparkfiles=true;
>>>     set hive.merge.smallfiles.avgsize=2048000000;
>>>     set hive.merge.size.per.task=4096000000;
>>>
>>> Thanks
>>>
>>> --
>>> Shivam Sharma
>>> Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
>>> Mobile No- (+91) 8882114744
>>> Email:- 28shivamsha...@gmail.com
>>> LinkedIn:- https://www.linkedin.com/in/28shivamsharma

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:- https://www.linkedin.com/in/28shivamsharma
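As far as I know, Spark 2.x has no single property equivalent to these
hive.merge.* settings, which is why the answers above fall back on
coalesce/repartition. A sketch of that workaround, with hypothetical table
names and partition count:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Instead of letting spark.sql.shuffle.partitions (400 in Shivam's case)
    // dictate the file count, size the output explicitly before the insert.
    val df = spark.table("staging_db.events")  // hypothetical source table
    df.repartition(8)                          // n, roughly total size / block size
      .write
      .insertInto("prod_db.events")            // hypothetical target Hive table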