Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Shivam Sharma
Thanks, Arnaud.


-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn: https://www.linkedin.com/in/28shivamsharma


Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Arnaud LARROQUE
Hi Shivam,

In the end, the file only takes up its own space, regardless of the block size.
So if your file is just a few kilobytes, it will occupy only those few
kilobytes.
But I've noticed that while a file is being written, a full block is somehow
allocated and the NameNode considers the entire block size to be in use. I ran
into this when writing an over-partitioned dataset!
As soon as the file is fully written, though, the NameNode knows its true size
and drops the "default block size" accounting.
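
If you want to verify this on your own cluster, here is a minimal sketch using
the Hadoop FileSystem API (the path is just a placeholder) that compares a
directory's logical size with the space the NameNode reports as consumed:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockUsageCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // Placeholder path: point it at a directory your Spark job has written.
    val summary = fs.getContentSummary(new Path("/warehouse/my_table"))

    // getLength is the logical size of the data; getSpaceConsumed is what the
    // NameNode accounts for (it includes replication).
    println(s"Logical size (bytes): ${summary.getLength}")
    println(s"Space consumed (bytes, incl. replication): ${summary.getSpaceConsumed}")
  }
}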

Arnaud



Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Shivam Sharma
Don't we have a property for this?

One more quick question: if the files created by Spark are smaller than the
HDFS block size, does the rest of the block become unavailable and stay
unutilized, or is it shared with other files?



-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn: https://www.linkedin.com/in/28shivamsharma


Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Hichame El Khalfi
You can do this in two passes (not one):
A) Save your dataset to HDFS as you do now.
B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size),
then run a simple Spark job that reads the data back and repartitions it into n partitions.
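
A rough sketch of that second pass (the paths, the Parquet format, and the
128 MB block size are assumptions, adjust them to your cluster):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object CompactToBlockSize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-to-block-size").getOrCreate()

    // Placeholder paths for the first-pass output and the compacted copy.
    val inputPath  = "/warehouse/my_table_raw"
    val outputPath = "/warehouse/my_table_compacted"

    // B) n = (size of your dataset) / (HDFS block size)
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val datasetBytes   = fs.getContentSummary(new Path(inputPath)).getLength
    val blockSizeBytes = 128L * 1024 * 1024  // assumed 128 MB HDFS block size
    val n = math.max(1, (datasetBytes / blockSizeBytes).toInt)

    // Read the first-pass output back and rewrite it as n roughly block-sized files.
    spark.read.parquet(inputPath)
      .repartition(n)
      .write.mode("overwrite").parquet(outputPath)
  }
}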

Hichame

From: felixcheun...@hotmail.com
Sent: January 19, 2019 2:06 PM
To: 28shivamsha...@gmail.com; user@spark.apache.org
Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size.


You can call coalesce to combine partitions..



From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size.

Hi All,

I wanted to persist dataframe on HDFS. Basically, I am inserting data into a 
HIVE table using Spark. Currently, at the time of writing to HIVE table I have 
set total shuffle partitions = 400 so total 400 files are being created which 
is not even considering HDFS block size. How can I tell spark to persist 
according to HDFS Blocks.

We have something like this HIVE which solves this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=204800;
set hive.merge.size.per.task=409600;

Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com<mailto:28shivamsha...@gmail.com>
LinkedIn:-https://www.linkedin.com/in/28shivamsharma


Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Felix Cheung
You can call coalesce() to combine partitions.
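
A minimal sketch (the table names and the target file count are made up;
coalesce() only merges existing partitions, so it avoids a full shuffle but
cannot increase the partition count):

import org.apache.spark.sql.SparkSession

object CoalesceBeforeInsert {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-example")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder source query and target table.
    val df = spark.sql("SELECT * FROM staging.events")

    // Collapse the write into, say, 20 output files instead of 400.
    df.coalesce(20)
      .write
      .insertInto("prod.events")
  }
}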





Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Shivam Sharma
Hi All,

I want to persist a DataFrame to HDFS. Basically, I am inserting data into
a Hive table using Spark. Currently, at write time I have set the total
shuffle partitions to 400, so 400 files are being created, without any
regard for the HDFS block size. How can I tell Spark to persist the data
according to HDFS blocks?
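
For reference, a rough sketch of the kind of job I am running (the table names
and the aggregation are placeholders); the shuffle-partition setting is what
produces the 400 output files:

import org.apache.spark.sql.SparkSession

object InsertIntoHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insert-into-hive")
      .enableHiveSupport()
      // Each of these shuffle partitions ends up as one file in the target table.
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()

    // Placeholder query with a shuffle (the GROUP BY) and a placeholder target table.
    spark.sql("SELECT user_id, count(*) AS cnt FROM staging.events GROUP BY user_id")
      .write
      .insertInto("warehouse.events")
  }
}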

Hive has settings like these, which solve this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=204800;
set hive.merge.size.per.task=409600;

Thanks

-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn: https://www.linkedin.com/in/28shivamsharma