Re: Unsubscribe

2019-01-19 Thread Sherman Tong
Unsubscribe

Sent from Sherman's mobile device; sorry for being short.

> On Jan 19, 2019, at 2:02 PM, Aditya Gautam  wrote:
> 
> 


Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Hichame El Khalfi
You can do this in 2 passes (not one):
A) Save your dataset into HDFS with what you have.
B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size).
Then run a simple Spark job to read the data back and repartition it based on 'n', e.g. as sketched below.
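
A rough Scala sketch of the second pass (not from the thread; the paths, the Parquet format and the 128 MB block size are assumptions, so adjust them to your setup):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-to-block-size").getOrCreate()

// Hypothetical locations; replace with your own.
val inputPath  = "/data/my_dataset"
val outputPath = "/data/my_dataset_compacted"

// Size (in bytes) of the dataset written in pass A.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength

// n = dataset size / HDFS block size (128 MB assumed here;
// you can read dfs.blocksize from the Hadoop conf instead).
val blockSize = 128L * 1024 * 1024
val n = math.max(1, (totalBytes / blockSize).toInt)

// Pass B: read back and rewrite with roughly one file per HDFS block.
spark.read.parquet(inputPath)
  .repartition(n)
  .write
  .mode("overwrite")
  .parquet(outputPath)

Note that writing back to the original location would need a temporary directory, since Spark cannot read and overwrite the same path in one job.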

Hichame

From: felixcheun...@hotmail.com
Sent: January 19, 2019 2:06 PM
To: 28shivamsha...@gmail.com; user@spark.apache.org
Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size.


You can call coalesce to combine partitions.



From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size.

Hi All,

I want to persist a DataFrame on HDFS. Basically, I am inserting data into a 
Hive table using Spark. Currently, at the time of writing to the Hive table I have 
set the total shuffle partitions to 400, so 400 files are being created, which 
does not take the HDFS block size into account. How can I tell Spark to persist 
according to HDFS blocks?

We have settings like these in Hive which solve this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=204800;
set hive.merge.size.per.task=409600;

Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:-https://www.linkedin.com/in/28shivamsharma


Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Felix Cheung
You can call coalesce to combine partitions.
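
For instance, a minimal sketch (the table names and the target file count are made up; pick the count from data size / HDFS block size):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-example").enableHiveSupport().getOrCreate()

// Hypothetical source table and target file count.
val df = spark.table("my_db.source_table")
val targetFiles = 50

// coalesce() merges existing partitions without a full shuffle,
// so the write produces roughly targetFiles output files.
df.coalesce(targetFiles)
  .write
  .mode("overwrite")
  .saveAsTable("my_db.compacted_table")

coalesce() avoids a shuffle but can produce uneven file sizes; repartition(n) shuffles and gives more evenly sized files.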



From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size.

Hi All,

I want to persist a DataFrame on HDFS. Basically, I am inserting data into a 
Hive table using Spark. Currently, at the time of writing to the Hive table I have 
set the total shuffle partitions to 400, so 400 files are being created, which 
does not take the HDFS block size into account. How can I tell Spark to persist 
according to HDFS blocks?

We have settings like these in Hive which solve this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=204800;
set hive.merge.size.per.task=409600;

Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:-https://www.linkedin.com/in/28shivamsharma


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Felix Cheung
To clarify, YARN actually supports excluding nodes right when requesting 
resources. It's Spark that doesn't provide a way to populate such a blacklist.

If you can change the YARN config, the equivalent is node labels: 
https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
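
For example (a sketch, not from the thread): once the cluster admin has assigned a label to the allowed nodes, Spark on YARN can target it through its node-label expression properties. The label name "spark_ok" is a placeholder, and in practice these are usually passed as --conf at submit time rather than set in code:

import org.apache.spark.sql.SparkSession

// Run the AM and executors only on nodes carrying the "spark_ok" label
// (hypothetical label; requires YARN node labels to be configured).
val spark = SparkSession.builder()
  .appName("run-on-labeled-nodes")
  .config("spark.yarn.am.nodeLabelExpression", "spark_ok")
  .config("spark.yarn.executor.nodeLabelExpression", "spark_ok")
  .getOrCreate()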




From: Li Gao 
Sent: Saturday, January 19, 2019 8:43 AM
To: Felix Cheung
Cc: Serega Sheypak; user
Subject: Re: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

On YARN it is impossible, AFAIK. On Kubernetes you can use taints to keep 
certain nodes off-limits to Spark.

On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
Not as far as I recall...



From: Serega Sheypak <serega.shey...@gmail.com>
Sent: Friday, January 18, 2019 3:21 PM
To: user
Subject: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

Hi, is there any possibility to tell the scheduler to blacklist specific nodes in 
advance?


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Li Gao
On YARN it is impossible, AFAIK. On Kubernetes you can use taints to keep
certain nodes off-limits to Spark.

On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung wrote:

> Not as far as I recall...
>
>
> --
> From: Serega Sheypak
> Sent: Friday, January 18, 2019 3:21 PM
> To: user
> Subject: Spark on Yarn, is it possible to manually blacklist nodes
> before running spark job?
>
> Hi, is there any possibility to tell the scheduler to blacklist specific nodes
> in advance?
>


Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Shivam Sharma
Hi All,

I want to persist a DataFrame on HDFS. Basically, I am inserting data into
a Hive table using Spark. Currently, at the time of writing to the Hive table I
have set the total shuffle partitions to 400, so 400 files are being
created, which does not take the HDFS block size into account. How can I tell
Spark to persist according to HDFS blocks?

We have settings like these in Hive which solve this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=204800;
set hive.merge.size.per.task=409600;

Thanks

-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:- https://www.linkedin.com/in/28shivamsharma