You can do this in 2 passes (not one):
A) Save your dataset into HDFS as it is now.
B) Calculate the number of partitions: n = (size of your dataset) / (HDFS block size).
Then run a simple Spark job that reads the data back and repartitions it based on 'n'.
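Something like this rough sketch (untested; the helper name, paths, and Parquet format are my own placeholders, not anything from your job):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Pass 1: write the dataset as-is. Pass 2: read it back and rewrite it
// with n = (total size / HDFS block size) partitions.
def repackToBlockSize(df: DataFrame, stagingPath: String, finalPath: String)
                     (implicit spark: SparkSession): Unit = {
  df.write.mode("overwrite").parquet(stagingPath)

  val path = new Path(stagingPath)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(path).getLength
  val blockSize = fs.getDefaultBlockSize(path) // typically 128 MB
  val n = math.max(1, math.ceil(totalBytes.toDouble / blockSize).toInt)

  spark.read.parquet(stagingPath)
    .repartition(n)
    .write.mode("overwrite").parquet(finalPath)
}

Parquet is just an example here; the same idea works for whatever format your table uses.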
Hichame
From: felixcheun...@hotmail.com
Sent: January 19, 2019 2:06 PM
You can call coalesce to combine partitions.
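For example (a sketch only; 'n' is assumed to be your target file count, e.g. dataset size divided by HDFS block size, and the table name is made up):

// Combine the shuffle output into 'n' files before inserting into the Hive table.
// coalesce(n) reduces the partition count without another full shuffle.
df.coalesce(n)
  .write
  .mode("overwrite")            // or leave the mode off to append
  .insertInto("my_db.my_table") // hypothetical table name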
From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size.
Hi All,
I wanted to persist a dataframe on HDFS. Basically, I am inserting data into a Hive table using Spark. Currently, at the time of writing to the Hive table I have set total shuffle partitions = 400, so 400 files are being created without any regard to the HDFS block size. How can I tell Spark to take the HDFS block size into account when writing, so the output files are sized accordingly?
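Roughly, the current write looks like this (simplified sketch; the table and column names are made up, not the actual job):

// With spark.sql.shuffle.partitions = 400, the shuffle before the write
// produces 400 output files regardless of how small each file ends up.
spark.conf.set("spark.sql.shuffle.partitions", "400")

val result = spark.table("staging.events")   // hypothetical source table
  .groupBy("event_date")                     // any wide transformation shuffles
  .count()

result.write.mode("overwrite").insertInto("warehouse.events_agg") // hypothetical Hive table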
To clarify, YARN actually supports excluding nodes right when requesting
resources. It's Spark that doesn't provide a way to populate such a blacklist.
If you can change the YARN config, the equivalent is node labels:
https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
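For example, if the nodes Spark should stay on carry a label, the YARN-mode configs can pin the AM and executors to it (a sketch; the label name "spark_only" is a placeholder for whatever label you set up in YARN):

import org.apache.spark.sql.SparkSession

// Restrict Spark's YARN containers to nodes carrying the "spark_only" label.
// In cluster mode these are usually passed as --conf flags to spark-submit instead.
val spark = SparkSession.builder()
  .appName("labelled-run")
  .config("spark.yarn.am.nodeLabelExpression", "spark_only")
  .config("spark.yarn.executor.nodeLabelExpression", "spark_only")
  .getOrCreate()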
On YARN it is impossible AFAIK. On Kubernetes you can use taints to keep
certain nodes outside of Spark.
On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung wrote:
> Not as far as I recall...
>
>
> --
> From: Serega Sheypak
> Sent: Friday, January 18, 2019 3:21 PM
> To: