Hi all,
Can somebody shed some light on this, please?
Thanks,
Aakash.
-- Forwarded message --
From: "Aakash Basu" <aakash.spark@gmail.com>
Date: 15-Jun-2017 2:57 PM
Subject: Repartition vs PartitionBy Help/Understanding needed
To: "user" <user@
Hi all,
Everybody explains the difference between coalesce and repartition, but
nowhere could I find the difference between partitionBy and repartition. My
question is: is it better to write a data set to Parquet partitioned by a
column, and then read the respective directories to work on that column?
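To make my question concrete, here is my rough mental model as a plain-Python sketch (no Spark involved; the column name `country` and the toy rows are made up for illustration):

```python
from collections import defaultdict

rows = [
    {"country": "IN", "value": 1},
    {"country": "US", "value": 2},
    {"country": "IN", "value": 3},
]

# partitionBy("country") on a Parquet write: rows are grouped into one
# directory per column value (country=IN/, country=US/, ...), so a later
# read of one directory touches only that column value.
on_disk = defaultdict(list)
for row in rows:
    on_disk["country=%s" % row["country"]].append(row)

# repartition(n): rows are redistributed (with a shuffle) into n
# in-memory partitions by hash of the key; the file layout on disk
# does not change.
n = 2
in_memory = defaultdict(list)
for row in rows:
    in_memory[hash(row["country"]) % n].append(row)

print(sorted(on_disk))  # one "directory" per distinct country value
```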
Yes, I am trying to do so, but it will try to repartition the whole data.
Can't we split a large (data-skewed) partition into multiple partitions?
Any ideas on this?
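To make the idea concrete, here is a rough "key salting" sketch in plain Python (no Spark; the key names and the bucket count are made up): the hot key gets a random suffix so its records spread over several partitions, and the partial results are combined afterwards.

```python
import random
from collections import Counter

# A skewed pair RDD in miniature: one hot key holds most of the records.
pairs = [("hot", i) for i in range(100)] + [("cold", 0), ("warm", 1)]

SALT_BUCKETS = 4  # how many ways to split the hot key

def salt(key):
    # Append a random suffix to the hot key, so its records hash to up
    # to SALT_BUCKETS different partitions instead of landing in one.
    if key == "hot":
        return "%s_%d" % (key, random.randrange(SALT_BUCKETS))
    return key

salted = [(salt(k), v) for k, v in pairs]

# Aggregate on the salted keys (this is the per-partition work) ...
partial = Counter(k for k, _ in salted)

# ... then strip the salt and combine the partial results.
final = Counter()
for k, count in partial.items():
    final[k.split("_")[0]] += count

print(final["hot"])  # still 100 records in total, just spread out
```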
On Sun, Oct 18, 2015 at 1:55 AM, Adrian Tanase wrote:
> If the dataset allows it you can try to write a
Hi folks
I need to repartition a large set of data (around 300 GB), as I see some
portions have a lot of data (data skew).
I have pair RDDs: [({},{}),({},{}),({},{})]
What is the best way to solve the problem?
You can use the coalesce function if you want to reduce the number of
partitions. It minimizes the data shuffle.
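Roughly, in plain-Python terms (toy data, no Spark; a sketch of the idea rather than Spark's actual implementation): coalesce only merges existing partitions locally, while repartition re-hashes every record and may move all of them.

```python
# Six input partitions of a toy dataset.
partitions = [[1, 2], [3], [4, 5], [6], [7], [8, 9]]

def coalesce(parts, n):
    # coalesce(n): adjacent partitions are merged; records never leave
    # their original grouping, so there is no full shuffle.
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i * n // len(parts)].extend(p)
    return out

def repartition(parts, n):
    # repartition(n): every record is re-hashed to a target partition,
    # which implies a full shuffle of the data.
    out = [[] for _ in range(n)]
    for p in parts:
        for x in p:
            out[x % n].append(x)
    return out

print(coalesce(partitions, 3))    # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(repartition(partitions, 3)) # [[3, 6, 9], [1, 4, 7], [2, 5, 8]]
```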
-Raghav
On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri wrote:
> Hi folks
>
> I need to repartition a large set of data (around 300 GB), as I see some portions
>
Yes, I know about that; that's for the case of reducing partitions. The
point here is that the data is skewed into a few partitions.
On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:
> You can use coalesce function, if you want to reduce the number of
> partitions. This one
If the dataset allows it, you can try to write a custom partitioner to help
Spark distribute the data more uniformly.
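A minimal sketch of that idea in plain Python, mirroring Spark's Partitioner contract (a number of partitions plus a key-to-partition function); the hot-key list and routing rule here are made up for illustration:

```python
class SkewAwarePartitioner:
    """Mimics Spark's Partitioner contract: numPartitions plus
    getPartition(key). Known hot keys get dedicated partitions;
    all other keys are hash-partitioned over the remaining ones."""

    def __init__(self, num_partitions, hot_keys):
        self.num_partitions = num_partitions
        # Reserve the first len(hot_keys) partitions for the hot keys.
        self.hot = {k: i for i, k in enumerate(hot_keys)}

    def get_partition(self, key):
        if key in self.hot:
            return self.hot[key]
        rest = self.num_partitions - len(self.hot)
        return len(self.hot) + hash(key) % rest

p = SkewAwarePartitioner(8, hot_keys=["user_42"])
print(p.get_partition("user_42"))  # 0 -- its own reserved partition
```

In real Spark you would subclass org.apache.spark.Partitioner (or pass a partition function to partitionBy on a pair RDD) with the same shape.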
Sent from my iPhone
On 17 Oct 2015, at 16:14, shahid ashraf wrote:
> Yes, I know about that; that's for the case of reducing partitions. The