Fwd: Repartition vs PartitionBy Help/Understanding needed

2017-06-16 Thread Aakash Basu
Hi all, Can somebody shed some light on this, please? Thanks, Aakash. -- Forwarded message -- From: "Aakash Basu" <aakash.spark@gmail.com> Date: 15-Jun-2017 2:57 PM Subject: Repartition vs PartitionBy Help/Understanding needed To: "user" <user@

Repartition vs PartitionBy Help/Understanding needed

2017-06-15 Thread Aakash Basu
Hi all, Everybody explains the difference between coalesce and repartition, but nowhere have I found the difference between partitionBy and repartition. My question is: is it better to write a dataset in Parquet, partitioned by a column, and then read the respective directories to work on that column

Re: repartition vs partitionby

2015-10-18 Thread shahid ashraf
Yes, I am trying to do so, but it will try to repartition the whole dataset. Can't we split a large (skewed) partition into multiple partitions? Any ideas on this? On Sun, Oct 18, 2015 at 1:55 AM, Adrian Tanase wrote: > If the dataset allows it you can try to write a

repartition vs partitionby

2015-10-17 Thread shahid qadri
Hi folks, I need to repartition a large dataset (around 300G), as I see that some portions hold a large share of the data (data skew). I have pair RDDs [({},{}),({},{}),({},{})]. What is the best way to solve the problem?

Re: repartition vs partitionby

2015-10-17 Thread Raghavendra Pandey
You can use the coalesce function if you want to reduce the number of partitions. It minimizes the data shuffle. -Raghav On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri wrote: > Hi folks > > I need to repartition large set of data around(300G) as i see some portions >

Re: repartition vs partitionby

2015-10-17 Thread shahid ashraf
Yes, I know about that; it is for reducing the number of partitions. The point here is that the data is skewed toward a few partitions. On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > You can use coalesce function, if you want to reduce the number of > partitions. This one

Re: repartition vs partitionby

2015-10-17 Thread Adrian Tanase
If the dataset allows it, you can try to write a custom partitioner to help Spark distribute the data more uniformly. On 17 Oct 2015, at 16:14, shahid ashraf wrote: yes i know about that,its in case to reduce partitions. the
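
To make the custom-partitioner suggestion concrete, here is a minimal sketch in plain Python. It is not code from the thread. It assumes the hot keys are already known, and every name in it (`stable_hash`, `salt_key`, `partition_for`, the sample records) is hypothetical. The idea is to attach a small "salt" to each hot key so that its records can be routed to several partitions instead of one.

```python
import zlib
from collections import Counter


def stable_hash(value) -> int:
    """Deterministic hash; Python's built-in hash() of a str varies per process."""
    return zlib.crc32(repr(value).encode("utf-8"))


def salt_key(key, seq, hot_keys, fanout):
    """Return (key, salt). Hot keys cycle through `fanout` salts, others get 0."""
    salt = seq % fanout if key in hot_keys else 0
    return (key, salt)


def partition_for(salted_key, num_partitions):
    """Partition function: offset the key's base partition by its salt."""
    key, salt = salted_key
    return (stable_hash(key) + salt) % num_partitions


# Hypothetical skewed input: one key owns 900 of the 1000 records.
records = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]
num_partitions = 8

# Without salting, every "hot" record lands in the same partition.
plain = Counter(partition_for((k, 0), num_partitions) for k, _ in records)

# With salting, the "hot" records are spread over 8 partitions.
salted = Counter(
    partition_for(salt_key(k, i, {"hot"}, 8), num_partitions)
    for i, (k, _) in enumerate(records)
)

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

In PySpark, a function of this shape could be passed as the `partitionFunc` argument of `RDD.partitionBy(numPartitions, partitionFunc)` after mapping each record's key to its salted form. Any per-key aggregation would then need a second step that merges the partial results for the different salts of the same key.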