Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead. The ambition is to say "divide the data into partitions, but make sure you don't move it in doing so."
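
A minimal sketch of the distinction in question (data and sizes are illustrative): repartition(n) inserts a full shuffle into the plan, while coalesce(n) only merges existing partitions in place, which is the closest built-in approximation to "divide without moving":

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)     # placeholder data

    df.repartition(5).explain()     # plan contains Exchange RoundRobinPartitioning(5)
    df.coalesce(5).explain()        # plan contains Coalesce 5, no Exchange

Note that coalesce can only reduce the partition count and may produce uneven partitions, so it is not a drop-in replacement in every case.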

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead.
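
One way to check whether the shuffle actually happens (a sketch, assuming an existing DataFrame df) is to look for an Exchange node in the physical plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)     # placeholder data

    df.repartition(8).explain()
    # "Exchange RoundRobinPartitioning(8)" in the output confirms a full shuffle;
    # the shuffle read/write columns in the Spark UI's Stages tab show the same thing.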

Re: [External Sender] Pitfalls of partitioning by host?

2018-08-28 Thread Jayesh Lalwani
If you group by the host that you have computed using the UDF, Spark is always going to shuffle your dataset, even if the end result is that all the new partitions look exactly like the old partitions, just placed on different nodes. Remember, the hostname will probably hash differently than the
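
A sketch of the behaviour Jayesh describes (column names are illustrative): even though every row in a partition already carries the same hostname, grouping by that computed column hash-partitions the data by its value, so the plan still contains an Exchange:

    import socket
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    @F.udf(T.StringType())
    def add_hostname(x):
        return socket.gethostname()

    df = spark.range(1_000_000)
    (df.withColumn("host", add_hostname(F.col("id")))
       .groupBy("host")
       .count()
       .explain())   # plan contains Exchange hashpartitioning(host, ...)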

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Sonal Goyal
Hi Patrick, Sorry, is there something here that helps you beyond repartition(number of partitions) or calling your UDF on foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies
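
A sketch of the per-partition route mentioned above (function and column names are illustrative): the work runs once per existing partition, with no host column and no extra shuffle:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)     # placeholder data

    def process_partition(rows):
        # rows is an iterator over the Rows already co-located in this partition
        for row in rows:
            yield (row.id, row.id * 2)     # placeholder per-row work

    result = df.rdd.mapPartitions(process_partition).toDF(["id", "doubled"])
    df.foreachPartition(lambda rows: None)  # foreachPartition variant, for side effects only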

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
Mostly I'm guessing that it adds efficiency to a job where partitioning is required but shuffling is not. For example, if I want to apply a UDF to 1TB of records on disk, I might need to repartition(5) to get the task size down to something acceptable for my cluster. If I don't care that it's
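
A sketch of the workflow Patrick describes (path, column name, and partition count are illustrative): repartition first so each task handles a manageable slice, then apply the UDF:

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    @F.udf(T.DoubleType())
    def heavy_udf(x):
        return float(x) ** 0.5              # stand-in for the real per-row work

    df = spark.read.parquet("/path/to/records")   # hypothetical input
    num_partitions = 5000                         # tuned to data size and cluster capacity
    out = df.repartition(num_partitions).withColumn("score", heavy_udf(F.col("id")))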

Re: Pitfalls of partitioning by host?

2018-08-27 Thread Michael Artz
Well, if we think of shuffling as a necessity to perform an operation, then the problem would be that you are adding an aggregation stage to a job that is going to get shuffled anyway. Like if you need to join two datasets, then Spark will still shuffle the data, whether they are grouped by
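
A sketch of Michael's point (with broadcast joins disabled so the shuffle is visible in the plan): the join repartitions both sides by the join key, regardless of any earlier host-based grouping:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   # force a shuffle join

    left = spark.range(1_000_000).withColumnRenamed("id", "key")
    right = spark.range(1_000_000).withColumnRenamed("id", "key")

    left.join(right, "key").explain()
    # plan shows Exchange hashpartitioning(key, ...) on both sides of the SortMergeJoin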

Pitfalls of partitioning by host?

2018-08-27 Thread Patrick McCarthy
When debugging some behavior on my YARN cluster I wrote the following PySpark UDF to figure out what host was operating on what row of data:

    @F.udf(T.StringType())
    def add_hostname(x):
        import socket
        return str(socket.gethostname())

It occurred to me that I could use this to enforce
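
A sketch of how the idea in the original message might be used (data and column names are illustrative); note that repartitioning on the computed host column itself triggers the shuffle discussed in the replies above:

    import socket
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    @F.udf(T.StringType())
    def add_hostname(x):
        return str(socket.gethostname())

    df = spark.range(1_000_000)                           # placeholder data
    df_tagged = df.withColumn("host", add_hostname(F.col("id")))
    df_tagged.groupBy("host").count().show()              # how rows are spread across executors
    df_by_host = df_tagged.repartition("host")            # this itself causes a shuffle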