I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If
this is actually happening, it's just wasteful overhead. The ambition is to
say "divide the data into partitions, but make sure you don't move it in
doing so".
If you group by the host that you have computed using the UDF, Spark is
always going to shuffle your dataset, even if the end result is that all
the new partitions look exactly like the old partitions, just placed on
different nodes. Remember, the hostname will probably hash differently
than the key your data is currently partitioned by, so the rows will move
anyway.
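A sketch of what that looks like in the plan, reusing the add_hostname UDF
from the original message (df is any DataFrame):

import pyspark.sql.functions as F

with_host = df.withColumn("host", add_hostname(F.lit(1)))

# The plan contains Exchange hashpartitioning(host, ...): rows are
# redistributed by the hash of the hostname, no matter where they
# already sit.
with_host.groupBy("host").count().explain()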
Hi Patrick,
Sorry, is there something here that helps you beyond repartition(number of
partitions) or calling your UDF on foreachPartition? If your data is on
disk, Spark is already partitioning it for you by rows. How is adding the
host info helping?
Thanks,
Sonal
Nube Technologies
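For the foreachPartition/mapPartitions route, a minimal sketch (assumes df has
an id column; the function runs once per existing partition, on whichever
executor already holds it, so no shuffle is triggered):

import socket

def tag_partition(rows):
    # Executes on the executor that holds this partition.
    host = socket.gethostname()
    for row in rows:
        yield (host, row["id"])

tagged = df.rdd.mapPartitions(tag_partition)
print(tagged.take(5))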
Mostly I'm guessing that it adds efficiency to a job where partitioning is
required but shuffling is not.
For example, if I want to apply a UDF to 1 TB of records on disk, I might
need to repartition(5) to get the task size down to an acceptable size
for my cluster. If I don't care that it's
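If the only goal is to control per-task input size when reading from disk, the
split size can also be set at read time instead of repartitioning afterwards.
A sketch, where the path, my_udf and some_col are placeholders:

# Smaller input splits -> more, smaller tasks, and no shuffle stage.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # 128 MB

big = spark.read.parquet("/data/records")
result = big.withColumn("out", my_udf("some_col"))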
Well, if we think of shuffling as a necessity to perform an operation, then
the problem would be that you are adding an aggregation stage to a job that
is going to get shuffled anyway. For example, if you need to join two
datasets, Spark will still shuffle the data, whether they are grouped by
host beforehand or not.
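A sketch of that point: even after repartition("host"), a join on another key
inserts its own Exchange on that key. Broadcast joins are disabled here only so
the shuffle shows up in the plan; df and add_hostname reuse the earlier sketches:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

left = (df.withColumnRenamed("id", "key")
          .withColumn("host", add_hostname(F.lit(1)))
          .repartition("host"))
right = spark.range(1000).withColumnRenamed("id", "key")

# Both sides show Exchange hashpartitioning(key, ...): the earlier
# repartition("host") does not survive the join.
left.join(right, "key").explain()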
When debugging some behavior on my YARN cluster I wrote the following
PySpark UDF to figure out what host was operating on what row of data:
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.udf(T.StringType())
def add_hostname(x):
    import socket  # imported inside the UDF so it resolves on every executor
    return str(socket.gethostname())
It occurred to me that I could use this to enforce
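A usage sketch for the UDF above: tag every row with the executor host and count
rows per host to see how the data is spread (df is any DataFrame; the UDF ignores
its argument, so a literal is fine):

per_host = (df.withColumn("host", add_hostname(F.lit(1)))
              .groupBy("host")
              .count())
per_host.show(truncate=False)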