While debugging some behavior on my YARN cluster, I wrote the following
PySpark UDF to figure out which host was processing each row of data:

from pyspark.sql import functions as F, types as T

@F.udf(T.StringType())
def add_hostname(x):
    # imported inside the body so it resolves when the UDF runs on executors
    import socket
    return socket.gethostname()
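
For debugging I just tack the column on and look at the distinct values
(F.lit(1) is a dummy argument here, since the UDF ignores its input):

df.withColumn('host', add_hostname(F.lit(1))).select('host').distinct().show()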

It occurred to me that I could use this to enforce node-locality for other
operations:

df.withColumn('host', add_hostname(F.lit(1))).groupBy('host').apply(...)
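
Spelled out as a self-contained sketch (the toy df and the summarize
function are just placeholders for my real data and per-group work, and
applyInPandas is the Spark 3 replacement for apply):

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for a real frame with no obvious partition key.
df = spark.createDataFrame([(i,) for i in range(100)], ['value'])

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each pdf holds the rows whose UDF call ran on one particular host.
    return pd.DataFrame({'host': [pdf['host'].iloc[0]],
                         'n_rows': [len(pdf)]})

per_host = (
    df.withColumn('host', add_hostname(F.lit(1)))
      .groupBy('host')
      .applyInPandas(summarize, schema='host string, n_rows long')
)
per_host.show()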

When working on a big job without obvious partition keys, this looks like a
very straightforward way to avoid a shuffle, but it feels too easy.

What problems would I introduce by trying to partition on hostname like
this?
