When debugging some behavior on my YARN cluster, I wrote the following PySpark UDF to figure out which host was operating on which row of data:
```python
from pyspark.sql import functions as F, types as T

@F.udf(T.StringType())
def add_hostname(x):
    import socket  # imported inside the UDF so it resolves on the executors
    return str(socket.gethostname())
```

It occurred to me that I could use this to enforce node-locality for other operations:

```python
df.withColumn('host', add_hostname(F.lit(1))).groupBy('host').apply(...)
```

When working on a big job without obvious partition keys, this seems like a very straightforward way to avoid a shuffle, but it seems too easy. What problems would I introduce by trying to partition on hostname like this?
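In case it helps, here's a minimal runnable sketch of the pattern I mean. The toy DataFrame, the column names, and the row-count aggregation inside `applyInPandas` are just placeholders for illustration (`applyInPandas` being the current spelling of the grouped-map `groupBy(...).apply(...)` pattern above):

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("hostname-partitioning").getOrCreate()

@F.udf(T.StringType())
def add_hostname(x):
    import socket  # resolved on the executor, not the driver
    return socket.gethostname()

# Toy data standing in for the real job's input.
df = spark.range(1000).withColumnRenamed("id", "value")

# Tag each row with the executor host that happened to process it,
# then group on that tag.
result = (
    df.withColumn("host", add_hostname(F.lit(1)))
      .groupBy("host")
      .applyInPandas(
          lambda pdf: pd.DataFrame({"host": [pdf["host"].iloc[0]],
                                    "rows": [len(pdf)]}),
          schema="host string, rows long",
      )
)

# One output row per executor host, with the number of rows it processed.
result.show()
```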