Are we not supposed to be using UDFs anymore? I copied an example straight from a book, and I think I'm getting weird results because the book targets a much older version of Spark. The code below is pretty straightforward, but I'm getting an error nonetheless. I've done a bunch of googling and haven't found much.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("full201801.dat", header="true")

columntransform = udf(lambda x: 'Non-Fat Dry Milk' if x == '23040010' else 'foo', StringType())

df.select(df.PRODUCT_NC, columntransform(df.PRODUCT_NC).alias('COMMODITY')).show()

The error:

Py4JJavaError: An error occurred while calling o110.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 242, in main
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 144, in read_udfs
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 120, in read_single_udf
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 60, in read_command
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 171, in _read_with_length
    return self.loads(obj)
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 566, in loads
    return pickle.loads(obj, encoding=encoding)
TypeError: _fill_function() missing 4 required positional arguments: 'defaults', 'dict', 'module', and 'closure_values'
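For what it's worth, if the answer is that Python UDFs should be avoided where possible, I think I can express the same mapping with the built-in when/otherwise column functions, which (as I understand it) stay in the JVM and never pickle a Python function out to the workers. This is just my own sketch of a UDF-free version, not something from the book:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("full201801.dat", header="true")

# Same lookup as the lambda above, but built from column expressions,
# so no Python code needs to be serialized to the executors.
df.select(
    col("PRODUCT_NC"),
    when(col("PRODUCT_NC") == "23040010", "Non-Fat Dry Milk")
        .otherwise("foo")
        .alias("COMMODITY"),
).show()

If nothing else, that should sidestep the serializer path in the traceback above, but I'd still like to understand why the UDF version fails.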