Are we not supposed to be using UDFs anymore? I copied an example straight from 
a book and I'm getting weird results, and I think it's because the book was 
written for a much older version of Spark. The code below is pretty 
straightforward, but I'm getting an error nonetheless. I've been doing a bunch 
of googling and not finding much.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("full201801.dat", header="true")

# map the product code to a description, defaulting to 'foo'
columntransform = udf(
    lambda x: 'Non-Fat Dry Milk' if x == '23040010' else 'foo',
    StringType())

df.select(df.PRODUCT_NC,
          columntransform(df.PRODUCT_NC).alias('COMMODITY')).show()

Error:
Py4JJavaError: An error occurred while calling o110.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 242, in main
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 144, in read_udfs
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 120, in read_single_udf
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 60, in read_command
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 171, in _read_with_length
    return self.loads(obj)
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 566, in loads
    return pickle.loads(obj, encoding=encoding)
TypeError: _fill_function() missing 4 required positional arguments: 'defaults', 'dict', 'module', and 'closure_values'
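
Part of what I'm asking is whether the idiomatic way now is to skip the Python 
UDF entirely and do the mapping with built-in column functions. Something like 
this is what I had in mind (just a sketch using when/otherwise, I haven't 
verified it gives the same output):

from pyspark.sql.functions import when, col

# same mapping as the UDF above: '23040010' -> 'Non-Fat Dry Milk', anything else -> 'foo'
df.select(
    col("PRODUCT_NC"),
    when(col("PRODUCT_NC") == "23040010", "Non-Fat Dry Milk")
        .otherwise("foo")
        .alias("COMMODITY")
).show()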


B.


