Hello, I want to create a UDF that modifies one column's value depending on the value of another column, but the Python version of the code always fails at the column value comparison. Below are simple examples: the Scala version works as expected, but the Python version throws an exception. Am I missing something obvious? As can be seen from the PySpark exception, I'm using Spark 2.0.1.
-Perttu

import org.apache.spark.sql.functions.udf

val df = spark.createDataFrame(List(("a",1), ("b",2), ("c", 3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
def myUdf = udf((name: String, value: Int) => {if (name == "c") { value * 2 } else { value }})
df.withColumn("udf", myUdf(df("name"), df("value"))).show()

+----+-----+---+
|name|value|udf|
+----+-----+---+
|   a|    1|  1|
|   b|    2|  2|
|   c|    3|  6|
+----+-----+---+

from pyspark.sql.types import StringType, IntegerType
import pyspark.sql.functions as F

df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)), ('name','value'))

def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value

F.udf(my_udf, IntegerType())
df.withColumn("udf", my_udf(df.name, df.value)).show()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-10032e941fc4> in <module>()
----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()

<ipython-input-5-c103a6066373> in my_udf(name, value)
      3
      4 def my_udf(name, value):
----> 5     if name == 'c':
      6         return value * 2
      7     return value

/home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in __nonzero__(self)
    425
    426     def __nonzero__(self):
--> 427         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    428                          "'~' for 'not' when building DataFrame boolean expressions.")
    429     __bool__ = __nonzero__

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
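
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the Python snippet the result of F.udf(my_udf, IntegerType()) is discarded, so the later call my_udf(df.name, df.value) runs the plain Python function on Column objects, and the comparison name == 'c' is then evaluated as a bool, which matches the traceback. Keeping the object returned by F.udf and calling that instead should avoid the error. The names double_if_c, add_udf_column and add_when_column below are made up for illustration, and df is assumed to be the DataFrame from the post.

```python
def double_if_c(name, value):
    # Plain Python logic. Once wrapped by F.udf it receives ordinary
    # str/int values for each row, not Column objects.
    if name == 'c':
        return value * 2
    return value


def add_udf_column(df):
    # pyspark is imported inside the function so that double_if_c can be
    # exercised on its own without a Spark installation.
    import pyspark.sql.functions as F
    from pyspark.sql.types import IntegerType

    # Keep the wrapper returned by F.udf and call it, not the raw function.
    double_if_c_udf = F.udf(double_if_c, IntegerType())
    return df.withColumn("udf", double_if_c_udf(df.name, df.value))


def add_when_column(df):
    # Alternative without a UDF: express the same rule with built-in
    # column expressions.
    import pyspark.sql.functions as F

    return df.withColumn(
        "udf", F.when(df.name == 'c', df.value * 2).otherwise(df.value))
```

In a PySpark session, add_udf_column(df).show() or add_when_column(df).show() would be the call to try. The second form is usually preferable when the rule can be written with built-in functions, since it avoids passing each row through Python.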