Hello,

I want to create a UDF that modifies one column's value depending on the
value of another column, but the Python version of the code always fails at
the column value comparison. Below are two simple examples: the Scala version
works as expected, but the Python version throws an exception. Am I missing
something obvious? As the PySpark traceback shows, I'm using Spark 2.0.1.

-Perttu

import org.apache.spark.sql.functions.udf
val df = spark.createDataFrame(List(("a", 1), ("b", 2), ("c", 3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
def myUdf = udf((name: String, value: Int) => { if (name == "c") { value * 2 } else { value } })
df.withColumn("udf", myUdf(df("name"), df("value"))).show()
+----+-----+---+
|name|value|udf|
+----+-----+---+
|   a|    1|  1|
|   b|    2|  2|
|   c|    3|  6|
+----+-----+---+


from pyspark.sql.types import StringType, IntegerType
import pyspark.sql.functions as F

df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)),
('name','value'))

def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value
F.udf(my_udf, IntegerType())

df.withColumn("udf", my_udf(df.name, df.value)).show()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-10032e941fc4> in <module>()
----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()

<ipython-input-5-c103a6066373> in my_udf(name, value)
      3
      4 def my_udf(name, value):
----> 5     if name == 'c':
      6         return value * 2
      7     return value

/home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in __nonzero__(self)
    425
    426     def __nonzero__(self):
--> 427         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    428                          "'~' for 'not' when building DataFrame boolean expressions.")
    429     __bool__ = __nonzero__

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
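
A note on the traceback above: the failing frame is my_udf itself, which suggests the plain Python function is being called directly with Column objects. F.udf returns a new callable and does not modify the function passed to it, so its return value most likely needs to be kept and used, e.g. my_udf_col = F.udf(my_udf, IntegerType()) followed by df.withColumn("udf", my_udf_col(df.name, df.value)), where my_udf_col is just an illustrative name. The sketch below reproduces the mechanism without Spark; FakeColumn and fake_udf are hypothetical stand-ins written for this example, not PySpark APIs.

```python
class FakeColumn(object):
    """Hypothetical stand-in for a Spark Column: comparisons build
    expressions instead of returning True/False."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Like a Column, '==' yields another column expression.
        return FakeColumn("({} = {!r})".format(self.expr, other))

    def __bool__(self):
        # Like a Column, it refuses to be used as a Python boolean.
        raise ValueError("Cannot convert column into bool")

    __nonzero__ = __bool__  # Python 2 spelling of the same hook


def fake_udf(func):
    """Hypothetical stand-in for F.udf: returns a NEW callable and
    leaves func itself untouched."""

    def wrapper(*cols):
        # A real UDF defers func to the rows; here we only record the call.
        args = ", ".join(c.expr for c in cols)
        return FakeColumn("{}({})".format(func.__name__, args))

    return wrapper


def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value


# Same pattern as in the post: the wrapper is created and then discarded,
# so my_udf is still the plain Python function.
fake_udf(my_udf)

try:
    my_udf(FakeColumn("name"), FakeColumn("value"))
    outcome = "no error"
except ValueError as exc:
    outcome = str(exc)

# Keeping the returned callable and calling that instead avoids the error.
wrapped = fake_udf(my_udf)
result = wrapped(FakeColumn("name"), FakeColumn("value"))

print(outcome)
print(result.expr)
```

In the direct call, name == 'c' produces a column expression rather than a boolean, and the if statement then triggers the ValueError; the wrapped callable never evaluates the comparison on column objects at all.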
