So it was something obvious, thanks!

-Perttu

On Thu, 10 November 2016 at 21:19, Davies Liu <dav...@databricks.com> wrote:

> On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho <ranta...@iki.fi> wrote:
> > Hello,
> >
> > I want to create a UDF which modifies one column value depending on the
> > value of some other column. But the Python version of the code always
> > fails in the column value comparison. Below are simple examples; the
> > Scala version works as expected but the Python version throws an
> > exception. Am I missing something obvious? As can be seen from the
> > PySpark exception, I'm using Spark 2.0.1.
> >
> > -Perttu
> >
> > import org.apache.spark.sql.functions.udf
> > val df = spark.createDataFrame(List(("a",1), ("b",2), ("c", 3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
> > def myUdf = udf((name: String, value: Int) => {if (name == "c") { value * 2 } else { value }})
> > df.withColumn("udf", myUdf(df("name"), df("value"))).show()
> > +----+-----+---+
> > |name|value|udf|
> > +----+-----+---+
> > |   a|    1|  1|
> > |   b|    2|  2|
> > |   c|    3|  6|
> > +----+-----+---+
> >
> >
> > from pyspark.sql.types import StringType, IntegerType
> > import pyspark.sql.functions as F
> >
> > df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)), ('name','value'))
> >
> > def my_udf(name, value):
> >     if name == 'c':
> >         return value * 2
> >     return value
> > F.udf(my_udf, IntegerType())
>
> udf = F.udf(my_udf, IntegerType())
> df.withColumn("udf", udf(df.name, df.value)).show()
>
> >
> > df.withColumn("udf", my_udf(df.name, df.value)).show()
> >
> > ---------------------------------------------------------------------------
> > ValueError                                Traceback (most recent call last)
> > <ipython-input-6-10032e941fc4> in <module>()
> > ----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()
> >
> > <ipython-input-5-c103a6066373> in my_udf(name, value)
> >       3
> >       4 def my_udf(name, value):
> > ----> 5     if name == 'c':
> >       6         return value * 2
> >       7     return value
> >
> > /home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in __nonzero__(self)
> >     425
> >     426     def __nonzero__(self):
> > --> 427         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
> >     428                          "'~' for 'not' when building DataFrame boolean expressions.")
> >     429     __bool__ = __nonzero__
> >
> > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
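
For anyone finding this thread later, here is a rough sketch, without Spark, of why the direct call fails. It assumes only what the traceback shows (a column object refuses to be converted to a bool) plus the fact that comparing a column with == builds a new column expression rather than True or False. FakeColumn, call_directly and call_per_row are hypothetical names made up for this illustration; they are not part of PySpark, and the per-row loop is only a simplification of what a wrapped UDF does.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column; not the real class."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparison builds a new expression object instead of a bool.
        return FakeColumn("({} = {!r})".format(self.expr, other))

    def __bool__(self):
        # Mirrors the __nonzero__ seen in the traceback (message shortened).
        raise ValueError("Cannot convert column into bool")

    __nonzero__ = __bool__  # Python 2 spelling, as in column.pyc


def my_udf(name, value):
    # Same plain Python function as in the original message.
    if name == 'c':
        return value * 2
    return value


def call_directly():
    # What the failing line did: the plain function receives column objects,
    # so the `if` tries to turn a column expression into a bool.
    try:
        my_udf(FakeColumn("name"), FakeColumn("value"))
    except ValueError as exc:
        return str(exc)
    return None


def call_per_row(rows):
    # Simplified view of a wrapped UDF: the function sees plain row values.
    return [my_udf(name, value) for name, value in rows]


rows = [('a', 1), ('b', 2), ('c', 3)]
direct_error = call_directly()
per_row = call_per_row(rows)
print(direct_error)
print(per_row)
```

In PySpark itself the fix is the one Davies gave: keep the object returned by F.udf(my_udf, IntegerType()) and call that on the columns, instead of calling my_udf directly.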