On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho <ranta...@iki.fi> wrote:
> Hello,
>
> I want to create a UDF which modifies one column's value depending on the
> value of some other column. But the Python version of the code always fails
> in the column value comparison. Below are simple examples; the Scala version
> works as expected but the Python version throws an exception. Am I missing
> something obvious? As can be seen from the PySpark exception, I'm using
> Spark 2.0.1.
>
> -Perttu
>
> import org.apache.spark.sql.functions.udf
> val df = spark.createDataFrame(List(("a",1), ("b",2), ("c",
> 3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
> def myUdf = udf((name: String, value: Int) => {if (name == "c") { value * 2
> } else { value }})
> df.withColumn("udf", myUdf(df("name"), df("value"))).show()
> +----+-----+---+
> |name|value|udf|
> +----+-----+---+
> |   a|    1|  1|
> |   b|    2|  2|
> |   c|    3|  6|
> +----+-----+---+
>
>
> from pyspark.sql.types import StringType, IntegerType
> import pyspark.sql.functions as F
>
> df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)),
> ('name','value'))
>
> def my_udf(name, value):
>     if name == 'c':
>         return value * 2
>     return value
> F.udf(my_udf, IntegerType())

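F.udf() does not change my_udf in place; it returns a new, wrapped callable,
and the line above throws that result away. Keep the return value and call
that instead of the plain Python function:

udf = F.udf(my_udf, IntegerType())
df.withColumn("udf", udf(df.name, df.value)).show()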

>
> df.withColumn("udf", my_udf(df.name, df.value)).show()
>
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-6-10032e941fc4> in <module>()
> ----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()
>
> <ipython-input-5-c103a6066373> in my_udf(name, value)
>       3
>       4 def my_udf(name, value):
> ----> 5     if name == 'c':
>       6         return value * 2
>       7     return value
>
> /home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in
> __nonzero__(self)
>     425
>     426     def __nonzero__(self):
> --> 427         raise ValueError("Cannot convert column into bool: please
> use '&' for 'and', '|' for 'or', "
>     428                          "'~' for 'not' when building DataFrame
> boolean expressions.")
>     429     __bool__ = __nonzero__
>
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
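
To see why the traceback ends in __nonzero__, here is a rough sketch that
runs without Spark. FakeColumn is a hypothetical stand-in written for this
mail, not the real pyspark.sql.Column; it only mimics the two behaviours that
matter here, as far as the traceback shows them: "==" builds a new expression
instead of returning True/False, and truth-testing that expression raises.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (assumption, not the real class)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparing a column builds a new expression object, not a bool.
        return FakeColumn("({} = {!r})".format(self.expr, other))

    def __bool__(self):
        # Same idea as Column.__nonzero__ in the traceback above.
        raise ValueError("Cannot convert column into bool")

    __nonzero__ = __bool__  # Python 2 spelling of the same hook


def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value


# Plain Python values, which is what a wrapped UDF receives for each row:
print(my_udf('c', 3))
print(my_udf('a', 1))

# Column-like objects, which is what the failing call passed in directly:
try:
    my_udf(FakeColumn('name'), FakeColumn('value'))
except ValueError as e:
    print("ValueError:", e)
```

So the "if" inside my_udf is fine once it runs per row on real values; it only
breaks when the unwrapped function is handed whole columns. For a conditional
this simple, a column expression such as
F.when(df.name == 'c', df.value * 2).otherwise(df.value) may avoid the UDF
altogether.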
