Re: UDF with column value comparison fails with PySpark

Perttu Ranta-aho Thu, 10 Nov 2016 11:48:24 -0800

So it was something obvious, thanks!

-Perttu


On Thu, 10 Nov 2016 at 21:19, Davies Liu <dav...@databricks.com> wrote:

> On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho <ranta...@iki.fi>
> wrote:
> > Hello,
> >
> > I want to create a UDF which modifies one column's value depending on
> > the value of another column. However, the Python version of the code
> > always fails at the column value comparison. Below are simple examples:
> > the Scala version works as expected, but the Python version throws an
> > exception. Am I missing something obvious? As can be seen from the
> > PySpark exception, I'm using Spark 2.0.1.
> >
> > -Perttu
> >
> > import org.apache.spark.sql.functions.udf
> > val df = spark.createDataFrame(List(("a",1), ("b",2), ("c",
> > 3))).withColumnRenamed("_1", "name").withColumnRenamed("_2", "value")
> > def myUdf = udf((name: String, value: Int) => {
> >   if (name == "c") { value * 2 } else { value }
> > })
> > df.withColumn("udf", myUdf(df("name"), df("value"))).show()
> > +----+-----+---+
> > |name|value|udf|
> > +----+-----+---+
> > |   a|    1|  1|
> > |   b|    2|  2|
> > |   c|    3|  6|
> > +----+-----+---+
> >
> >
> > from pyspark.sql.types import StringType, IntegerType
> > import pyspark.sql.functions as F
> >
> > df = sqlContext.createDataFrame((('a',1), ('b',2), ('c', 3)),
> > ('name','value'))
> >
> > def my_udf(name, value):
> >     if name == 'c':
> >         return value * 2
> >     return value
> > F.udf(my_udf, IntegerType())
>
> udf = F.udf(my_udf, IntegerType())
> df.withColumn("udf", udf(df.name, df.value)).show()
>
> >
> > df.withColumn("udf", my_udf(df.name, df.value)).show()
> >
> >
> > ---------------------------------------------------------------------------
> > ValueError                                Traceback (most recent call last)
> > <ipython-input-6-10032e941fc4> in <module>()
> > ----> 1 df.withColumn("udf", my_udf(df.name, df.value)).show()
> >
> > <ipython-input-5-c103a6066373> in my_udf(name, value)
> >       3
> >       4 def my_udf(name, value):
> > ----> 5     if name == 'c':
> >       6         return value * 2
> >       7     return value
> >
> > /home/ec2-user/spark-2.0.1-bin-hadoop2.4/python/pyspark/sql/column.pyc in __nonzero__(self)
> >     425
> >     426     def __nonzero__(self):
> > --> 427         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
> >     428                          "'~' for 'not' when building DataFrame boolean expressions.")
> >     429     __bool__ = __nonzero__
> >
> > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or',
> > '~' for 'not' when building DataFrame boolean expressions.
>
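
For anyone reading this thread later, here is a rough sketch of why the original call failed. It is a toy model in plain Python, not PySpark itself: ToyColumn and toy_udf are hypothetical names made up for illustration, and they only mimic the behaviour visible in the traceback above (comparing a column yields another column expression, and a column refuses to be converted to bool). The assumption is that the object returned by F.udf is what arranges for my_udf to receive plain row values, while calling my_udf directly hands it the column objects.

```python
class ToyColumn(object):
    """Hypothetical stand-in for a column expression (not PySpark's Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparing a column builds a new expression; it does not return True/False.
        return ToyColumn("({} = {!r})".format(self.expr, other))

    def __bool__(self):
        # Mirrors the behaviour shown in the traceback: no implicit truth value.
        raise ValueError("Cannot convert column into bool")

    __nonzero__ = __bool__  # Python 2 spelling, as in the traceback


def toy_udf(func):
    """Hypothetical wrapper: applies func to the plain values of each row."""
    def apply(rows):
        return [func(*row) for row in rows]
    return apply


def my_udf(name, value):
    if name == 'c':
        return value * 2
    return value


rows = [('a', 1), ('b', 2), ('c', 3)]

# Calling the plain function with column objects fails inside the `if`.
try:
    my_udf(ToyColumn("name"), ToyColumn("value"))
    error = None
except ValueError as exc:
    error = str(exc)

# Calling the wrapped function passes ordinary strings and ints to my_udf.
wrapped = toy_udf(my_udf)
result = wrapped(rows)
```

In the code from the thread, the corresponding fix is the one Davies gave: keep the object returned by F.udf(my_udf, IntegerType()) and pass that to withColumn, rather than calling my_udf on df.name and df.value directly.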
