In master branch, behavior is the same.

Suggest opening a JIRA if you haven't done so.

On Wed, May 11, 2016 at 6:55 AM, Tony Jin <linbojin...@gmail.com> wrote:

> Hi guys,
>
> I have a problem about spark DataFrame. My spark version is 1.6.1.
> Basically, i used udf and df.withColumn to create a "new" column, and then
> i filter the values on this new columns and call show(action). I see the
> udf function (which is used to by withColumn to create the new column) is
> called twice(duplicated). And if filter on "old" column, udf only run once
> which is expected. I attached the example codes, line 30~38 shows the
> problem.
>
>  Anyone knows the internal reason? Can you give me any advices? Thank you
> very much.
>
>
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> 10
> 11
> 12
> 13
> 14
> 15
> 16
> 17
> 18
> 19
> 20
> 21
> 22
> 23
> 24
> 25
> 26
> 27
> 28
> 29
> 30
> 31
> 32
> 33
> 34
> 35
> 36
> 37
> 38
> 39
> 40
> 41
> 42
> 43
> 44
> 45
> 46
> 47
>
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
>
> scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", 
> "b1"))).toDF("old","old1")
> df: org.apache.spark.sql.DataFrame = [old: string, old1: string]
>
> scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
> udfFunc: org.apache.spark.sql.UserDefinedFunction = 
> UserDefinedFunction(<function1>,StringType,List(StringType))
>
> scala> val newDF = df.withColumn("new", udfFunc(df("old")))
> newDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: 
> string]
>
> scala> newDF.show
> running udf(a)
> running udf(a1)
> +---+----+---+
> |old|old1|new|
> +---+----+---+
> |  a|   b|  a|
> | a1|  b1| a1|
> +---+----+---+
>
>
> scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
> filteredOnNewColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
> string, new: string]
>
> scala> val filteredOnOldColumnDF = newDF.filter("old <> 'a1'")
> filteredOnOldColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
> string, new: string]
>
> scala> filteredOnNewColumnDF.show
> running udf(a)
> running udf(a)
> running udf(a1)
> +---+----+---+
> |old|old1|new|
> +---+----+---+
> |  a|   b|  a|
> +---+----+---+
>
>
> scala> filteredOnOldColumnDF.show
> running udf(a)
> +---+----+---+
> |old|old1|new|
> +---+----+---+
> |  a|   b|  a|
> +---+----+---+
>
>
>
> Best wishes.
> By Linbo
>
>

Reply via email to