Hagai Attias created SPARK-26510:
------------------------------------

             Summary: Spark 2.3 change of behavior (vs 1.6) when caching a 
dataframe and using 'createOrReplaceTempView'
                 Key: SPARK-26510
                 URL: https://issues.apache.org/jira/browse/SPARK-26510
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Hagai Attias


It seems that there is a change of behavior between 1.6 and 2.3 when caching a 
DataFrame and registering it as a temp view. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even the same code as is) 
executes it twice.
 
{code:title=RegisterTest.scala|borderStyle=solid}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x: Int) => {
  print(x)
  x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() // materialize the cache

val df3 = df2.select("result")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the DataFrame twice. I 
managed to work around this by skipping the temporary view and selecting 
directly from the cached DataFrame, but I was wondering whether this is expected 
behavior / a known issue.
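For reference, the workaround mentioned above would look roughly like this (a sketch only; it assumes the same {{session}}, {{df1}}, and registered {{printUDF}} as in the snippet above, and uses {{functions.callUDF}} to invoke the registered UDF without going through the temp view):

{code:title=WorkaroundSketch.scala|borderStyle=solid}
import org.apache.spark.sql.functions.{callUDF, col}

// Build the projection directly on df1 instead of querying the temp view,
// then cache the result.
val cached = df1.select(callUDF("printUDF", col("num")).as("result")).cache()

cached.collect()                  // materializes the cache; printUDF runs once here
cached.select("result").collect() // served from the cached plan; printUDF is not re-run
{code}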
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
