Hagai Attias created SPARK-26510:
------------------------------------

             Summary: Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'
                 Key: SPARK-26510
                 URL: https://issues.apache.org/jira/browse/SPARK-26510
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Hagai Attias
It seems that there is a change of behaviour between 1.6 and 2.3 when caching a DataFrame and saving it as a temp table. In 1.6, the following code executed {{printUDF}} once. The equivalent code in 2.3 (or even the same code as is) executes it twice.

{code:title=RegisterTest.scala|borderStyle=solid}
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)
val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x: Int) => { print(x); x })

val df2 = session.sql("select printUDF(num) result from data_table").cache()
df2.collect() // execute cache
val df3 = df2.select("result")
df3.collect()
{code}

1.6 prints {{123}} while 2.3 prints {{123123}}, i.e. the DataFrame is evaluated twice. I managed to overcome this by skipping the temporary table and selecting directly from the cached DataFrame, but was wondering whether that is expected behavior / a known issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
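For reference, one possible reading of the workaround ("skipping the temporary table and selecting directly from the cached dataframe") is sketched below. This is only an illustrative sketch, not the exact code used: it assumes the same {{session}} and {{df1}} setup as the snippet above, and swaps the SQL-registered UDF for a {{Column}}-based UDF via {{org.apache.spark.sql.functions.udf}} so the query never goes through the {{data_table}} temp view.

{code:title=WorkaroundSketch.scala|borderStyle=solid}
import org.apache.spark.sql.functions.{col, udf}

// Wrap the side-effecting function as a Column UDF instead of
// registering it as a SQL UDF.
val printUdf = udf((x: Int) => { print(x); x })

// Build df2 directly from df1, bypassing the "data_table" temp view.
val df2 = df1.select(printUdf(col("num")).as("result")).cache()
df2.collect() // materializes the cache; printUDF runs here

// Subsequent selects read from the cached plan rather than
// re-evaluating the UDF.
df2.select("result").collect()
{code}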