[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34002:
--
Fix Version/s: (was: 3.1.0)
   3.1.1

> Broken UDF Encoding
> ---
>
> Key: SPARK-34002
> URL: https://issues.apache.org/jira/browse/SPARK-34002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
> Environment: Windows 10 Surface book 3
> Local Spark
> IntelliJ Idea
>  
>Reporter: Mark Hamilton
>Assignee: L. C. Hsieh
>Priority: Blocker
> Fix For: 3.1.1
>
>
> UDFs can behave differently depending on whether a DataFrame is cached, even 
> though the DataFrame is otherwise identical.
>  
> Repro:
>  
> {code:java}
> import org.apache.spark.sql.expressions.UserDefinedFunction 
> import org.apache.spark.sql.functions.{col, udf}
> case class Bar(a: Int)
>  
> import spark.implicits._
> def f1(bar: Bar): Option[Bar] = {
>  None
> }
> def f2(bar: Bar): Option[Bar] = {
>  Option(bar)
> }
> val udf1: UserDefinedFunction = udf(f1 _)
> val udf2: UserDefinedFunction = udf(f2 _)
> // Uncommenting the .cache() call below makes this example work
> val df = (1 to 10).map(i => Tuple1(Bar(1))).toDF("c0")//.cache()
> val newDf = df
>  .withColumn("c1", udf1(col("c0")))
>  .withColumn("c2", udf2(col("c1")))
> newDf.show()
> {code}
>  
> Error:
> Testing started at 12:58 AM ...
> "C:\Program Files\Java\jdk1.8.0_271\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2020.2.3\lib\idea_rt.jar=56657:C:\Program Files\JetBrains\IntelliJ IDEA 2020.2.3\bin" -Dfile.encoding=UTF-8 -classpath "C:\Users\marhamil\AppData\Roaming\JetBrains\IntelliJIdea2020.2\plugins\Scala\lib\runners.jar;C:\Program Files\Java\jdk1.8.0_271\jre\lib\charsets.jar;[remaining JRE classpath truncated]
> 

[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34002:
--
Target Version/s: 3.1.1


[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34002:
--
Priority: Blocker  (was: Major)


[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-09 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34002:

Affects Version/s: (was: 3.0.1)
   3.1.0


[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-09 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34002:

Affects Version/s: 3.2.0


[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34002:
-
Component/s: (was: Spark Core)
 SQL


[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34002:
-
Priority: Major  (was: Critical)


[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-04 Thread Mark Hamilton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamilton updated SPARK-34002:
--
Description: 

[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-04 Thread Mark Hamilton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamilton updated SPARK-34002:
--
Description: 
UDFs can behave differently depending on whether a DataFrame is cached, even 
though the DataFrame is otherwise identical.

 

Repro:

 
{code:java}
import org.apache.spark.sql.expressions.UserDefinedFunction 
import org.apache.spark.sql.functions.{col, udf}

case class Bar(a: Int)
 
import spark.implicits._

def f1(bar: Bar): Option[Bar] = {
 None
}

def f2(bar: Bar): Option[Bar] = {
 Option(bar)
}

val udf1: UserDefinedFunction = udf(f1 _)
val udf2: UserDefinedFunction = udf(f2 _)

// Uncommenting the .cache() call below makes this example work
val df = (1 to 10).map(i => Tuple1(Bar(1))).toDF("c0")//.cache()
val newDf = df
 .withColumn("c1", udf1(col("c0")))
 .withColumn("c2", udf2(col("c1")))
newDf.show()
{code}
 

Error:
{code:java}
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/01/05 00:52:57 INFO SparkContext: Running Spark version 3.0.1
21/01/05 00:52:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/01/05 00:52:57 INFO ResourceUtils: ==
21/01/05 00:52:57 INFO ResourceUtils: Resources for spark.driver:
21/01/05 00:52:57 INFO ResourceUtils: ==
21/01/05 00:52:57 INFO SparkContext: Submitted application: JsonOutputParserSuite
21/01/05 00:52:57 INFO SparkContext: Spark configuration:
spark.app.name=JsonOutputParserSuite
spark.driver.maxResultSize=6g
spark.logConf=true
spark.master=local[*]
spark.sql.crossJoin.enabled=true
spark.sql.shuffle.partitions=20
spark.sql.warehouse.dir=file:/code/mmlspark/spark-warehouse
21/01/05 00:52:58 INFO SecurityManager: Changing view acls to: marhamil
21/01/05 00:52:58 INFO SecurityManager: Changing modify acls to: marhamil
21/01/05 00:52:58 INFO SecurityManager: Changing view acls groups to: 
21/01/05 00:52:58 INFO SecurityManager: Changing modify acls groups to: 
21/01/05 00:52:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(marhamil); groups with view permissions: Set(); users with modify permissions: Set(marhamil); groups with modify permissions: Set()
21/01/05 00:52:58 INFO Utils: Successfully started service 'sparkDriver' on port 52315.
21/01/05 00:52:58 INFO SparkEnv: Registering MapOutputTracker
21/01/05 00:52:58 INFO SparkEnv: Registering BlockManagerMaster
21/01/05 00:52:58 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/01/05 00:52:58 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/01/05 00:52:58 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/01/05 00:52:58 INFO DiskBlockManager: Created local directory at C:\Users\marhamil\AppData\Local\Temp\blockmgr-9a5c80ef-ade6-41ac-9933-a26f6c291719
21/01/05 00:52:58 INFO MemoryStore: MemoryStore started with capacity 4.0 GiB
21/01/05 00:52:59 INFO SparkEnv: Registering OutputCommitCoordinator
21/01/05 00:52:59 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/01/05 00:52:59 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://host.docker.internal:4040
21/01/05 00:52:59 INFO Executor: Starting executor ID driver on host host.docker.internal
21/01/05 00:52:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52359.
21/01/05 00:52:59 INFO NettyBlockTransferService: Server created on host.docker.internal:52359
21/01/05 00:52:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/01/05 00:52:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, host.docker.internal, 52359, None)
21/01/05 00:52:59 INFO BlockManagerMasterEndpoint: Registering block manager host.docker.internal:52359 with 4.0 GiB RAM, BlockManagerId(driver, host.docker.internal, 52359, None)
21/01/05 00:52:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, host.docker.internal, 52359, None)
21/01/05 00:52:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, host.docker.internal, 52359, None)
21/01/05 00:53:00 WARN SharedState: Not allowing to set spark.sql.warehouse.dir or hive.metastore.warehouse.dir in SparkSession's options, it should be set statically for cross-session usages
Failed to execute user defined function(JsonOutputParserSuite$$Lambda$574/51376124: (struct) => struct)
org.apache.spark.SparkException: Failed to execute user defined function(JsonOutputParserSuite$$Lambda$574/51376124: (struct) => struct)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
	at
{code}

[jira] [Updated] (SPARK-34002) Broken UDF Encoding

2021-01-04 Thread Mark Hamilton (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamilton updated SPARK-34002:
--
Description: 