[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2020-09-22 Thread Igor Kamyshnikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200284#comment-17200284
 ] 

Igor Kamyshnikov commented on SPARK-20525:
--

I bet the issue is in the JDK, but it could be solved on the Scala side if they got rid of writeReplace/List$SerializationProxy. I've left some details [here|https://issues.apache.org/jira/browse/SPARK-19938?focusedCommentId=17200272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17200272] in SPARK-19938.
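
A minimal sketch of the mechanism being referred to, using simplified stand-in classes rather than the real scala.collection.immutable.List source: writeReplace substitutes a proxy object when a list is serialized, and readResolve is supposed to turn the proxy back into a real list on deserialization. The ClassCastException quoted below is what surfaces when the proxy ends up assigned to the target field without being resolved.

import java.io._

// Simplified stand-ins for List / List$SerializationProxy (illustration only).
class MyList(val elems: Array[String]) extends Serializable {
  // Serialization hook: write a proxy object in place of this instance.
  protected def writeReplace(): AnyRef = new MyListProxy(elems)
}

class MyListProxy(private val elems: Array[String]) extends Serializable {
  // Deserialization hook: replace the proxy with a rebuilt MyList.
  protected def readResolve(): AnyRef = new MyList(elems)
}

object ProxyRoundTrip extends App {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(new MyList(Array("foo", "bar")))
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  // Prints "MyList": the proxy was read, then resolved back into the real class.
  println(in.readObject().getClass.getSimpleName)
}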

> ClassCast exception when interpreting UDFs from a String in spark-shell
> ---
>
> Key: SPARK-20525
> URL: https://issues.apache.org/jira/browse/SPARK-20525
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.1.0
> Environment: OS X 10.11.6, spark-2.1.0-bin-hadoop2.7, Scala version 
> 2.11.8 (bundled w/ Spark), Java 1.8.0_121
>Reporter: Dave Knoester
>Priority: Major
>  Labels: bulk-closed
> Attachments: UdfTest.scala
>
>
> I'm trying to interpret a string containing Scala code from inside a Spark 
> session. Everything is working fine, except for User Defined Function-like 
> things (UDFs, map, flatMap, etc).  This is a blocker for production launch of 
> a large number of Spark jobs.
> I've been able to boil the problem down to a number of spark-shell examples, 
> shown below.  Because it's reproducible in the spark-shell, these related 
> issues **don't apply**:
> https://issues.apache.org/jira/browse/SPARK-9219
> https://issues.apache.org/jira/browse/SPARK-18075
> https://issues.apache.org/jira/browse/SPARK-19938
> http://apache-spark-developers-list.1001551.n3.nabble.com/This-Exception-has-been-really-hard-to-trace-td19362.html
> https://community.mapr.com/thread/21488-spark-error-scalacollectionseq-in-instance-of-orgapachesparkrddmappartitionsrdd
> https://github.com/scala/bug/issues/9237
> Any help is appreciated!
> 
> Repro: 
> Run each of the below from a spark-shell.  
> Preamble:
> import scala.tools.nsc.GenericRunnerSettings
> import scala.tools.nsc.interpreter.IMain
> val settings = new GenericRunnerSettings( println _ )
> settings.usejavacp.value = true
> val interpreter = new IMain(settings, new java.io.PrintWriter(System.out))
> interpreter.bind("spark", spark);
> These work:
> // works:
> interpreter.interpret("val x = 5")
> // works:
> interpreter.interpret("import spark.implicits._\nval df = 
> spark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.show")
> These do not work:
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF 
> = 
> udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  upperUDF($\"value\")).show")
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = 
> _.toUpperCase\nspark.udf.register(\"myUpper\", 
> upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  callUDF(\"myUpper\", ($\"value\"))).show")
> The not-working ones fail with this exception:
> Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2018-03-12 Thread UFO (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396372#comment-16396372
 ] 

UFO commented on SPARK-20525:
-

I have run into the same problem. Have you solved it?

[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2017-04-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990505#comment-15990505
 ] 

Sean Owen commented on SPARK-20525:
---

This is probably a classloader issue in the end too, which I believe is the nature of some of the other duplicates: the JVM doesn't recognize that a List can be assigned to a Seq because the two classes were loaded by different classloaders and so aren't the same class instances. Even if this isn't just a usage problem, unless you have a change to propose, I'd call this unsupported usage.
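
A minimal sketch of the effect described here, assuming a hypothetical jar at /tmp/demo.jar that contains a class demo.Foo: the same class loaded through two unrelated classloaders yields two distinct Class objects that are not assignment-compatible, which is essentially what the List-vs-Seq cast failure in this ticket reduces to.

import java.net.{URL, URLClassLoader}

object TwoLoaders extends App {
  // Hypothetical jar and class name, purely for illustration.
  val jar = Array(new URL("file:/tmp/demo.jar"))

  // parent = null, so neither loader delegates demo.Foo to a shared parent loader.
  val loaderA = new URLClassLoader(jar, null)
  val loaderB = new URLClassLoader(jar, null)

  val fooA = loaderA.loadClass("demo.Foo")
  val fooB = loaderB.loadClass("demo.Foo")

  // Same fully qualified name, but two different Class objects:
  println(fooA == fooB)                // false
  // ...so an instance created via one loader cannot be assigned to a
  // field or variable typed by the class from the other loader.
  println(fooA.isAssignableFrom(fooB)) // false
}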

[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2017-04-30 Thread Dave Knoester (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990489#comment-15990489
 ] 

Dave Knoester commented on SPARK-20525:
---

This is the official binary distribution; I didn't build it. After un-tarring, this is just ./bin/spark-shell.

I realize this isn't "normal" usage, but a quick Google search shows that it's not an uncommon request (interpreting a string of Scala inside Spark).

What's more, I believe it *should* be supported. In other words, what's breaking serde compatibility here?


[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2017-04-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990380#comment-15990380
 ] 

Sean Owen commented on SPARK-20525:
---

How did you build, and how did you run? This isn't standard spark-shell usage, because you're doing things like instantiating your own interpreter. That isn't supported, at least not by this project.

[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2017-04-30 Thread Dave Knoester (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990378#comment-15990378
 ] 

Dave Knoester commented on SPARK-20525:
---

I reopened it because the error is reproducible in spark-shell, using only the 
dependencies packaged with spark-2.1.0-bin-hadoop2.7, and specifically the 
bundled Scala (version 2.11.8).  Because of this, I believe the exception above 
is a new issue.




[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2017-04-28 Thread Dave Knoester (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989340#comment-15989340
 ] 

Dave Knoester commented on SPARK-20525:
---

In that case, spark-shell shouldn't be able to run this code, yet it does:

import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
val df = spark.sparkContext.parallelize(Seq("not interpreted","foo","bar")).toDF.withColumn("UPPER", upperUDF($"value"))
df.show()

[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2017-04-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989222#comment-15989222
 ] 

Sean Owen commented on SPARK-20525:
---

It's not really clear what you're reporting, but I don't expect Scala shell 
interpreter code to work in a distributed environment, right?  Please format 
and clarify the example.

> ClassCast exception when interpreting UDFs from a String in spark-shell
> ---
>
> Key: SPARK-20525
> URL: https://issues.apache.org/jira/browse/SPARK-20525
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.1.0
> Environment: OS X 10.11.6, spark-2.1.0-bin-hadoop2.7, Scala version 
> 2.11.8 (bundled w/ Spark), Java 1.8.0_121
>Reporter: Dave Knoester
>Priority: Blocker
>
> I'm trying to interpret a string containing Scala code from inside a Spark 
> session. Everything is working fine, except for User Defined Function-like 
> things (UDFs, map, flatMap, etc).  This is a blocker for production launch of 
> a large number of Spark jobs.
> I've been able to boil the problem down to a number of spark-shell examples, 
> shown below.  Because it's reproducible in the spark-shell, these related 
> issues **don't apply**:
> https://issues.apache.org/jira/browse/SPARK-9219
> https://issues.apache.org/jira/browse/SPARK-18075
> https://issues.apache.org/jira/browse/SPARK-19938
> http://apache-spark-developers-list.1001551.n3.nabble.com/This-Exception-has-been-really-hard-to-trace-td19362.html
> https://community.mapr.com/thread/21488-spark-error-scalacollectionseq-in-instance-of-orgapachesparkrddmappartitionsrdd
> https://github.com/scala/bug/issues/9237
> Any help is appreciated!
> 
> Repro: 
> Run each of the below from a spark-shell.  
> Preamble:
> import scala.tools.nsc.GenericRunnerSettings
> import scala.tools.nsc.interpreter.IMain
> val settings = new GenericRunnerSettings( println _ )
> settings.usejavacp.value = true
> val interpreter = new IMain(settings, new java.io.PrintWriter(System.out))
> interpreter.bind("spark", spark);
> These work:
> // works:
> interpreter.interpret("val x = 5")
> // works:
> interpreter.interpret("import spark.implicits._\nval df = 
> spark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.show")
> // works:
> val upper: String => String = _.toUpperCase
> spark.udf.register("myUpper", upper)
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF 
> = 
> udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  callUDF(\"myUpper\", ($\"value\"))).show")
> These do not work:
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF 
> = 
> udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  upperUDF($\"value\")).show")
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = 
> _.toUpperCase\nspark.udf.register(\"myUpper\", 
> upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  callUDF(\"myUpper\", ($\"value\"))).show")
> The not-working ones fail with this exception:
> Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)