PR is https://github.com/apache/spark/pull/2074.

------------------------------
From: Yin Huai <huaiyin....@gmail.com>
Sent: 8/20/2014 10:56 PM
To: Vida Ha <v...@databricks.com>
Cc: tianyi <tia...@asiainfo.com>; Fengyun RAO <raofeng...@gmail.com>; user@spark.apache.org
Subject: Re: Got NotSerializableException when access broadcast variable
If you want to filter on the table name, you can use
hc.sql("show tables").filter(row => !"test".equals(row.getString(0))).
It seems that making functionRegistry transient fixes the error.

On Wed, Aug 20, 2014 at 8:53 PM, Vida Ha <v...@databricks.com> wrote:

> Hi,
>
> I doubt that the broadcast variable is your problem, since you are seeing:
>
> org.apache.spark.SparkException: Task not serializable
> Caused by: java.io.NotSerializableException: org.apache.spark.sql
> .hive.HiveContext$$anon$3
>
> We have a knowledge base article that explains why this happens - it's a
> very common error I see users trigger on the mailing list:
>
> https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/javaionotserializableexception.md
>
> Are you using the HiveContext within a transformation that is called on an
> RDD? That will definitely cause this problem.
>
> -Vida
>
> On Wed, Aug 20, 2014 at 1:20 AM, tianyi <tia...@asiainfo.com> wrote:
>
>> Thanks for the help.
>>
>> I ran this script again with "bin/spark-shell --conf
>> spark.serializer=org.apache.spark.serializer.KryoSerializer"
>>
>> In the console I can see:
>>
>> scala> sc.getConf.getAll.foreach(println)
>> (spark.tachyonStore.folderName,spark-eaabe986-03cb-41bd-bde5-993c7db3f048)
>> (spark.driver.host,10.1.51.127)
>> (spark.executor.extraJavaOptions,-Dsun.io.serialization.extendedDebugInfo=true)
>> (spark.serializer,org.apache.spark.serializer.KryoSerializer)
>> (spark.repl.class.uri,http://10.1.51.127:51319)
>> (spark.app.name,Spark shell)
>> (spark.driver.extraJavaOptions,-Dsun.io.serialization.extendedDebugInfo=true)
>> (spark.fileserver.uri,http://10.1.51.127:51322)
>> (spark.jars,)
>> (spark.driver.port,51320)
>> (spark.master,local[*])
>>
>> But it fails again with the same error.
>>
>> On Aug 20, 2014, at 15:59, Fengyun RAO <raofeng...@gmail.com> wrote:
>>
>> Try:
>>
>> sparkConf.set("spark.serializer",
>>   "org.apache.spark.serializer.KryoSerializer")
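To make Vida's point concrete, here is a minimal sketch, assuming the Spark 1.x HiveContext API used in this thread; the names `excluded` and `tables` are illustrative only. Any closure that mentions the HiveContext, directly or through an enclosing object, forces Spark to serialize the whole context and fails; pulling the plain values the predicate needs into local variables first, as Yin suggests above, avoids that:

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)

    // Problematic: mentioning hc inside the closure drags the whole
    // non-serializable context into the task.
    //   hc.sql("show tables").filter(row => hc != null)  // Task not serializable

    // Safer: capture only the plain String the predicate needs.
    val excluded = "test"
    val tables = hc.sql("show tables").filter(row => !excluded.equals(row.getString(0)))
    tables.collect().foreach(println)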
>> 2014-08-20 14:27 GMT+08:00 田毅 <tia...@asiainfo.com>:
>>
>>> Hi everyone!
>>>
>>> I got an exception when I ran my script with spark-shell.
>>>
>>> I added
>>>
>>> SPARK_JAVA_OPTS="-Dsun.io.serialization.extendedDebugInfo=true"
>>>
>>> in spark-env.sh to show the following stack:
>>>
>>> org.apache.spark.SparkException: Task not serializable
>>>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>>>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>>>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>>>   at org.apache.spark.rdd.RDD.filter(RDD.scala:282)
>>>   at org.apache.spark.sql.SchemaRDD.filter(SchemaRDD.scala:460)
>>>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:18)
>>>   at $iwC$$iwC$$iwC.<init>(<console>:23)
>>>   at $iwC$$iwC.<init>(<console>:25)
>>>   at $iwC.<init>(<console>:27)
>>>   at <init>(<console>:29)
>>>   at .<init>(<console>:33)
>>>   at .<clinit>(<console>)
>>>   at .<init>(<console>:7)
>>>   at .<clinit>(<console>)
>>>   at $print(<console>)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>   at java.lang.reflect.Method.invoke(Method.java:601)
>>>   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>>>   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>>>   ……
>>> Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.HiveContext$$anon$3
>>>   - field (class "org.apache.spark.sql.hive.HiveContext", name: "functionRegistry", type: "class org.apache.spark.sql.hive.HiveFunctionRegistry")
>>>   - object (class "org.apache.spark.sql.hive.HiveContext", org.apache.spark.sql.hive.HiveContext@4648e685)
>>>   - field (class "$iwC$$iwC$$iwC$$iwC", name: "hc", type: "class org.apache.spark.sql.hive.HiveContext")
>>>   - object (class "$iwC$$iwC$$iwC$$iwC", $iwC$$iwC$$iwC$$iwC@23d652ef)
>>>   - field (class "$iwC$$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC$$iwC")
>>>   - object (class "$iwC$$iwC$$iwC", $iwC$$iwC$$iwC@71cc14f1)
>>>   - field (class "$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC")
>>>   - object (class "$iwC$$iwC", $iwC$$iwC@74eca89e)
>>>   - field (class "$iwC", name: "$iw", type: "class $iwC$$iwC")
>>>   - object (class "$iwC", $iwC@685c4cc4)
>>>   - field (class "$line9.$read", name: "$iw", type: "class $iwC")
>>>   - object (class "$line9.$read", $line9.$read@519f9aae)
>>>   - field (class "$iwC$$iwC$$iwC", name: "$VAL7", type: "class $line9.$read")
>>>   - object (class "$iwC$$iwC$$iwC", $iwC$$iwC$$iwC@4b996858)
>>>   - field (class "$iwC$$iwC$$iwC$$iwC", name: "$outer", type: "class $iwC$$iwC$$iwC")
>>>   - object (class "$iwC$$iwC$$iwC$$iwC", $iwC$$iwC$$iwC$$iwC@31d646d4)
>>>   - field (class "$iwC$$iwC$$iwC$$iwC$$anonfun$1", name: "$outer", type: "class $iwC$$iwC$$iwC$$iwC")
>>>   - root object (class "$iwC$$iwC$$iwC$$iwC$$anonfun$1", <function1>)
>>>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
>>>
>>> I wrote a simple script to reproduce the problem.
>>>
>>> Case 1:
>>>
>>> val barr1 = sc.broadcast("test")
>>> val sret = sc.parallelize(1 to 10, 2)
>>> val ret = sret.filter(row => !barr1.equals("test"))
>>> ret.collect.foreach(println)
>>>
>>> It works fine in local mode and yarn-client mode.
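One detail in case 1 worth flagging: `barr1` is a `Broadcast[String]`, so `barr1.equals("test")` compares the Broadcast wrapper itself with a String and is always false; the broadcast payload is never read. A minimal variant that actually dereferences the value would look like the sketch below (the serialization failure in case 2 is unrelated to this detail):

    val barr1 = sc.broadcast("test")
    val sret = sc.parallelize(1 to 10, 2)
    // .value reads the broadcast payload on the executor; here it is
    // "test", so the negated predicate is always false and nothing is kept.
    val ret = sret.filter(n => !barr1.value.equals("test"))
    ret.collect().foreach(println)  // prints nothing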
>>> Case 2:
>>>
>>> val barr1 = sc.broadcast("test")
>>> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
>>> val sret = hc.sql("show tables")
>>> val ret = sret.filter(row => !barr1.equals("test"))
>>> ret.collect.foreach(println)
>>>
>>> It throws java.io.NotSerializableException:
>>> org.apache.spark.sql.hive.HiveContext
>>> in both local mode and yarn-client mode.
>>>
>>> But it works fine if I put the same code in a Scala file and run it in
>>> IntelliJ IDEA:
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> object TestBroadcast2 {
>>>   def main(args: Array[String]) {
>>>     val sparkConf = new SparkConf().setAppName("Broadcast Test").setMaster("local[3]")
>>>     val sc = new SparkContext(sparkConf)
>>>     val barr1 = sc.broadcast("test")
>>>     val hc = new org.apache.spark.sql.hive.HiveContext(sc)
>>>     val sret = hc.sql("show tables")
>>>     val ret = sret.filter(row => !barr1.equals("test"))
>>>     ret.collect.foreach(println)
>>>   }
>>> }
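The extendedDebugInfo trace above explains the difference between the two environments. In spark-shell, every line is compiled into a field of a generated wrapper object ($iwC), so the anonymous filter function's $outer chain reaches the `hc` field, and Java serialization then trips over HiveContext's non-serializable `functionRegistry` field, which is exactly what the PR linked at the top addresses by making that field transient. In a standalone `main`, `hc` is just a local variable that the closure never references, so nothing tries to serialize it. Until the fix lands, one commonly suggested shell-side workaround is sketched below; whether `@transient` on a REPL val is sufficient depends on the Spark version, so treat this as an assumption rather than a guarantee:

    // Sketch: mark the context @transient so the REPL wrapper object
    // skips it when closures are serialized, and keep the predicate
    // free of any reference to hc (per Yin's suggestion above).
    @transient val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    val ret = hc.sql("show tables").filter(row => !"test".equals(row.getString(0)))
    ret.collect().foreach(println)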