Reynold and Michael, thank you so much for the quick response.

 

This problem also happens on branch-1.1; would you mind resolving it on
branch-1.1 as well? Thanks again!

 

________________________________

From: Reynold Xin [mailto:r...@databricks.com] 
Sent: Tuesday, February 17, 2015 3:44 AM
To: Michael Armbrust
Cc: Haopu Wang; dev@spark.apache.org
Subject: Re: HiveContext cannot be serialized

 

I submitted a patch

 

https://github.com/apache/spark/pull/4628

 

On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust
<mich...@databricks.com> wrote:

I was suggesting that you mark the variable that holds the HiveContext
'@transient', since the Scala compiler is not correctly propagating the
annotation through the tuple extraction.  This is only a workaround.  We can
also remove the tuple extraction.
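Something along these lines (untested sketch; the class and field names are
only illustrative, not from your code):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // The fix is just the @transient on the field that holds the HiveContext,
    // so checkpoint/closure serialization skips it and it is re-created lazily
    // after deserialization.
    class BatchTableHelper(sc: SparkContext) extends Serializable {
      @transient lazy val hiveContext: HiveContext = new HiveContext(sc)
    }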

 

On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin <r...@databricks.com>
wrote:

Michael - it is already transient. This should probably be considered a bug
in the Scala compiler, but we can easily work around it by removing the
use of the destructuring binding.
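A sketch of that idea (illustrative only, not necessarily the exact change):

    // Instead of
    //   @transient protected[hive] lazy val (hiveconf, sessionState) = ...
    // define the two lazy vals separately, so the compiler does not generate
    // the synthetic, non-transient Tuple2 field that backs the extraction.
    @transient protected[hive] lazy val sessionState: SessionState =
      Option(SessionState.get()).getOrElse(new SessionState(new HiveConf(classOf[SessionState])))

    @transient protected[hive] lazy val hiveconf: HiveConf = sessionState.getConf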

 

On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust
<mich...@databricks.com> wrote:

I'd suggest marking the HiveContext as @transient since it's not valid to
use it on the slaves anyway.


On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang <hw...@qilinsoft.com> wrote:

> While investigating this issue (described at the end of this email), I took a
> look at HiveContext's code and found this change
> (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782dda0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>
>
>
> -  @transient protected[hive] lazy val hiveconf = new HiveConf(classOf[SessionState])
> -  @transient protected[hive] lazy val sessionState = {
> -    val ss = new SessionState(hiveconf)
> -    setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the initial set of HiveConf.
> -    ss
> -  }
> +  @transient protected[hive] lazy val (hiveconf, sessionState) =
> +    Option(SessionState.get())
> +      .orElse {
>
>
>
> With the new change, the Scala compiler always generates a Tuple2 field in
> HiveContext, as below:
>
>
>
>     private Tuple2 x$3;
>     private transient OutputStream outputBuffer;
>     private transient HiveConf hiveconf;
>     private transient SessionState sessionState;
>     private transient HiveMetastoreCatalog catalog;
>
>
>
> That "x$3" field's key is HiveConf object that cannot be serialized.
So
> can you suggest how to resolve this issue? Thank you very much!
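> For what it's worth, the same behavior can be reproduced standalone (untested
> sketch; the names are made up):
>
>     class Holder extends Serializable {
>       // The extractor pattern makes the compiler emit a synthetic Tuple2
>       // field (like x$1) that does not carry the @transient annotation.
>       @transient lazy val (name, stream) = ("demo", new java.io.ByteArrayOutputStream())
>     }
>
>     // Once the lazy val has been forced, serializing a Holder instance fails
>     // with NotSerializableException on the synthetic Tuple2 field.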
>
>
>
> ================================
>
>
>
> I have a streaming application which registers a temp table on a
> HiveContext for each batch duration.
>
> The application runs well in Spark 1.1.0, but I get the error below with
> 1.1.1.
>
> Do you have any suggestions to resolve it? Thank you!
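>
> The registration looks roughly like this (simplified sketch; the class, method,
> and table names are only illustrative):
>
>     import org.apache.spark.streaming.dstream.DStream
>     import org.apache.spark.sql.hive.HiveContext
>
>     case class Record(word: String, count: Int)
>
>     def register(lines: DStream[String], hiveContext: HiveContext): Unit = {
>       lines.foreachRDD { rdd =>
>         import hiveContext.createSchemaRDD   // implicit RDD -> SchemaRDD
>         rdd.map(w => Record(w, 1)).registerTempTable("words")
>       }
>       // The foreachRDD closure captures hiveContext, so checkpointing the
>       // DStreamGraph tries to serialize the HiveContext as well.
>     }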
>
>
>
> java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>     - field (class "scala.Tuple2", name: "_1", type: "class java.lang.Object")
>     - object (class "scala.Tuple2", (Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apache.hadoop.hive.ql.session.SessionState@49b6eef9))
>     - field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3", type: "class scala.Tuple2")
>     - object (class "org.apache.spark.sql.hive.HiveContext", org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>     - field (class "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name: "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>     - object (class "example.BaseQueryableDStream$$anonfun$registerTempTable$2", <function1>)
>     - field (class "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1", name: "foreachFunc$1", type: "interface scala.Function1")
>     - object (class "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1", <function2>)
>     - field (class "org.apache.spark.streaming.dstream.ForEachDStream", name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc", type: "interface scala.Function2")
>     - object (class "org.apache.spark.streaming.dstream.ForEachDStream", org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
>     - element of array (index: 0)
>     - array (class "[Ljava.lang.Object;", size: 16)
>     - field (class "scala.collection.mutable.ArrayBuffer", name: "array", type: "class [Ljava.lang.Object;")
>     - object (class "scala.collection.mutable.ArrayBuffer", ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))
>     - field (class "org.apache.spark.streaming.DStreamGraph", name: "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
>     - custom writeObject data (class "org.apache.spark.streaming.DStreamGraph")
>     - object (class "org.apache.spark.streaming.DStreamGraph", org.apache.spark.streaming.DStreamGraph@776ae7da)
>     - field (class "org.apache.spark.streaming.Checkpoint", name: "graph", type: "class org.apache.spark.streaming.DStreamGraph")
>     - root object (class "org.apache.spark.streaming.Checkpoint", org.apache.spark.streaming.Checkpoint@5eade065)
>     at java.io.ObjectOutputStream.writeObject0(Unknown Source)
>
>
>
>
>
>
>
>

 

 

 
