I submitted a patch:

https://github.com/apache/spark/pull/4628
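
The change is roughly of this shape (a simplified sketch based on the snippet
quoted below, not the exact diff): bind sessionState and hiveconf separately
instead of destructuring a tuple, so the compiler has no synthetic Tuple2
field to serialize.

  import org.apache.hadoop.hive.conf.HiveConf
  import org.apache.hadoop.hive.ql.session.SessionState

  // Separate bindings: '@transient' now lands on real fields, and there is
  // no compiler-generated x$3 tuple to trip up Java serialization.
  @transient protected[hive] lazy val sessionState: SessionState =
    Option(SessionState.get()).getOrElse(new SessionState(new HiveConf(classOf[SessionState])))

  @transient protected[hive] lazy val hiveconf: HiveConf = {
    val conf = sessionState.getConf
    setConf(conf.getAllProperties) // Have SQLConf pick up the initial set of HiveConf.
    conf
  }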

On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> I was suggesting you mark the variable that is holding the HiveContext
> '@transient', since the Scala compiler is not correctly propagating this
> through the tuple extraction. This is only a workaround; we can also
> remove the tuple extraction.
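>
> For example (a sketch only; the SparkContext variable and the surrounding
> class are assumptions, not taken from the report below):
>
>   import org.apache.spark.SparkContext
>   import org.apache.spark.sql.hive.HiveContext
>
>   class StreamingJob(sc: SparkContext) extends Serializable {
>     // Mark the driver-side reference @transient so closure/checkpoint
>     // serialization skips the HiveContext entirely; it is not usable on
>     // the slaves anyway.
>     @transient lazy val hiveContext = new HiveContext(sc)
>   }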
>
> On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> Michael - it is already transient. This should probably be considered a bug
>> in the Scala compiler, but we can easily work around it by removing the use
>> of the destructuring binding.
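>>
>> For the curious, this reproduces without Spark at all. A minimal sketch
>> (class names are made up; behavior as observed on the 2.10-line compiler):
>>
>>   import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
>>
>>   class NotSer // stand-in for HiveConf: not java.io.Serializable
>>
>>   class WithTuple extends Serializable {
>>     // '@transient' is applied to the extracted vals but not to the
>>     // synthetic Tuple2 field the compiler generates for the pattern.
>>     @transient lazy val (a, b) = (new NotSer, "ok")
>>   }
>>
>>   class Separate extends Serializable {
>>     @transient lazy val a = new NotSer // no synthetic tuple field
>>     @transient lazy val b = "ok"
>>   }
>>
>>   object Repro extends App {
>>     def trySerialize(o: AnyRef): Unit =
>>       try {
>>         new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o)
>>         println("serialized fine")
>>       } catch {
>>         case e: NotSerializableException => println("failed on: " + e.getMessage)
>>       }
>>
>>     val t = new WithTuple
>>     t.a             // force the lazy val so the tuple field is populated
>>     trySerialize(t) // fails: the non-transient Tuple2 field drags NotSer along
>>
>>     val s = new Separate
>>     s.a
>>     trySerialize(s) // succeeds: all non-serializable state is transient
>>   }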
>>
>> On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> I'd suggest marking the HiveContext as @transient since it's not valid to
>>> use it on the slaves anyway.
>>>
>>> On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>>
>>> > While investigating this issue (described at the end of this email), I
>>> > took a look at HiveContext's code and found this change
>>> > (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782dda0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>>> >
>>> > -  @transient protected[hive] lazy val hiveconf = new HiveConf(classOf[SessionState])
>>> > -  @transient protected[hive] lazy val sessionState = {
>>> > -    val ss = new SessionState(hiveconf)
>>> > -    setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the initial set of HiveConf.
>>> > -    ss
>>> > -  }
>>> > +  @transient protected[hive] lazy val (hiveconf, sessionState) =
>>> > +    Option(SessionState.get())
>>> > +      .orElse {
>>> >
>>> > With the new change, the Scala compiler always generates a Tuple2 field
>>> > in HiveContext, as shown below:
>>> >
>>> >     private Tuple2 x$3;
>>> >     private transient OutputStream outputBuffer;
>>> >     private transient HiveConf hiveconf;
>>> >     private transient SessionState sessionState;
>>> >     private transient HiveMetastoreCatalog catalog;
>>> >
>>> > That "x$3" field is not transient, and its first element is a HiveConf
>>> > object, which cannot be serialized. So can you suggest how to resolve
>>> > this issue? Thank you very much!
>>> >
>>> > ================================
>>> >
>>> > I have a streaming application which registers a temp table on a
>>> > HiveContext for each batch duration.
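>>> >
>>> > The code is roughly like this (a simplified sketch; the class, method,
>>> > and table names are made up, not the real application):
>>> >
>>> >   import org.apache.spark.sql.hive.HiveContext
>>> >   import org.apache.spark.streaming.dstream.DStream
>>> >
>>> >   // The function passed to foreachRDD captures the HiveContext, so the
>>> >   // context ends up inside the DStream graph and is serialized when the
>>> >   // streaming checkpoint is written.
>>> >   def registerTempTable(stream: DStream[String], hiveContext: HiveContext): Unit = {
>>> >     stream.foreachRDD { rdd =>
>>> >       val schemaRdd = hiveContext.jsonRDD(rdd) // closure captures hiveContext
>>> >       schemaRdd.registerTempTable("batch_table")
>>> >     }
>>> >   }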
>>> >
>>> > The application runs well in Spark 1.1.0, but I get the error below
>>> > with 1.1.1.
>>> >
>>> > Do you have any suggestions to resolve it? Thank you!
>>> >
>>> > java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>>> >     - field (class "scala.Tuple2", name: "_1", type: "class java.lang.Object")
>>> >     - object (class "scala.Tuple2", (Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23, org.apache.hadoop.hive.ql.session.SessionState@49b6eef9))
>>> >     - field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3", type: "class scala.Tuple2")
>>> >     - object (class "org.apache.spark.sql.hive.HiveContext", org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>>> >     - field (class "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name: "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>>> >     - object (class "example.BaseQueryableDStream$$anonfun$registerTempTable$2", <function1>)
>>> >     - field (class "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1", name: "foreachFunc$1", type: "interface scala.Function1")
>>> >     - object (class "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1", <function2>)
>>> >     - field (class "org.apache.spark.streaming.dstream.ForEachDStream", name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc", type: "interface scala.Function2")
>>> >     - object (class "org.apache.spark.streaming.dstream.ForEachDStream", org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
>>> >     - element of array (index: 0)
>>> >     - array (class "[Ljava.lang.Object;", size: 16)
>>> >     - field (class "scala.collection.mutable.ArrayBuffer", name: "array", type: "class [Ljava.lang.Object;")
>>> >     - object (class "scala.collection.mutable.ArrayBuffer", ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))
>>> >     - field (class "org.apache.spark.streaming.DStreamGraph", name: "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
>>> >     - custom writeObject data (class "org.apache.spark.streaming.DStreamGraph")
>>> >     - object (class "org.apache.spark.streaming.DStreamGraph", org.apache.spark.streaming.DStreamGraph@776ae7da)
>>> >     - field (class "org.apache.spark.streaming.Checkpoint", name: "graph", type: "class org.apache.spark.streaming.DStreamGraph")
>>> >     - root object (class "org.apache.spark.streaming.Checkpoint", org.apache.spark.streaming.Checkpoint@5eade065)
>>> >     at java.io.ObjectOutputStream.writeObject0(Unknown Source)
>>> >
>>>
>>
>
