org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1] failure: ``varchar'' expected but identifier char found in spark-sql

2015-02-16 Thread Qiuzhuang Lian
Hi,

I am not sure whether this has already been reported, but I ran into this
error in the spark-sql shell built from the latest Spark git trunk:

spark-sql> describe qiuzhuang_hcatlog_import;
15/02/17 14:38:36 ERROR SparkSQLDriver: Failed in [describe
qiuzhuang_hcatlog_import]
org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1]
failure: ``varchar'' expected but identifier char found

char(32)
^
at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
at org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:664)
at org.apache.spark.sql.hive.MetastoreRelation$$anonfun$23.apply(HiveMetastoreCatalog.scala:674)
at org.apache.spark.sql.hive.MetastoreRelation$$anonfun$23.apply(HiveMetastoreCatalog.scala:674)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.MetastoreRelation.<init>(HiveMetastoreCatalog.scala:674)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:185)
at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:234)

In the Hive 0.13.1 console, this command works:

hive> describe qiuzhuang_hcatlog_import;
OK
id              char(32)
assistant_no    varchar(20)
assistant_name  varchar(32)
assistant_type  int
grade           int
shop_no         varchar(20)
shop_name       varchar(64)
organ_no        varchar(20)
organ_name      varchar(20)
entry_date      string
education       int
commission      decimal(8,2)
tel             varchar(20)
address         varchar(100)
identity_card   varchar(25)
sex             int
birthday        string
employee_type   int
status          int
remark          varchar(255)
create_user_no  varchar(20)
create_user     varchar(32)
create_time     string
update_user_no  varchar(20)
update_user     varchar(32)
update_time     string
Time taken: 0.49 seconds, Fetched: 26 row(s)
hive>
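
Until the DDL parser accepts CHAR, one way to sidestep the error (a sketch only, not
something proposed in the thread; it assumes Hive views resolve cleanly from Spark SQL
in this build, the view name is made up, and only a few of the 26 columns are spelled
out) is to wrap the table in a Hive view that casts the CHAR column to STRING and point
Spark SQL at the view from the Scala shell:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: the existing SparkContext

// The view exposes id as STRING, so the type parser never has to handle char(32).
hiveContext.sql(
  """CREATE VIEW IF NOT EXISTS qiuzhuang_hcatlog_import_view AS
    |SELECT CAST(id AS STRING) AS id,
    |       assistant_no, assistant_name, assistant_type, grade
    |FROM qiuzhuang_hcatlog_import""".stripMargin)

hiveContext.sql("DESCRIBE qiuzhuang_hcatlog_import_view").collect().foreach(println)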


Regards,
Qiuzhuang


RE: HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
Reynold and Michael, thank you so much for the quick response.

 

This problem also happens on branch-1.1; would you mind resolving it on
branch-1.1 as well? Thanks again!

 



From: Reynold Xin [mailto:r...@databricks.com] 
Sent: Tuesday, February 17, 2015 3:44 AM
To: Michael Armbrust
Cc: Haopu Wang; dev@spark.apache.org
Subject: Re: HiveContext cannot be serialized

 

I submitted a patch

 

https://github.com/apache/spark/pull/4628

 

On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust
 wrote:

I was suggesting you mark the variable that is holding the HiveContext
'@transient' since the scala compiler is not correctly propagating this
through the tuple extraction.  This is only a workaround.  We can also
remove the tuple extraction.

 

On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin 
wrote:

Michael - it is already transient. This should probably be considered a bug
in the Scala compiler, but we can easily work around it by removing the
use of destructuring binding.

 

On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust
 wrote:

I'd suggest marking the HiveContext as @transient since it's not valid to
use it on the slaves anyway.


On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:

> When I'm investigating this issue (in the end of this email), I take a
> look at HiveContext's code and find this change
>
(https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
> da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>
>
>
> -  @transient protected[hive] lazy val hiveconf = new
> HiveConf(classOf[SessionState])
>
> -  @transient protected[hive] lazy val sessionState = {
>
> -val ss = new SessionState(hiveconf)
>
> -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
> initial set of HiveConf.
>
> -ss
>
> -  }
>
> +  @transient protected[hive] lazy val (hiveconf, sessionState) =
>
> +Option(SessionState.get())
>
> +  .orElse {
>
>
>
> With the new change, Scala compiler always generate a Tuple2 field of
> HiveContext as below:
>
>
>
> private Tuple2 x$3;
>
> private transient OutputStream outputBuffer;
>
> private transient HiveConf hiveconf;
>
> private transient SessionState sessionState;
>
> private transient HiveMetastoreCatalog catalog;
>
>
>
> That "x$3" field's key is HiveConf object that cannot be serialized.
So
> can you suggest how to resolve this issue? Thank you very much!
>
>
>
> 
>
>
>
> I have a streaming application which registered temp table on a
> HiveContext for each batch duration.
>
> The application runs well in Spark 1.1.0. But I get below error from
> 1.1.1.
>
> Do you have any suggestions to resolve it? Thank you!
>
>
>
> java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>
> - field (class "scala.Tuple2", name: "_1", type: "class
> java.lang.Object")
>
> - object (class "scala.Tuple2", (Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
>
org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apa
> che.hadoop.hive.ql.session.SessionState@49b6eef9))
>
> - field (class "org.apache.spark.sql.hive.HiveContext", name:
"x$3",
> type: "class scala.Tuple2")
>
> - object (class "org.apache.spark.sql.hive.HiveContext",
> org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>
> - field (class
> "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
> "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>
>- object (class
> "example.BaseQueryableDStream$$anonfun$registerTempTable$2",
> )
>
> - field (class
> "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> name: "foreachFunc$1", type: "interface scala.Function1")
>
> - object (class
> "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> )
>
> - field (class
"org.apache.spark.streaming.dstream.ForEachDStream",
> name:
"org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
> type: "interface scala.Function2")
>
> - object (class
"org.apache.spark.streaming.dstream.ForEachDStream",
> org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
>
> - element of array (index: 0)
>
> - array (class "[Ljava.lang.Object;", size: 16)
>
> - field (class "scala.collection.mutable.ArrayBuffer", name:
> "array", type: "class [Ljava.lang.Object;")
>
> - object (class "scala.collection.mutable.ArrayBuffer",
>
ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))
>
> - field (class "org.apache.spark.streaming.DStreamGraph", name:
> "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
>
> - custom writeObject data (class
> "org.apache.spark.streaming.DStreamGraph")
>
> - object (class "org.apache.spark.streaming.DStreamGraph",
> org.apache.spark.streaming.DStreamGraph@776ae7da)
>
> - field (class "org.apache.spark.streaming.Checkpoi

Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
I submitted a patch

https://github.com/apache/spark/pull/4628

On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust 
wrote:

> I was suggesting you mark the variable that is holding the HiveContext
> '@transient' since the scala compiler is not correctly propagating this
> through the tuple extraction.  This is only a workaround.  We can also
> remove the tuple extraction.
>
> On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin  wrote:
>
>> Michael - it is already transient. This should probably considered a bug
>> in the scala compiler, but we can easily work around it by removing the use
>> of destructuring binding.
>>
>> On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> I'd suggest marking the HiveContext as @transient since its not valid to
>>> use it on the slaves anyway.
>>>
>>> On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:
>>>
>>> > When I'm investigating this issue (in the end of this email), I take a
>>> > look at HiveContext's code and find this change
>>> > (
>>> https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
>>> > da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>>> >
>>> >
>>> >
>>> > -  @transient protected[hive] lazy val hiveconf = new
>>> > HiveConf(classOf[SessionState])
>>> >
>>> > -  @transient protected[hive] lazy val sessionState = {
>>> >
>>> > -val ss = new SessionState(hiveconf)
>>> >
>>> > -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
>>> > initial set of HiveConf.
>>> >
>>> > -ss
>>> >
>>> > -  }
>>> >
>>> > +  @transient protected[hive] lazy val (hiveconf, sessionState) =
>>> >
>>> > +Option(SessionState.get())
>>> >
>>> > +  .orElse {
>>> >
>>> >
>>> >
>>> > With the new change, Scala compiler always generate a Tuple2 field of
>>> > HiveContext as below:
>>> >
>>> >
>>> >
>>> > private Tuple2 x$3;
>>> >
>>> > private transient OutputStream outputBuffer;
>>> >
>>> > private transient HiveConf hiveconf;
>>> >
>>> > private transient SessionState sessionState;
>>> >
>>> > private transient HiveMetastoreCatalog catalog;
>>> >
>>> >
>>> >
>>> > That "x$3" field's key is HiveConf object that cannot be serialized. So
>>> > can you suggest how to resolve this issue? Thank you very much!
>>> >
>>> >
>>> >
>>> > 
>>> >
>>> >
>>> >
>>> > I have a streaming application which registered temp table on a
>>> > HiveContext for each batch duration.
>>> >
>>> > The application runs well in Spark 1.1.0. But I get below error from
>>> > 1.1.1.
>>> >
>>> > Do you have any suggestions to resolve it? Thank you!
>>> >
>>> >
>>> >
>>> > java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>>> >
>>> > - field (class "scala.Tuple2", name: "_1", type: "class
>>> > java.lang.Object")
>>> >
>>> > - object (class "scala.Tuple2", (Configuration: core-default.xml,
>>> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
>>> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
>>> > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23
>>> ,org.apa
>>> > che.hadoop.hive.ql.session.SessionState@49b6eef9))
>>> >
>>> > - field (class "org.apache.spark.sql.hive.HiveContext", name:
>>> "x$3",
>>> > type: "class scala.Tuple2")
>>> >
>>> > - object (class "org.apache.spark.sql.hive.HiveContext",
>>> > org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>>> >
>>> > - field (class
>>> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
>>> > "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>>> >
>>> >- object (class
>>> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2",
>>> > )
>>> >
>>> > - field (class
>>> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
>>> > name: "foreachFunc$1", type: "interface scala.Function1")
>>> >
>>> > - object (class
>>> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
>>> > )
>>> >
>>> > - field (class "org.apache.spark.streaming.dstream.ForEachDStream",
>>> > name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
>>> > type: "interface scala.Function2")
>>> >
>>> > - object (class
>>> "org.apache.spark.streaming.dstream.ForEachDStream",
>>> > org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
>>> >
>>> > - element of array (index: 0)
>>> >
>>> > - array (class "[Ljava.lang.Object;", size: 16)
>>> >
>>> > - field (class "scala.collection.mutable.ArrayBuffer", name:
>>> > "array", type: "class [Ljava.lang.Object;")
>>> >
>>> > - object (class "scala.collection.mutable.ArrayBuffer",
>>> > ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20
>>> ))
>>> >
>>> > - field (class "org.apache.spark.streaming.DStreamGraph", name:
>>> > "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
>>> >
>>> > - custom writeObject data (class
>>> > "org.apache.spark.streaming.DStreamGraph")
>>> >
>>> > - o

Re: HiveContext cannot be serialized

2015-02-16 Thread Michael Armbrust
I was suggesting you mark the variable that is holding the HiveContext
'@transient' since the scala compiler is not correctly propagating this
through the tuple extraction.  This is only a workaround.  We can also
remove the tuple extraction.
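
In user code, that workaround looks roughly like the sketch below (class and method
names are made up; this is not code from the thread): keep the HiveContext in a
@transient lazy field so that serializing the enclosing object, for example as part of
a streaming checkpoint, never touches the HiveContext or the HiveConf inside it.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Illustrative sketch only: the HiveContext is created lazily on the driver and,
// being transient, is skipped entirely when this object is serialized.
class TempTableRegistrar(@transient private val sc: SparkContext) extends Serializable {

  @transient private lazy val hiveContext = new HiveContext(sc)

  // Driver-side use only.
  def showTables(): Array[String] =
    hiveContext.sql("SHOW TABLES").collect().map(_.toString)
}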

On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin  wrote:

> Michael - it is already transient. This should probably considered a bug
> in the scala compiler, but we can easily work around it by removing the use
> of destructuring binding.
>
> On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust  > wrote:
>
>> I'd suggest marking the HiveContext as @transient since its not valid to
>> use it on the slaves anyway.
>>
>> On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:
>>
>> > When I'm investigating this issue (in the end of this email), I take a
>> > look at HiveContext's code and find this change
>> > (
>> https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
>> > da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>> >
>> >
>> >
>> > -  @transient protected[hive] lazy val hiveconf = new
>> > HiveConf(classOf[SessionState])
>> >
>> > -  @transient protected[hive] lazy val sessionState = {
>> >
>> > -val ss = new SessionState(hiveconf)
>> >
>> > -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
>> > initial set of HiveConf.
>> >
>> > -ss
>> >
>> > -  }
>> >
>> > +  @transient protected[hive] lazy val (hiveconf, sessionState) =
>> >
>> > +Option(SessionState.get())
>> >
>> > +  .orElse {
>> >
>> >
>> >
>> > With the new change, Scala compiler always generate a Tuple2 field of
>> > HiveContext as below:
>> >
>> >
>> >
>> > private Tuple2 x$3;
>> >
>> > private transient OutputStream outputBuffer;
>> >
>> > private transient HiveConf hiveconf;
>> >
>> > private transient SessionState sessionState;
>> >
>> > private transient HiveMetastoreCatalog catalog;
>> >
>> >
>> >
>> > That "x$3" field's key is HiveConf object that cannot be serialized. So
>> > can you suggest how to resolve this issue? Thank you very much!
>> >
>> >
>> >
>> > 
>> >
>> >
>> >
>> > I have a streaming application which registered temp table on a
>> > HiveContext for each batch duration.
>> >
>> > The application runs well in Spark 1.1.0. But I get below error from
>> > 1.1.1.
>> >
>> > Do you have any suggestions to resolve it? Thank you!
>> >
>> >
>> >
>> > java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>> >
>> > - field (class "scala.Tuple2", name: "_1", type: "class
>> > java.lang.Object")
>> >
>> > - object (class "scala.Tuple2", (Configuration: core-default.xml,
>> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
>> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
>> > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23
>> ,org.apa
>> > che.hadoop.hive.ql.session.SessionState@49b6eef9))
>> >
>> > - field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3",
>> > type: "class scala.Tuple2")
>> >
>> > - object (class "org.apache.spark.sql.hive.HiveContext",
>> > org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>> >
>> > - field (class
>> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
>> > "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>> >
>> >- object (class
>> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2",
>> > )
>> >
>> > - field (class
>> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
>> > name: "foreachFunc$1", type: "interface scala.Function1")
>> >
>> > - object (class
>> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
>> > )
>> >
>> > - field (class "org.apache.spark.streaming.dstream.ForEachDStream",
>> > name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
>> > type: "interface scala.Function2")
>> >
>> > - object (class "org.apache.spark.streaming.dstream.ForEachDStream",
>> > org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
>> >
>> > - element of array (index: 0)
>> >
>> > - array (class "[Ljava.lang.Object;", size: 16)
>> >
>> > - field (class "scala.collection.mutable.ArrayBuffer", name:
>> > "array", type: "class [Ljava.lang.Object;")
>> >
>> > - object (class "scala.collection.mutable.ArrayBuffer",
>> > ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20
>> ))
>> >
>> > - field (class "org.apache.spark.streaming.DStreamGraph", name:
>> > "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
>> >
>> > - custom writeObject data (class
>> > "org.apache.spark.streaming.DStreamGraph")
>> >
>> > - object (class "org.apache.spark.streaming.DStreamGraph",
>> > org.apache.spark.streaming.DStreamGraph@776ae7da)
>> >
>> > - field (class "org.apache.spark.streaming.Checkpoint", name:
>> > "graph", type: "class org.apache.spark.streaming.DStreamGraph")
>> >
>> > - root object (class "org.apache.spa

Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
Michael - it is already transient. This should probably be considered a bug in
the Scala compiler, but we can easily work around it by removing the use of
destructuring binding.
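
Concretely, removing the destructuring binding amounts to something like the toy
sketch below (NotSer stands in for HiveConf; this is not the actual change in the
patch): bind the pair through an ordinary @transient lazy val and read the two halves
out of it, so scalac never generates a synthetic, non-transient tuple field.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class NotSer   // stand-in for the non-serializable HiveConf

class FixedHolder extends Serializable {
  // A plain (non-pattern) definition, so the generated backing field really is transient.
  @transient private lazy val confAndState: (NotSer, String) = (new NotSer, "session")

  @transient lazy val conf: NotSer  = confAndState._1
  @transient lazy val state: String = confAndState._2
}

object FixedHolderCheck {
  def main(args: Array[String]): Unit = {
    val h = new FixedHolder
    h.conf   // force initialization, as HiveContext does with hiveconf/sessionState
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(h)   // no exception
    println("FixedHolder serialized without error")
  }
}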

On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust 
wrote:

> I'd suggest marking the HiveContext as @transient since its not valid to
> use it on the slaves anyway.
>
> On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:
>
> > When I'm investigating this issue (in the end of this email), I take a
> > look at HiveContext's code and find this change
> > (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
> > da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
> >
> >
> >
> > -  @transient protected[hive] lazy val hiveconf = new
> > HiveConf(classOf[SessionState])
> >
> > -  @transient protected[hive] lazy val sessionState = {
> >
> > -val ss = new SessionState(hiveconf)
> >
> > -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
> > initial set of HiveConf.
> >
> > -ss
> >
> > -  }
> >
> > +  @transient protected[hive] lazy val (hiveconf, sessionState) =
> >
> > +Option(SessionState.get())
> >
> > +  .orElse {
> >
> >
> >
> > With the new change, Scala compiler always generate a Tuple2 field of
> > HiveContext as below:
> >
> >
> >
> > private Tuple2 x$3;
> >
> > private transient OutputStream outputBuffer;
> >
> > private transient HiveConf hiveconf;
> >
> > private transient SessionState sessionState;
> >
> > private transient HiveMetastoreCatalog catalog;
> >
> >
> >
> > That "x$3" field's key is HiveConf object that cannot be serialized. So
> > can you suggest how to resolve this issue? Thank you very much!
> >
> >
> >
> > 
> >
> >
> >
> > I have a streaming application which registered temp table on a
> > HiveContext for each batch duration.
> >
> > The application runs well in Spark 1.1.0. But I get below error from
> > 1.1.1.
> >
> > Do you have any suggestions to resolve it? Thank you!
> >
> >
> >
> > java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
> >
> > - field (class "scala.Tuple2", name: "_1", type: "class
> > java.lang.Object")
> >
> > - object (class "scala.Tuple2", (Configuration: core-default.xml,
> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
> > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apa
> > che.hadoop.hive.ql.session.SessionState@49b6eef9))
> >
> > - field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3",
> > type: "class scala.Tuple2")
> >
> > - object (class "org.apache.spark.sql.hive.HiveContext",
> > org.apache.spark.sql.hive.HiveContext@4e6e66a4)
> >
> > - field (class
> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
> > "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
> >
> >- object (class
> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2",
> > )
> >
> > - field (class
> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> > name: "foreachFunc$1", type: "interface scala.Function1")
> >
> > - object (class
> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> > )
> >
> > - field (class "org.apache.spark.streaming.dstream.ForEachDStream",
> > name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
> > type: "interface scala.Function2")
> >
> > - object (class "org.apache.spark.streaming.dstream.ForEachDStream",
> > org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
> >
> > - element of array (index: 0)
> >
> > - array (class "[Ljava.lang.Object;", size: 16)
> >
> > - field (class "scala.collection.mutable.ArrayBuffer", name:
> > "array", type: "class [Ljava.lang.Object;")
> >
> > - object (class "scala.collection.mutable.ArrayBuffer",
> > ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))
> >
> > - field (class "org.apache.spark.streaming.DStreamGraph", name:
> > "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
> >
> > - custom writeObject data (class
> > "org.apache.spark.streaming.DStreamGraph")
> >
> > - object (class "org.apache.spark.streaming.DStreamGraph",
> > org.apache.spark.streaming.DStreamGraph@776ae7da)
> >
> > - field (class "org.apache.spark.streaming.Checkpoint", name:
> > "graph", type: "class org.apache.spark.streaming.DStreamGraph")
> >
> > - root object (class "org.apache.spark.streaming.Checkpoint",
> > org.apache.spark.streaming.Checkpoint@5eade065)
> >
> > at java.io.ObjectOutputStream.writeObject0(Unknown Source)
> >
> >
> >
> >
> >
> >
> >
> >
>


Re: Building Spark with Pants

2015-02-16 Thread Ryan Williams
I worked on Pants at Foursquare for a while and when coming up to speed on
Spark was interested in the possibility of building it with Pants,
particularly because allowing developers to share/reuse each others'
compilation artifacts seems like it would be a boon to productivity; that
was/is Pants' "killer feature" for Foursquare, as mentioned on the
pants-devel thread.

Given the monumental nature of the task of making Spark build with Pants,
most of my enthusiasm was deflected to SPARK-1517, which deals with
publishing nightly builds (or better, exposing all assembly JARs built by
Jenkins?) that people could use rather than having to assemble their own.

Anyway, it's an intriguing idea, Nicholas, I'm glad you are pursuing it!

On Sat Feb 14 2015 at 4:21:16 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> FYI: Here is the matching discussion over on the Pants dev list.
> 
>
> On Mon Feb 02 2015 at 4:50:33 PM Nicholas Chammas
> nicholas.cham...@gmail.com
>  wrote:
>
> To reiterate, I'm asking from an experimental perspective. I'm not
> > proposing we change Spark to build with Pants or anything like that.
> >
> > I'm interested in trying Pants out and I'm wondering if anyone else
> shares
> > my interest or already has experience with Pants that they can share.
> >
> > On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas <
> > nicholas.cham...@gmail.com> wrote:
> >
> >> I'm asking from an experimental standpoint; this is not happening
> anytime
> >> soon.
> >>
> >> Of course, if the experiment turns out very well, Pants would replace
> >> both sbt and Maven (like it has at Twitter, for example). Pants also
> >> works with IDEs.
> >>
> >> On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch 
> >> wrote:
> >>
> >>> There is a significant investment in sbt and maven - and they are not at
> >>> all likely to be going away. A third build tool?  Note that there is also
> >>> the perspective of building within an IDE - which actually works presently
> >>> for sbt and with a little bit of tweaking with maven as well.
> >>>
> >>> 2015-02-02 16:25 GMT-08:00 Nicholas Chammas <nicholas.cham...@gmail.com>:
> >>>
>  Does anyone here have experience with Pants or interest in trying to
>  build Spark with it?
> 
>  Pants has an interesting story. It was born at Twitter to help them build
>  their Scala, Java, and Python projects as several independent components in
>  one monolithic repo. (It was inspired by a similar build tool at Google
>  called blaze.) The mix of languages and sub-projects at Twitter seems
>  similar to the breakdown we have in Spark.
> 
>  Pants has an interesting take on how a build system should work, and
>  Twitter and Foursquare (who use Pants as their primary build tool) claim it
>  helps enforce better build hygiene and maintainability.
> 
>  Some relevant talks:
> 
> - Building Scala Hygienically with Pants
> - The Pants Build Tool at Twitter
> - Getting Started with the Pants Build System: Why Pants?
> 
>  At some point I may take a shot at converting Spark to use Pants as an
>  experiment and just see what it’s like.
> 
>  Nick
>


Re: HiveContext cannot be serialized

2015-02-16 Thread Michael Armbrust
I'd suggest marking the HiveContext as @transient since it's not valid to
use it on the slaves anyway.

On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:

> When I'm investigating this issue (in the end of this email), I take a
> look at HiveContext's code and find this change
> (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
> da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>
>
>
> -  @transient protected[hive] lazy val hiveconf = new
> HiveConf(classOf[SessionState])
>
> -  @transient protected[hive] lazy val sessionState = {
>
> -val ss = new SessionState(hiveconf)
>
> -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
> initial set of HiveConf.
>
> -ss
>
> -  }
>
> +  @transient protected[hive] lazy val (hiveconf, sessionState) =
>
> +Option(SessionState.get())
>
> +  .orElse {
>
>
>
> With the new change, Scala compiler always generate a Tuple2 field of
> HiveContext as below:
>
>
>
> private Tuple2 x$3;
>
> private transient OutputStream outputBuffer;
>
> private transient HiveConf hiveconf;
>
> private transient SessionState sessionState;
>
> private transient HiveMetastoreCatalog catalog;
>
>
>
> That "x$3" field's key is HiveConf object that cannot be serialized. So
> can you suggest how to resolve this issue? Thank you very much!
>
>
>
> 
>
>
>
> I have a streaming application which registered temp table on a
> HiveContext for each batch duration.
>
> The application runs well in Spark 1.1.0. But I get below error from
> 1.1.1.
>
> Do you have any suggestions to resolve it? Thank you!
>
>
>
> java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>
> - field (class "scala.Tuple2", name: "_1", type: "class
> java.lang.Object")
>
> - object (class "scala.Tuple2", (Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apa
> che.hadoop.hive.ql.session.SessionState@49b6eef9))
>
> - field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3",
> type: "class scala.Tuple2")
>
> - object (class "org.apache.spark.sql.hive.HiveContext",
> org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>
> - field (class
> "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
> "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>
>- object (class
> "example.BaseQueryableDStream$$anonfun$registerTempTable$2",
> )
>
> - field (class
> "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> name: "foreachFunc$1", type: "interface scala.Function1")
>
> - object (class
> "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> )
>
> - field (class "org.apache.spark.streaming.dstream.ForEachDStream",
> name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
> type: "interface scala.Function2")
>
> - object (class "org.apache.spark.streaming.dstream.ForEachDStream",
> org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
>
> - element of array (index: 0)
>
> - array (class "[Ljava.lang.Object;", size: 16)
>
> - field (class "scala.collection.mutable.ArrayBuffer", name:
> "array", type: "class [Ljava.lang.Object;")
>
> - object (class "scala.collection.mutable.ArrayBuffer",
> ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))
>
> - field (class "org.apache.spark.streaming.DStreamGraph", name:
> "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
>
> - custom writeObject data (class
> "org.apache.spark.streaming.DStreamGraph")
>
> - object (class "org.apache.spark.streaming.DStreamGraph",
> org.apache.spark.streaming.DStreamGraph@776ae7da)
>
> - field (class "org.apache.spark.streaming.Checkpoint", name:
> "graph", type: "class org.apache.spark.streaming.DStreamGraph")
>
> - root object (class "org.apache.spark.streaming.Checkpoint",
> org.apache.spark.streaming.Checkpoint@5eade065)
>
> at java.io.ObjectOutputStream.writeObject0(Unknown Source)
>
>
>
>
>
>
>
>


Spark Receivers

2015-02-16 Thread Mark Payne
Hello,


I am one of the committers for Apache NiFi (incubating). I am looking to 
integrate NiFi with Spark streaming. I have created a custom Receiver to 
receive data from NiFi. I’ve tested it locally, and things seem to work well.
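
For anyone following along, a custom Receiver is a small subclass of
org.apache.spark.streaming.receiver.Receiver. A generic skeleton looks like this (a
sketch only, not the actual NiFi receiver; fetchBatch() is a placeholder for whatever
client call pulls data from the external system):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Generic receiver skeleton: a background thread polls an external system and
// hands each record to Spark Streaming with store().
class PollingReceiver(pollIntervalMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("polling-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          fetchBatch().foreach(record => store(record))   // hand each record to Spark
          Thread.sleep(pollIntervalMs)
        }
      }
    }.start()
  }

  def onStop(): Unit = {}   // the polling loop exits once isStopped() returns true

  private def fetchBatch(): Seq[String] = Seq.empty   // placeholder for the real client call
}

Such a receiver is wired into a job with ssc.receiverStream(new PollingReceiver(1000)).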


I feel it would make more sense to have the NiFi Receiver in the Spark codebase
alongside the code for Flume, Kafka, etc., as this is where people are more
likely to look to see what integrations are available. Looking there, though, 
it seems that all of those are “fully integrated” into Spark, rather than being 
simple Receivers.


Is Spark interested in housing the code for Receivers to interact with other 
services, or should this just reside in the NiFi codebase?


Thanks for any pointers

-Mark

HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
While investigating this issue (described at the end of this email), I took a
look at HiveContext's code and found this change
(https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782dda0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):

 

-  @transient protected[hive] lazy val hiveconf = new
HiveConf(classOf[SessionState])

-  @transient protected[hive] lazy val sessionState = {

-val ss = new SessionState(hiveconf)

-setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
initial set of HiveConf.

-ss

-  }

+  @transient protected[hive] lazy val (hiveconf, sessionState) =

+Option(SessionState.get())

+  .orElse {

 

With the new change, the Scala compiler always generates a Tuple2 field in
HiveContext, as below:

 

private Tuple2 x$3;

private transient OutputStream outputBuffer;

private transient HiveConf hiveconf;

private transient SessionState sessionState;

private transient HiveMetastoreCatalog catalog;

 

The first element of that "x$3" tuple field is a HiveConf object, which cannot
be serialized. So can you suggest how to resolve this issue? Thank you very much!
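
The behaviour can be reproduced outside Spark with a small sketch like this one
(NotSer plays the role of HiveConf; whether it actually fails depends on the Scala
compiler version, since it hinges on how annotations on pattern definitions are
handled):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class NotSer   // stand-in for org.apache.hadoop.hive.conf.HiveConf (not Serializable)

class Holder extends Serializable {
  // The pattern definition makes scalac emit a synthetic Tuple2 field (the
  // analogue of "x$3"), and @transient is not propagated to it.
  @transient lazy val (conf, state) = (new NotSer, "session")
}

object Repro {
  def main(args: Array[String]): Unit = {
    val h = new Holder
    h.conf   // force the lazy initializer so the synthetic tuple field is populated
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(h)
    // expected here: java.io.NotSerializableException: NotSer
  }
}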

 



 

I have a streaming application which registers a temp table on a
HiveContext for each batch duration.

The application runs well in Spark 1.1.0, but I get the error below from
1.1.1.

Do you have any suggestions to resolve it? Thank you!

 

java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf

- field (class "scala.Tuple2", name: "_1", type: "class
java.lang.Object")

- object (class "scala.Tuple2", (Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apache.hadoop.hive.ql.session.SessionState@49b6eef9))

- field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3",
type: "class scala.Tuple2")

- object (class "org.apache.spark.sql.hive.HiveContext",
org.apache.spark.sql.hive.HiveContext@4e6e66a4)

- field (class
"example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
"sqlContext$1", type: "class org.apache.spark.sql.SQLContext")

   - object (class
"example.BaseQueryableDStream$$anonfun$registerTempTable$2",
)

- field (class
"org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
name: "foreachFunc$1", type: "interface scala.Function1")

- object (class
"org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
)

- field (class "org.apache.spark.streaming.dstream.ForEachDStream",
name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
type: "interface scala.Function2")

- object (class "org.apache.spark.streaming.dstream.ForEachDStream",
org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)

- element of array (index: 0)

- array (class "[Ljava.lang.Object;", size: 16)

- field (class "scala.collection.mutable.ArrayBuffer", name:
"array", type: "class [Ljava.lang.Object;")

- object (class "scala.collection.mutable.ArrayBuffer",
ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))

- field (class "org.apache.spark.streaming.DStreamGraph", name:
"outputStreams", type: "class scala.collection.mutable.ArrayBuffer")

- custom writeObject data (class
"org.apache.spark.streaming.DStreamGraph")

- object (class "org.apache.spark.streaming.DStreamGraph",
org.apache.spark.streaming.DStreamGraph@776ae7da)

- field (class "org.apache.spark.streaming.Checkpoint", name:
"graph", type: "class org.apache.spark.streaming.DStreamGraph")

- root object (class "org.apache.spark.streaming.Checkpoint",
org.apache.spark.streaming.Checkpoint@5eade065)

at java.io.ObjectOutputStream.writeObject0(Unknown Source)

 

 

 



Re: Replacing Jetty with TomCat

2015-02-16 Thread Sean Owen
There's no particular reason you have to remove the embedded Jetty
server, right? It doesn't prevent you from using it inside another app
that happens to run in Tomcat. You won't be able to switch it out
without rewriting a fair bit of code, no, but you don't need to.
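
For example, a webapp deployed in Tomcat can own the SparkContext through a plain
servlet context listener (an illustrative sketch with made-up names; "local[*]" is
only for demonstration), while Spark's embedded Jetty just keeps serving the Spark
web UI on its own port:

import javax.servlet.{ServletContextEvent, ServletContextListener}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch: the Tomcat-hosted webapp creates and owns the SparkContext;
// the embedded Jetty inside Spark is left alone.
class SparkLifecycleListener extends ServletContextListener {

  private var sc: SparkContext = _

  def contextInitialized(event: ServletContextEvent): Unit = {
    val conf = new SparkConf().setAppName("embedded-in-tomcat").setMaster("local[*]")
    sc = new SparkContext(conf)
    event.getServletContext.setAttribute("sparkContext", sc)
  }

  def contextDestroyed(event: ServletContextEvent): Unit = {
    if (sc != null) sc.stop()
  }
}

The listener is registered in the webapp's web.xml with a standard <listener> element.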

On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera
 wrote:
> Hi,
>
> We are thinking of integrating Spark server inside a product. Our current
> product uses Tomcat as its webserver.
>
> Is it possible to switch the Jetty webserver in Spark to Tomcat
> off-the-shelf?
>
> Cheers
>
> --
> Niranda

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org