Hi Du,

I didn't notice the ticket had been updated recently. SPARK-2848 is a sub-task of 
SPARK-2420, and it is already marked resolved for Spark 1.1.0. It looks like SPARK-2420 
itself will be released with Spark 1.2.0, according to the current JIRA status.

I'm tracking branch-1.1 instead of master and haven't seen the change merged there. 
I still see Guava 14.0.1, so I don't think SPARK-2848 has been merged yet.

It would be great if someone could confirm or clarify the expectation.
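
For reference, this is roughly how I'm checking the Guava version in my branch-1.1
build (just a sketch; the exact paths depend on whether you build with Maven or sbt):

  grep -n guava pom.xml                                   # version declared in the root pom
  mvn dependency:tree -Dincludes=com.google.guava:guava   # what Maven actually resolves
  ls lib_managed/jars 2>/dev/null | grep -i guava         # jars retrieved by an sbt build, if present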
> From: l...@yahoo-inc.com.INVALID
> To: van...@cloudera.com; alee...@hotmail.com
> CC: user@spark.apache.org
> Subject: Re: Hive From Spark
> Date: Sat, 23 Aug 2014 00:08:47 +0000
> 
> I thought the fix had been pushed to the Apache master, per the commit
> "[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my
> previous email was based on my own build of the Apache master, which turned
> out not to be working yet.
> 
> Marcelo: Please correct me if I got that commit wrong.
> 
> Thanks,
> Du
> 
> 
> 
> On 8/22/14, 11:41 AM, "Marcelo Vanzin" <van...@cloudera.com> wrote:
> 
> >SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
> >be too risky at this point.
> >
> >I'm not familiar with spark-sql.
> >
> >On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee <alee...@hotmail.com> wrote:
> >> Hopefully there will be some progress on SPARK-2420. It looks like shading
> >> is the preferred solution over downgrading.
> >>
> >> Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
> >> 1.1.2?
> >>
> >> By the way, regarding bin/spark-sql: is this more of a debugging tool for
> >> Spark jobs that integrate with Hive?
> >> How do people use spark-sql? I'm trying to understand the rationale and
> >> motivation behind this script. Any ideas?
> >>
> >>
> >>> Date: Thu, 21 Aug 2014 16:31:08 -0700
> >>
> >>> Subject: Re: Hive From Spark
> >>> From: van...@cloudera.com
> >>> To: l...@yahoo-inc.com.invalid
> >>> CC: user@spark.apache.org; u...@spark.incubator.apache.org;
> >>> pwend...@gmail.com
> >>
> >>>
> >>> Hi Du,
> >>>
> >>> I don't believe the Guava change has made it to the 1.1 branch. The
> >>> Guava doc says "hashInt" was added in 12.0, so what's probably
> >>> happening is that you have an old version of Guava in your classpath
> >>> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
> >>> source of your problem.)
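
(A quick note on checking this: whichever Guava copy appears first on the classpath wins,
and Hadoop bundles its own. A rough way to see what is around, assuming a standard
Hadoop 2.x layout; adjust the paths for your distro:

  find "$HADOOP_HOME" -name 'guava-*.jar'       # Hadoop's bundled Guava 11.x
  echo "$SPARK_CLASSPATH" "$HADOOP_CLASSPATH"   # hand-added entries, another common source
)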
> >>>
> >>> On Thu, Aug 21, 2014 at 4:23 PM, Du Li <l...@yahoo-inc.com.invalid>
> >>>wrote:
> >>> > Hi,
> >>> >
> >>> > This guava dependency conflict problem should have been fixed as of
> >>> > yesterday according to
> >>>https://issues.apache.org/jira/browse/SPARK-2420
> >>> >
> >>> > However, I just got
> >>> >
> >>> > java.lang.NoSuchMethodError:
> >>> > com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
> >>> >
> >>> > from the following code snippet and "mvn3 test" on a Mac. I built the latest
> >>> > version of Spark (1.1.0-SNAPSHOT) and installed the jar files into the local
> >>> > Maven repo. In my pom file I explicitly excluded Guava from almost all
> >>> > possible dependencies, such as spark-hive_2.10 (1.1.0-SNAPSHOT) and
> >>> > hadoop-client. This snippet is abstracted from a larger project, so the
> >>> > pom.xml includes many dependencies, although not all of them are required by
> >>> > this snippet. The pom.xml is attached.
> >>> >
> >>> > Does anybody know how to fix it?
> >>> >
> >>> > Thanks,
> >>> > Du
> >>> > -------
> >>> >
> >>> > package com.myself.test
> >>> >
> >>> > import org.scalatest._
> >>> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
> >>> > import org.apache.spark.{SparkContext, SparkConf}
> >>> > import org.apache.spark.SparkContext._
> >>> >
> >>> > class MyRecord(name: String) extends Serializable {
> >>> >   def getWritable(): BytesWritable = {
> >>> >     new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
> >>> >   }
> >>> >
> >>> >   final override def equals(that: Any): Boolean = {
> >>> >     if (!that.isInstanceOf[MyRecord])
> >>> >       false
> >>> >     else {
> >>> >       val other = that.asInstanceOf[MyRecord]
> >>> >       this.getWritable == other.getWritable
> >>> >     }
> >>> >   }
> >>> > }
> >>> >
> >>> > class MyRecordTestSuite extends FunSuite {
> >>> >   // construct an MyRecord by Consumer.schema
> >>> >   val rec: MyRecord = new MyRecord("James Bond")
> >>> >
> >>> >   test("generated SequenceFile should be readable from spark") {
> >>> >     val path = "./testdata/"
> >>> >
> >>> >     val conf = new SparkConf(false).setMaster("local")
> >>> >       .setAppName("test data exchange with Hive")
> >>> >     conf.set("spark.driver.host", "localhost")
> >>> >     val sc = new SparkContext(conf)
> >>> >     val rdd = sc.makeRDD(Seq(rec))
> >>> >     rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
> >>> >       .saveAsSequenceFile(path)
> >>> >
> >>> >     val bytes = sc.sequenceFile(path, classOf[NullWritable],
> >>> >       classOf[BytesWritable]).first._2
> >>> >     assert(rec.getWritable() == bytes)
> >>> >
> >>> >     sc.stop()
> >>> >     System.clearProperty("spark.driver.port")
> >>> >   }
> >>> > }
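
(One way to see which Guava the test JVM really gets, despite the exclusions in the
attached pom.xml; a sketch to run from that project, using the same Maven 3 launcher:

  mvn3 dependency:tree -Dincludes=com.google.guava:guava -Dverbose

The verbose output also lists the paths omitted for conflicts, which shows which
dependency is still pulling Guava in.)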
> >>> >
> >>> >
> >>> > From: Andrew Lee <alee...@hotmail.com>
> >>> > Reply-To: "user@spark.apache.org" <user@spark.apache.org>
> >>> > Date: Monday, July 21, 2014 at 10:27 AM
> >>> > To: "user@spark.apache.org" <user@spark.apache.org>,
> >>> > "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>
> >>> >
> >>> > Subject: RE: Hive From Spark
> >>> >
> >>> > Hi All,
> >>> >
> >>> > Currently, if you are running the Spark HiveContext API with Hive 0.12, it
> >>> > won't work, due to the following two libraries, which are not consistent with
> >>> > Hive 0.12 and Hadoop. (The Hive libs align with the Hadoop libs, and as a
> >>> > common practice they should be kept consistent to interoperate.)
> >>> >
> >>> > These are under discussion in the two JIRA tickets:
> >>> >
> >>> > https://issues.apache.org/jira/browse/HIVE-7387
> >>> >
> >>> > https://issues.apache.org/jira/browse/SPARK-2420
> >>> >
> >>> > When I ran the command after tweaking the classpath and the build for Spark
> >>> > 1.0.1-rc3, I was able to create a table through HiveContext; however, when I
> >>> > fetched the data, it broke due to incompatible API calls in Guava. This is
> >>> > critical, since that code path maps the columns to the RDD schema.
> >>> >
> >>> > Hive and Hadoop use an older version of the Guava libraries (11.0.1), whereas
> >>> > Spark's Hive support uses Guava 14.0.1+. The community isn't willing to
> >>> > downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12.
> >>> > Be aware of the protobuf version as well in Hive 0.12 (it uses protobuf 2.4).
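
(For anyone hitting this, a quick inventory of the conflicting jars on a Hadoop 2.2 /
Hive 0.12 node; standard tarball layouts assumed, adjust for your distro:

  find "$HADOOP_HOME" -name 'guava-*.jar' -o -name 'protobuf-java-*.jar'
  ls "$HIVE_HOME"/lib | egrep 'guava|protobuf'
)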
> >>> >
> >>> > scala>
> >>> >
> >>> > scala> import org.apache.spark.SparkContext
> >>> > import org.apache.spark.SparkContext
> >>> >
> >>> > scala> import org.apache.spark.sql.hive._
> >>> > import org.apache.spark.sql.hive._
> >>> >
> >>> > scala>
> >>> >
> >>> > scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> >>> > hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a
> >>> >
> >>> > scala>
> >>> >
> >>> > scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> >>> > res0: org.apache.spark.sql.SchemaRDD =
> >>> > SchemaRDD[0] at RDD at SchemaRDD.scala:104
> >>> > == Query Plan ==
> >>> > <Native command: executed by Hive>
> >>> >
> >>> > scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> >>> > res1: org.apache.spark.sql.SchemaRDD =
> >>> > SchemaRDD[3] at RDD at SchemaRDD.scala:104
> >>> > == Query Plan ==
> >>> > <Native command: executed by Hive>
> >>> >
> >>> > scala>
> >>> >
> >>> > scala> // Queries are expressed in HiveQL
> >>> >
> >>> > scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
> >>> > java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
> >>> >   at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
> >>> >   at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
> >>> >   at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
> >>> >   at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
> >>> >   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >>> >   at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
> >>> >   at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
> >>> >   at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
> >>> >   at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
> >>> >   at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
> >>> >   at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
> >>> >   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
> >>> >   at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
> >>> >   at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
> >>> >   at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
> >>> >   at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
> >>> >   at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
> >>> >   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> >>> >   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
> >>> >   at org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:60)
> >>> >   at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:70)
> >>> >   at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
> >>> >   at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
> >>> >   at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:280)
> >>> >   at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:69)
> >>> >   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> >>> >   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> >>> >   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >>> >   at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> >>> >   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
> >>> >   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
> >>> >   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
> >>> >   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
> >>> >   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:420)
> >>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
> >>> >   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
> >>> >   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
> >>> >   at $iwC$$iwC$$iwC.<init>(<console>:28)
> >>> >   at $iwC$$iwC.<init>(<console>:30)
> >>> >   at $iwC.<init>(<console>:32)
> >>> >   at <init>(<console>:34)
> >>> >   at .<init>(<console>:38)
> >>> >   at .<clinit>(<console>)
> >>> >   at .<init>(<console>:7)
> >>> >   at .<clinit>(<console>)
> >>> >   at $print(<console>)
> >>> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>> >   at java.lang.reflect.Method.invoke(Method.java:606)
> >>> >   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
> >>> >   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
> >>> >   at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
> >>> >   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
> >>> >   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
> >>> >   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
> >>> >   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
> >>> >   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
> >>> >   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
> >>> >   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
> >>> >   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
> >>> >   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
> >>> >   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
> >>> >   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
> >>> >   at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> >>> >   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
> >>> >   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
> >>> >   at org.apache.spark.repl.Main$.main(Main.scala:31)
> >>> >   at org.apache.spark.repl.Main.main(Main.scala)
> >>> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>> >   at java.lang.reflect.Method.invoke(Method.java:606)
> >>> >   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
> >>> >   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> >>> >   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >>> >
> >>> >
> >>> >
> >>> >> From: hao.ch...@intel.com
> >>> >> To: user@spark.apache.org; u...@spark.incubator.apache.org
> >>> >> Subject: RE: Hive From Spark
> >>> >> Date: Mon, 21 Jul 2014 01:14:19 +0000
> >>> >>
> >>> >> JiaJia, I've checked out the latest 1.0 branch and then done the following
> >>> >> steps:
> >>> >> SPARK_HIVE=true sbt/sbt clean assembly
> >>> >> cd examples
> >>> >> ../bin/run-example sql.hive.HiveFromSpark
> >>> >>
> >>> >> It works well on my local machine.
> >>> >>
> >>> >> Your log output shows "Invalid method name: 'get_table'", which suggests an
> >>> >> incompatible jar version or something wrong between the Hive metastore
> >>> >> service and the client. Can you double-check the jar versions of the Hive
> >>> >> metastore service or Thrift?
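
(A rough way to do that comparison, assuming standard tarball layouts: run this on the
metastore host and on the client box, then diff the two listings:

  ls "$HIVE_HOME"/lib | egrep 'hive-metastore|hive-exec|libthrift'
)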
> >>> >>
> >>> >>
> >>> >> -----Original Message-----
> >>> >> From: JiajiaJing [mailto:jj.jing0...@gmail.com]
> >>> >> Sent: Saturday, July 19, 2014 7:29 AM
> >>> >> To: u...@spark.incubator.apache.org
> >>> >> Subject: RE: Hive From Spark
> >>> >>
> >>> >> Hi Cheng Hao,
> >>> >>
> >>> >> Thank you very much for your reply.
> >>> >>
> >>> >> Basically, the program runs on Spark 1.0.0 and Hive 0.12.0.
> >>> >>
> >>> >> Some of the environment setup was done by running "SPARK_HIVE=true
> >>> >> sbt/sbt assembly/assembly", distributing the jar to all the workers, and
> >>> >> copying hive-site.xml to Spark's conf dir.
> >>> >>
> >>> >> The program is then run as: " ./bin/run-example
> >>> >> org.apache.spark.examples.sql.hive.HiveFromSpark "
> >>> >>
> >>> >> It's good to know that this example runs well on your machine. Could you
> >>> >> please give me some insight into what you have done as well?
> >>> >>
> >>> >> Thank you very much!
> >>> >>
> >>> >> Jiajia
> >>> >>
> >>> >> --
> >>> >> View this message in context:
> >>> >> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
> >>> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Marcelo
> >>>
> >>>
> >
> >
> >
> >-- 
> >Marcelo
> >
> >
> 
> 
> 
                                          
