I thought the fix had been pushed to the apache master; ref. commit "[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my previous email was based on my own build of the apache master, which turned out not to be working yet.
Marcelo: Please correct me if I got that commit wrong.

Thanks,
Du

On 8/22/14, 11:41 AM, "Marcelo Vanzin" <van...@cloudera.com> wrote:

> SPARK-2420 is fixed. I don't think it will be in 1.1, though - it might
> be too risky at this point.
>
> I'm not familiar with spark-sql.
>
> On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee <alee...@hotmail.com> wrote:
>> Hopefully there could be some progress on SPARK-2420. It looks like
>> shading may be the preferred solution over downgrading.
>>
>> Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
>> 1.1.2?
>>
>> By the way, regarding bin/spark-sql: is this more of a debugging tool for
>> Spark jobs integrating with Hive? How do people use spark-sql? I'm trying
>> to understand the rationale and motivation behind this script. Any idea?
>>
>>> Date: Thu, 21 Aug 2014 16:31:08 -0700
>>> Subject: Re: Hive From Spark
>>> From: van...@cloudera.com
>>> To: l...@yahoo-inc.com.invalid
>>> CC: user@spark.apache.org; u...@spark.incubator.apache.org; pwend...@gmail.com
>>>
>>> Hi Du,
>>>
>>> I don't believe the Guava change has made it to the 1.1 branch. The
>>> Guava doc says "hashInt" was added in 12.0, so what's probably
>>> happening is that you have an old version of Guava in your classpath
>>> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
>>> source of your problem.)
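A quick way to check Marcelo's diagnosis in a given build is to ask the JVM which jar actually provides the Guava HashFunction class and whether it has the hashInt method added in 12.0. The sketch below is hypothetical (the object name and standalone entry point are not from this thread); it only uses standard Java reflection.

// Minimal diagnostic sketch (hypothetical, not from the thread): report which
// jar provides com.google.common.hash.HashFunction at runtime and whether it
// carries hashInt(int), which only exists in Guava 12.0 and later.
object GuavaClasspathCheck {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("com.google.common.hash.HashFunction")
    // The code source is the jar that won the classpath ordering.
    println("HashFunction loaded from: " + cls.getProtectionDomain.getCodeSource.getLocation)
    // False here means an older Guava (e.g. Hadoop's 11.x) is shadowing the
    // version Spark was built against.
    println("has hashInt(int): " + cls.getMethods.exists(_.getName == "hashInt"))
  }
}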
>>> On Thu, Aug 21, 2014 at 4:23 PM, Du Li <l...@yahoo-inc.com.invalid> wrote:
>>> > Hi,
>>> >
>>> > This guava dependency conflict problem should have been fixed as of
>>> > yesterday according to https://issues.apache.org/jira/browse/SPARK-2420
>>> >
>>> > However, I just got
>>> > java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
>>> > from the following code snippet and "mvn3 test" on Mac. I built the latest
>>> > version of Spark (1.1.0-SNAPSHOT) and installed the jar files to the local
>>> > Maven repo. In my pom file I explicitly excluded Guava from almost all
>>> > possible dependencies, such as spark-hive_2.10-1.1.0-SNAPSHOT and
>>> > hadoop-client. This snippet is abstracted from a larger project, so the
>>> > pom.xml includes many dependencies although not all are required by this
>>> > snippet. The pom.xml is attached.
>>> >
>>> > Does anybody know how to fix it?
>>> >
>>> > Thanks,
>>> > Du
>>> > -------
>>> >
>>> > package com.myself.test
>>> >
>>> > import org.scalatest._
>>> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
>>> > import org.apache.spark.{SparkContext, SparkConf}
>>> > import org.apache.spark.SparkContext._
>>> >
>>> > // A record whose payload round-trips through a Hadoop BytesWritable.
>>> > class MyRecord(name: String) extends Serializable {
>>> >   def getWritable(): BytesWritable = {
>>> >     new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
>>> >   }
>>> >
>>> >   final override def equals(that: Any): Boolean = {
>>> >     if (!that.isInstanceOf[MyRecord])
>>> >       false
>>> >     else {
>>> >       val other = that.asInstanceOf[MyRecord]
>>> >       this.getWritable == other.getWritable
>>> >     }
>>> >   }
>>> > }
>>> >
>>> > class MyRecordTestSuite extends FunSuite {
>>> >   // construct a MyRecord by Consumer.schema
>>> >   val rec: MyRecord = new MyRecord("James Bond")
>>> >
>>> >   test("generated SequenceFile should be readable from spark") {
>>> >     val path = "./testdata/"
>>> >
>>> >     val conf = new SparkConf(false).setMaster("local")
>>> >       .setAppName("test data exchange with Hive")
>>> >     conf.set("spark.driver.host", "localhost")
>>> >     val sc = new SparkContext(conf)
>>> >     // Write the record out as a SequenceFile, then read it back.
>>> >     val rdd = sc.makeRDD(Seq(rec))
>>> >     rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
>>> >       .saveAsSequenceFile(path)
>>> >
>>> >     val bytes = sc.sequenceFile(path, classOf[NullWritable],
>>> >       classOf[BytesWritable]).first._2
>>> >     assert(rec.getWritable() == bytes)
>>> >
>>> >     sc.stop()
>>> >     System.clearProperty("spark.driver.port")
>>> >   }
>>> > }
>>> >
>>> >
>>> > From: Andrew Lee <alee...@hotmail.com>
>>> > Reply-To: "user@spark.apache.org" <user@spark.apache.org>
>>> > Date: Monday, July 21, 2014 at 10:27 AM
>>> > To: "user@spark.apache.org" <user@spark.apache.org>,
>>> > "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>
>>> > Subject: RE: Hive From Spark
>>> >
>>> > Hi All,
>>> >
>>> > Currently, if you are running the Spark HiveContext API with Hive 0.12, it
>>> > won't work, due to the following two libraries which are not consistent with
>>> > Hive 0.12 and Hadoop. (Hive libs align with Hadoop libs and, as a common
>>> > practice, they should be consistent to be interoperable.)
>>> >
>>> > These are under discussion in the two JIRA tickets:
>>> >
>>> > https://issues.apache.org/jira/browse/HIVE-7387
>>> > https://issues.apache.org/jira/browse/SPARK-2420
>>> >
>>> > When I ran the command by tweaking the classpath and build for Spark
>>> > 1.0.1-rc3, I was able to create a table through HiveContext; however, when I
>>> > fetched the data, it broke due to incompatible API calls in Guava. This is
>>> > critical since it needs to map the columns to the RDD schema.
>>> >
>>> > Hive and Hadoop are using an older version of the Guava libraries (11.0.1),
>>> > while Spark Hive is using Guava 14.0.1+. The community isn't willing to
>>> > downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12.
>>> > Be aware of the protobuf version as well in Hive 0.12 (it uses protobuf 2.4).
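For reference, the kind of Guava exclusion Du describes in his pom.xml might look roughly like this in an sbt build. This is a hypothetical sketch with assumed coordinates and versions, not the attached pom.xml; Andrew's failing spark-shell session follows below.

// Hypothetical sbt fragment (assumed artifact names and versions): exclude
// Guava from the dependencies that drag in conflicting copies, then pin the
// one version that should end up on the classpath.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-hive" % "1.1.0-SNAPSHOT" exclude("com.google.guava", "guava"),
  "org.apache.hadoop" % "hadoop-client" % "2.2.0" exclude("com.google.guava", "guava"),
  // Pin one Guava explicitly so only this copy goes into the assembly.
  "com.google.guava" % "guava" % "14.0.1"
)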
>>> > scala> import org.apache.spark.SparkContext
>>> > import org.apache.spark.SparkContext
>>> >
>>> > scala> import org.apache.spark.sql.hive._
>>> > import org.apache.spark.sql.hive._
>>> >
>>> > scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> > hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a
>>> >
>>> > scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>>> > res0: org.apache.spark.sql.SchemaRDD =
>>> > SchemaRDD[0] at RDD at SchemaRDD.scala:104
>>> > == Query Plan ==
>>> > <Native command: executed by Hive>
>>> >
>>> > scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>> > res1: org.apache.spark.sql.SchemaRDD =
>>> > SchemaRDD[3] at RDD at SchemaRDD.scala:104
>>> > == Query Plan ==
>>> > <Native command: executed by Hive>
>>> >
>>> > scala> // Queries are expressed in HiveQL
>>> >
>>> > scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
>>> > java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
>>> >   at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
>>> >   at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
>>> >   at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
>>> >   at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
>>> >   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>> >   at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
>>> >   at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
>>> >   at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
>>> >   at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
>>> >   at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
>>> >   at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
>>> >   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
>>> >   at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
>>> >   at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
>>> >   at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
>>> >   at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
>>> >   at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
>>> >   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>>> >   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
>>> >   at org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:60)
>>> >   at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:70)
>>> >   at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
>>> >   at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
>>> >   at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:280)
>>> >   at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:69)
>>> >   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>> >   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>> >   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>> >   at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>>> >   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
>>> >   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
>>> >   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
>>> >   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
>>> >   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:420)
>>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>>> >   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>>> >   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>>> >   at $iwC$$iwC$$iwC.<init>(<console>:28)
>>> >   at $iwC$$iwC.<init>(<console>:30)
>>> >   at $iwC.<init>(<console>:32)
>>> >   at <init>(<console>:34)
>>> >   at .<init>(<console>:38)
>>> >   at .<clinit>(<console>)
>>> >   at .<init>(<console>:7)
>>> >   at .<clinit>(<console>)
>>> >   at $print(<console>)
>>> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >   at java.lang.reflect.Method.invoke(Method.java:606)
>>> >   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
>>> >   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
>>> >   at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>>> >   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
>>> >   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>>> >   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
>>> >   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
>>> >   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
>>> >   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
>>> >   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
>>> >   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
>>> >   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
>>> >   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>>> >   at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>>> >   at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>> >   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
>>> >   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
>>> >   at org.apache.spark.repl.Main$.main(Main.scala:31)
>>> >   at org.apache.spark.repl.Main.main(Main.scala)
>>> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >   at java.lang.reflect.Method.invoke(Method.java:606)
>>> >   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
>>> >   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>>> >   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> >
>>> >
>>> >> From: hao.ch...@intel.com
>>> >> To: user@spark.apache.org; u...@spark.incubator.apache.org
>>> >> Subject: RE: Hive From Spark
>>> >> Date: Mon, 21 Jul 2014 01:14:19 +0000
>>> >>
>>> >> JiaJia, I've checked out the latest 1.0 branch and then done the following steps:
>>> >> SPARK_HIVE=true sbt/sbt clean assembly
>>> >> cd examples
>>> >> ../bin/run-example sql.hive.HiveFromSpark
>>> >>
>>> >> It works well on my local machine.
>>> >>
>>> >> Your log output shows "Invalid method name: 'get_table'", which suggests an
>>> >> incompatible jar version or something wrong between the Hive metastore
>>> >> service and the client. Can you double-check the jar versions of the Hive
>>> >> metastore service or Thrift?
>>> >>
>>> >>
>>> >> -----Original Message-----
>>> >> From: JiajiaJing [mailto:jj.jing0...@gmail.com]
>>> >> Sent: Saturday, July 19, 2014 7:29 AM
>>> >> To: u...@spark.incubator.apache.org
>>> >> Subject: RE: Hive From Spark
>>> >>
>>> >> Hi Cheng Hao,
>>> >>
>>> >> Thank you very much for your reply.
>>> >>
>>> >> Basically, the program runs on Spark 1.0.0 and Hive 0.12.0.
>>> >>
>>> >> Some setup of the environment is done by running "SPARK_HIVE=true
>>> >> sbt/sbt assembly/assembly", including the jar in all the workers, and
>>> >> copying hive-site.xml to Spark's conf dir.
>>> >>
>>> >> The program is then run as: "./bin/run-example
>>> >> org.apache.spark.examples.sql.hive.HiveFromSpark"
>>> >>
>>> >> It's good to know that this example runs well on your machine. Could you
>>> >> please give me some insight into what you have done as well?
>>> >>
>>> >> Thank you very much!
>>> >>
>>> >> Jiajia
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
>>> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
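For context, the sql.hive.HiveFromSpark example discussed above boils down to roughly the following. This is a minimal sketch against the 1.0/1.1-era API (hql was the HiveQL entry point at the time, as in the spark-shell session earlier in the thread), and it assumes hive-site.xml is on Spark's conf path.

// Rough sketch of what the HiveFromSpark example exercises (assumed
// 1.0/1.1-era API): create a HiveContext, create a table, load the bundled
// sample data, and read it back.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveFromSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveFromSparkSketch"))
    val hiveContext = new HiveContext(sc)

    hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
    hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)

    sc.stop()
  }
}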
>>> --
>>> Marcelo
>
> --
> Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org