Hi Ted,

I see that you're working on the hbase-spark module for HBase. I recently
packaged the SparkOnHBase project and gave it a test run. It works like a
charm on CDH 5.4 and 5.5. All I had to do was add
/opt/cloudera/parcels/CDH/jars/htrace-core-3.1.0-incubating.jar to the
classpath.txt file in /etc/spark/conf. Then I ran spark-shell with "--jars
/path/to/spark-hbase-0.0.2-clabs.jar" as an argument and used the easy-to-use
HBaseContext for HBase operations. Now I want to use the latest DataFrames
functionality. Since that functionality is only in the hbase-spark module, I
want to know how to get the module and package it for CDH 5.5, which still
ships HBase 1.0.0. Can you tell me which version of the hbase master branch
is still backwards compatible with HBase 1.0.0?
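
In case it helps frame the question, here is roughly how I was planning to
build it. This is just a sketch, assuming the module can be built on its own
from an apache/hbase checkout; I have not worked out which version flags, if
any, CDH 5.5 would need:

git clone https://github.com/apache/hbase.git
cd hbase
# build only the hbase-spark module and the modules it depends on, skip tests
mvn package -pl hbase-spark -am -DskipTests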

By the way, we are using Spark 1.6 if it matters.
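
To make the question concrete, the DataFrames usage I am hoping for looks
roughly like the sketch below, pieced together from the hbase-spark module's
tests. The option names, the "spark" table, and the column mapping are my
assumptions and may not match whatever version ends up working against
HBase 1.0.0:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// the hbase-spark relation appears to use the most recently created HBaseContext
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// map HBase columns to SQL fields; the mapping string is illustrative
val df = sqlContext.read
  .format("org.apache.hadoop.hbase.spark")
  .options(Map(
    "hbase.table" -> "spark",
    "hbase.columns.mapping" -> "KEY_FIELD STRING :key, A_FIELD STRING cf:a"))
  .load()

df.registerTempTable("hbaseTable")
sqlContext.sql("SELECT KEY_FIELD, A_FIELD FROM hbaseTable").show()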

Thanks,
Ben

> On Feb 10, 2016, at 2:34 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> Have you tried adding hbase client jars to spark.executor.extraClassPath ?
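> 
> For example (the jar directory is illustrative and has to exist at the same
> path on every NodeManager host):
> 
> MASTER=yarn-client ./spark-shell \
>   --conf spark.executor.extraClassPath='/usr/lib/hbase/lib/*'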
> 
> Cheers
> 
> On Wed, Feb 10, 2016 at 12:17 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
> + Spark-Dev
> 
> For a Spark job on YARN that accesses an HBase table, I added all the hbase
> client jars to spark.yarn.dist.files. When the NodeManager launches a
> container (i.e. an executor), it does localization and brings all the
> hbase-client jars into the executor's CWD, but the executor tasks still fail
> with ClassNotFoundException on hbase client classes. When I checked
> launch_container.sh, the CLASSPATH does not include $PWD/*, so all the hbase
> client jars are ignored.
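> 
> Roughly what I ran (only two jars shown for illustration; the real command
> lists all of the hbase client jars):
> 
> MASTER=yarn-client ./spark-shell \
>   --conf spark.yarn.dist.files=/usr/lib/hbase/lib/hbase-client.jar,/usr/lib/hbase/lib/hbase-common.jar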
> 
> Is spark.yarn.dist.files not meant for adding jars to the executor classpath?
> 
> Thanks,
> Prabhu Joseph 
> 
> On Tue, Feb 9, 2016 at 1:42 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
> Hi All,
> 
> When I do a count on an HBase table from the Spark shell running in
> yarn-client mode, the job fails at count().
> 
> MASTER=yarn-client ./spark-shell
> 
> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
> import org.apache.hadoop.hbase.client.HBaseAdmin
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> 
> // point TableInputFormat at the "spark" table
> val conf = HBaseConfiguration.create()
> conf.set(TableInputFormat.INPUT_TABLE, "spark")
> 
> // build an RDD over the table and count its rows
> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>   classOf[org.apache.hadoop.hbase.client.Result])
> hBaseRDD.count()
> 
> 
> Tasks throw the exception below; the actual exception is swallowed due to a
> JDK bug (JDK-7172206). After installing the hbase client on all NodeManager
> machines, the Spark job ran fine, so I confirmed that the issue is with the
> executor classpath.
> 
> But I am searching for some other way of including the hbase jars in the
> Spark executor classpath instead of installing the hbase client on all NM
> machines. I tried adding all the hbase jars to spark.yarn.dist.files; the NM
> logs show that it localized all of them, but the job still fails. I also
> tried spark.executor.extraClassPath; the job still fails.
> 
> Is there any way to access HBase from the executors without installing
> hbase-client on all machines?
> 
> 
> 16/02/09 02:34:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, prabhuFS1): java.lang.IllegalStateException: unread block data
>         at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2428)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 
> 
> 
> Thanks,
> Prabhu Joseph
> 
> 
