Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-27 Thread ๏̯͡๏
OK, I modified the command as per your suggestions:

export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
export HADOOP_CONF_DIR=/apache/hadoop/conf

cd $SPARK_HOME
./bin/spark-sql -v --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
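A quick sanity check that the connector jar at the end of that classpath really contains the JDBC driver class (a sketch using the JDK's jar tool; com/mysql/jdbc/Driver is the class Connector/J 5.1.x ships):

jar tf /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar | grep jdbc/Driver

If nothing prints, the download is bad.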


This brings up the spark-sql prompt. I ran "show tables" and "desc dw_bid"; each throws the exception below.

spark-sql> desc dw_bid;
15/03/26 23:10:14 WARN conf.HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.
15/03/26 23:10:14 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.*
no longer has any effect.  Use hive.hmshandler.retry.* instead
15/03/26 23:10:14 INFO parse.ParseDriver: Parsing command: desc dw_bid
15/03/26 23:10:14 INFO parse.ParseDriver: Parse Completed
15/03/26 23:10:15 INFO metastore.HiveMetaStore: 0: get_table : db=default
tbl=dw_bid
15/03/26 23:10:15 INFO HiveMetaStore.audit: ugi=dvasthi...@corp.ebay.com
ip=unknown-ip-addr cmd=get_table : db=default tbl=dw_bid
15/03/26 23:10:15 INFO spark.SparkContext: Starting job: collect at
SparkPlan.scala:83
15/03/26 23:10:15 INFO scheduler.DAGScheduler: Got job 0 (collect at
SparkPlan.scala:83) with 1 output partitions (allowLocal=false)
15/03/26 23:10:15 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect
at SparkPlan.scala:83)
15/03/26 23:10:15 INFO scheduler.DAGScheduler: Parents of final stage:
List()
15/03/26 23:10:15 INFO scheduler.DAGScheduler: Missing parents: List()
15/03/26 23:10:15 INFO scheduler.DAGScheduler: Submitting Stage 0
(MapPartitionsRDD[1] at map at SparkPlan.scala:83), which has no missing
parents
15/03/26 23:10:16 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/03/26 23:10:16 INFO scheduler.DAGScheduler: Job 0 failed: collect at
SparkPlan.scala:83, took 0.078101 s
15/03/26 23:10:16 ERROR thriftserver.SparkSQLDriver: Failed in [desc dw_bid]
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:79)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:839)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:847)
at org.apache.spark.scheduler.DAGScheduler.org

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian

Hey Deepak,

It seems that your hive-site.xml says your Hive metastore setup is using 
MySQL. If that's not the case, you need to adjust your hive-site.xml 
configuration. As for the version of the MySQL driver, it should match 
your MySQL server version.
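For reference, the backend is decided by the javax.jdo.option.ConnectionURL 
and javax.jdo.option.ConnectionDriverName properties (or hive.metastore.uris 
when connecting to a remote metastore). A quick way to inspect them, assuming 
your hive-site.xml sits under the HADOOP_CONF_DIR you exported:

grep -A1 -E 'javax.jdo.option.Connection(URL|DriverName)|hive.metastore.uris' /apache/hadoop/conf/hive-site.xml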


Cheng

On 3/27/15 11:07 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
I do not use MySQL, I want to read Hive tables from Spark SQL and 
transform them in Spark SQL. Why do I need a MySQL driver? If I still 
need it, which version should I use?
[...]



spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
I am unable to run spark-sql from the command line. I attempted the following:

1)

export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
cd $SPARK_HOME

./bin/spark-sql
2)

export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
cd $SPARK_HOME

./bin/spark-sql --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar


3)

export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
export HADOOP_CONF_DIR=/apache/hadoop/conf
cd $SPARK_HOME
./bin/spark-sql



*Each time I get the exception below:*


Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/26 19:43:49 WARN conf.HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.
15/03/26 19:43:49 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.*
no longer has any effect.  Use hive.hmshandler.retry.* instead
15/03/26 19:43:49 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/03/26 19:43:49 INFO metastore.ObjectStore: ObjectStore, initialize called
15/03/26 19:43:50 INFO DataNucleus.Persistence: Property
datanucleus.cache.level2 unknown - will be ignored
15/03/26 19:43:50 INFO DataNucleus.Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:101)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
... 11 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread ๏̯͡๏
I do not use MySQL; I want to read Hive tables from Spark SQL and transform
them in Spark SQL. Why do I need a MySQL driver? If I still need it, which
version should I use?
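To be concrete about the goal, all I want is to run queries like the
following against our existing Hive tables once the prompt comes up:

spark-sql> show tables;
spark-sql> select * from dw_bid limit 10;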

Assuming I need it, I downloaded the latest version from
http://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.34 and ran
the following commands. I no longer see the above exception; however, I see a
new one.

export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
export HADOOP_CONF_DIR=/apache/hadoop/conf
cd $SPARK_HOME
./bin/spark-sql
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
...
...

spark-sql> show tables;

15/03/26 20:03:57 INFO metastore.HiveMetaStore: 0: get_tables: db=default
pat=.*

15/03/26 20:03:57 INFO HiveMetaStore.audit: ugi=dvasthi...@corp.ebay.com
ip=unknown-ip-addr cmd=get_tables: db=default pat=.*

15/03/26 20:03:58 INFO spark.SparkContext: Starting job: collect at
SparkPlan.scala:83

15/03/26 20:03:58 INFO scheduler.DAGScheduler: Got job 1 (collect at
SparkPlan.scala:83) with 1 output partitions (allowLocal=false)

15/03/26 20:03:58 INFO scheduler.DAGScheduler: Final stage: Stage 1(collect
at SparkPlan.scala:83)

15/03/26 20:03:58 INFO scheduler.DAGScheduler: Parents of final stage:
List()

15/03/26 20:03:58 INFO scheduler.DAGScheduler: Missing parents: List()

15/03/26 20:03:58 INFO scheduler.DAGScheduler: Submitting Stage 1
(MapPartitionsRDD[3] at map at SparkPlan.scala:83), which has no missing
parents

15/03/26 20:03:58 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1

15/03/26 20:03:58 INFO scheduler.StatsReportListener: Finished stage:
org.apache.spark.scheduler.StageInfo@2bfd9c4d

15/03/26 20:03:58 INFO scheduler.DAGScheduler: Job 1 failed: collect at
SparkPlan.scala:83, took 0.005163 s

15/03/26 20:03:58 ERROR thriftserver.SparkSQLDriver: Failed in [show tables]

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:79)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:839)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:847)
at

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what
are you using?

The error you are seeing is common when the correct driver for Spark to
connect to the Hive metastore isn't on the classpath.

As well, I noticed that you're using SPARK_CLASSPATH, which has been
deprecated. Depending on your scenario, you may want to use --jars,
--driver-class-path, or the extraClassPath configuration properties. A good
thread on this topic can be found at
http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3C01a901d0547c$a23ba480$e6b2ed80$@innowireless.com%3E

For example, when I connect to my own Hive metastore via Spark 1.3 (where in
my case the metastore is MySQL), I pass the driver via --driver-class-path:

./bin/spark-sql --master spark://$standalone$:7077 --driver-class-path mysql-connector-$version$.jar
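If you'd rather not pass the flag on every invocation, a sketch of the
equivalent via conf/spark-defaults.conf (these are standard Spark
configuration properties; substitute the real path to your connector jar):

# conf/spark-defaults.conf
spark.driver.extraClassPath    /path/to/mysql-connector-java-5.1.34.jar
spark.executor.extraClassPath  /path/to/mysql-connector-java-5.1.34.jar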

HTH!


On Thu, Mar 26, 2015 at 8:09 PM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I do not use MySQL, I want to read Hive tables from Spark SQL and
 transform them in Spark SQL. Why do I need a MySQL driver? If I still need
 it, which version should I use?
 [...]

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
As the exception suggests, you don't have the MySQL JDBC driver on your 
classpath.
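For example, something along these lines (a sketch; point --driver-class-path 
at wherever your MySQL Connector/J jar actually lives):

./bin/spark-sql --driver-class-path /path/to/mysql-connector-java-5.1.34.jar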



On 3/27/15 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

I am unable to run spark-sql from the command line. I attempted the following:
[...]