Re: hive on spark - why is it so hard?
You should try with TEZ+LLAP as well. Additionally you will need to compare different configurations. Finally, any generic comparison is meaningless on its own: you should use the queries, data, and file formats that your users will actually be using later.

> On 2. Oct 2017, at 03:06, Stephen Sprague wrote:
> [...]
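The "benchmark with your own workload" advice above can be sketched as a small shell loop. This is an illustrative sketch, not a command from the thread: `bench_query.sql` is a placeholder for one of your own production queries, and the engine names are the standard values of `hive.execution.engine`.

```shell
# A sketch: time the same production query under each execution engine.
# "bench_query.sql" is a hypothetical placeholder for one of YOUR queries
# against YOUR data and file formats.
for engine in mr tez spark; do
  start=$(date +%s)
  command -v hive >/dev/null 2>&1 && \
    hive --hiveconf hive.execution.engine="$engine" -f bench_query.sql \
      > /dev/null 2>&1
  echo "$engine: $(( $(date +%s) - start ))s"
done
```

Run it several times and discard the first pass per engine, since container/session warm-up dominates cold runs.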
Re: hive on spark - why is it so hard?
so... i made some progress after much copying of jar files around (as alluded to by Gopal previously on this thread).

following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

building as instructed will leave out about a dozen or so jar files that spark'll need:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

i ended up copying the missing jars to $SPARK_HOME/jars. i would have preferred to just add a path (or paths) to the spark classpath, but i did not find any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH but i don't see the analogous var in spark - i don't think it inherits the hive classpath.

anyway, a simple query is now working under Hive on Spark, so i think i might be over the hump. Now it's a matter of comparing the performance with Tez.

Cheers,
Stephen.

On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague wrote:
> [...]
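The jar-copy workaround described above looks roughly like the sketch below. The paths are the ones given elsewhere in this thread; the `hive-exec-*` glob is only an example, since the actual dozen-or-so missing jars depend on what the "hadoop-provided" build left out.

```shell
# Paths quoted in this thread; adjust for your install.
HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

# Spark has no HIVE_AUX_JARS_PATH equivalent, so the missing jars go
# straight into $SPARK_HOME/jars. The glob below is illustrative only:
# compare the two lib dirs to find the jars your build actually left out.
if [ -d "$HIVE_HOME/lib" ] && [ -d "$SPARK_HOME/jars" ]; then
  cp "$HIVE_HOME"/lib/hive-exec-*.jar "$SPARK_HOME/jars/"
fi
```

Remember this has to be repeated on every node that runs a Spark worker, per the deployment note above.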
Re: hive on spark - why is it so hard?
ok.. getting further. seems now i have to deploy hive to all nodes in the cluster - don't think i had to do that before, but not a big deal to do it now.

for me:
HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

on all three nodes now.

i started the spark master on the namenode and i started spark slaves (2) on two datanodes of the cluster.

so far so good.

now i run my usual test command:

$ hive --hiveconf hive.root.logger=DEBUG,console -e 'set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'

i get a little further now and find the stderr from the Spark Web UI interface (nice) and it reports this:

17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to spark://Worker@172.19.79.127:40145
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
        at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
        at org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
        at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
        at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
        ... 6 more

searching around the internet i find this is probably a compatibility issue.

i know. i know. no surprise here.

so i guess i just got to the point where everybody else is... build spark w/o hive.

lemme see what happens next.

On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague wrote:
> [...]
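A `NoSuchFieldError` like the `SPARK_RPC_SERVER_ADDRESS` one above almost always means two incompatible versions of the same classes are on the classpath. One quick check (a sketch, reusing the path from this thread) is to list every hive jar Spark can see:

```shell
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
# If this lists hive jars from more than one version (e.g. hive 1.2.x jars
# bundled by a "with-hive" Spark build sitting next to a copied
# hive-exec-2.3.0.jar), the remote driver can load an old RpcConfiguration
# class and fail exactly as shown above.
ls "$SPARK_HOME/jars" 2>/dev/null | grep -i '^hive' | sort -u || true
```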
Re: hive on spark - why is it so hard?
thanks. I haven't had a chance to dig into this again today but i do appreciate the pointer. I'll keep you posted.

On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar wrote:
> [...]
Re: hive on spark - why is it so hard?
You can try increasing the value of hive.spark.client.connect.timeout. Would also suggest taking a look at the HoS Remote Driver logs. The driver gets launched in a YARN container (assuming you are running Spark in yarn-client mode), so you just have to find the logs for that container.

--Sahil

On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague wrote:
> [...]

--
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309
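Sahil's two suggestions, sketched as concrete commands. The timeout value is a millisecond count (30000 here is just an example, not a recommendation from the thread), and the YARN application id is a placeholder you would read off the ResourceManager UI.

```shell
# 1) Raise the HoS client connect timeout (hypothetical example value, ms).
timeout_ms=30000
if command -v hive >/dev/null 2>&1; then
  hive --hiveconf hive.execution.engine=spark \
       --hiveconf hive.spark.client.connect.timeout="$timeout_ms" \
       -e 'select 1;'
fi

# 2) Pull the remote driver's container logs from YARN.
#    application_... is a placeholder for your real application id.
if command -v yarn >/dev/null 2>&1; then
  yarn logs -applicationId application_1234567890123_0001
fi
```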
Re: hive on spark - why is it so hard?
i _seem_ to be getting closer. Maybe its just wishful thinking. Here's where i'm at now.

2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: {
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "action" : "CreateSubmissionResponse",
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "message" : "Driver successfully submitted as driver-20170926211038-0003",
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "serverSparkVersion" : "2.2.0",
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "submissionId" : "driver-20170926211038-0003",
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "success" : true
2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: }
2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: closed
2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: stopped, remaining connections 0
2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main] client.SparkClientImpl: Timed out waiting for client to connect.
Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc.
Please check YARN or Spark driver's logs for further information.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
        at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
        at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]

i'll dig some more tomorrow.

On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague wrote:
> [...]
Re: hive on spark - why is it so hard?
oh. i missed Gopal's reply. oy... that sounds foreboding. I'll keep you posted on my progress.

On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan wrote:
> [...]
Re: hive on spark - why is it so hard?
well this is the spark-submit line from above:

2017-09-26T14:04:45,678 INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main] client.SparkClientImpl: Running client driver with argv: /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and that's pretty clearly v2.2. I do have other versions of spark on the namenode so lemme remove those and see what happens.

A-HA! dang it!

$ echo $SPARK_HOME
/usr/local/spark

well that clearly needs to be: /usr/lib/spark-2.2.0-bin-hadoop2.6

how did i miss that? unbelievable. Thank you Sahil! Let's see what happens next!

Cheers,
Stephen

On Tue, Sep 26, 2017 at 4:12 PM, Sahil Takiar wrote:
> Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks
> like your call to spark-submit is using an older version of Spark (looks
> like some early 1.x version). Do you have SPARK_HOME set locally? Do you
> have older versions of Spark installed locally?
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague wrote:
>
>> thanks Sahil. here it is.
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/scheduler/SparkListenerInterface
>>         at java.lang.Class.forName0(Native Method)
>>         at java.lang.Class.forName(Class.java:344)
>>         at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
>>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.SparkListenerInterface
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>         ... 5 more
>>
>>         at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212) ~[hive-exec-2.3.0.jar:2.3.0]
>>         at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500) ~[hive-exec-2.3.0.jar:2.3.0]
>>         at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
>> FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
>> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main] ql.Driver: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
>> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
>>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
>>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
>>         at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
>>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
>>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
>>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
>>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>>         at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
>>         at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
>>         at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
>>         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11253)
>>         at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
>>         at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
>>         at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1316)
>>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
>>         at org.apache.hadoop.
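The root cause above was a stale SPARK_HOME pointing at an old install. A guard like the following could catch that before launching anything; `check_spark_home` is an invented helper, and the paths are the ones from this thread, shown only as examples:

```shell
# Hypothetical sanity check: does $SPARK_HOME's directory name carry the
# Spark version you think you are running? (Only works for the standard
# spark-<version>-bin-<hadoop> naming of binary distributions.)
check_spark_home() {  # usage: check_spark_home <path> <expected-version>
  ver=$(basename "$1" | sed -n 's/^spark-\([0-9][0-9.]*\)-bin.*/\1/p')
  if [ "$ver" = "$2" ]; then
    echo "OK: $1 looks like Spark $2"
  else
    echo "MISMATCH: $1 does not look like Spark $2"
  fi
}

check_spark_home /usr/local/spark 2.2.0                    # the bad setting above
check_spark_home /usr/lib/spark-2.2.0-bin-hadoop2.6 2.2.0  # the intended one
```

A symlink like /usr/local/spark defeats the name check, which is exactly why it hid the problem here; resolving it first with `readlink -f` would be the obvious refinement.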
Re: hive on spark - why is it so hard?
Hi,

> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create
> spark client.

I get inexplicable errors with Hive-on-Spark unless I do a three-step build: build Hive first, use that version to build Spark, then use that Spark version to rebuild Hive. I have to do this to make it work because Spark contains Hive jars and Hive contains Spark jars on the class-path.

And specifically I have to edit the pom.xml files, instead of passing in params with -Dspark.version, because the installed pom files don't get replacements from the build args.

Cheers,
Gopal
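Gopal's three-step loop could be scripted roughly as below. The mvn invocation and the `pin_version` helper are sketches of what he describes, not commands from the thread; only the make-distribution line is the one from the Getting Started page:

```shell
# Rough sketch of the three-step build (commands are illustrative):
#   1. cd hive  && mvn clean install -DskipTests        # build Hive first
#   2. pin that Hive version in Spark's pom.xml, then:
#      ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz \
#          "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
#   3. pin that Spark version in Hive's pom.xml and rebuild Hive.
#
# Because -Dspark.version does not survive into the installed pom files,
# the pin has to be a literal edit of the <properties> entry:
pin_version() {  # usage: pin_version <property-name> <version> <pom-file>
  sed -i "s|<$1>[^<]*</$1>|<$1>$2</$1>|g" "$3"
}
```

For example, `pin_version spark.version 2.2.0 pom.xml` would rewrite `<spark.version>2.0.0</spark.version>` to `<spark.version>2.2.0</spark.version>` in place (GNU sed assumed for `-i`).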
Re: hive on spark - why is it so hard?
Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks like your call to spark-submit is using an older version of Spark (looks like some early 1.x version). Do you have SPARK_HOME set locally? Do you have older versions of Spark installed locally?

--Sahil

On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague wrote:
> thanks Sahil. here it is.
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/scheduler/SparkListenerInterface
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:344)
>         at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.SparkListenerInterface
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         ... 5 more
>
>         at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212) ~[hive-exec-2.3.0.jar:2.3.0]
>         at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500) ~[hive-exec-2.3.0.jar:2.3.0]
>         at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
> FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main] ql.Driver: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
>         at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
>         at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
>         at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>         at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
>         at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
>         at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
>         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11253)
>         at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
>         at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
>         at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1316)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
>         at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>         at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>         at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(
Re: hive on spark - why is it so hard?
thanks Sahil. here it is.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/scheduler/SparkListenerInterface
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:344)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.SparkListenerInterface
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 more

        at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212) ~[hive-exec-2.3.0.jar:2.3.0]
        at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500) ~[hive-exec-2.3.0.jar:2.3.0]
        at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main] ql.Driver: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
        at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
        at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
        at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
        at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
        at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
        at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11253)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1316)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

It bugs me that that class is in spark-core_2.11-2.2.0.jar yet so seemingly out of reach. :(

On Tue, Sep 26, 2017 at 2:44 PM, Sahil Takiar wrote:
> Hey Stephen,
>
> Can you send the full stack trace for the NoClassDefFou
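One quick way to settle the "it's in the jar, why can't it be found?" question is to list the jar's contents directly. `jar_has_class` is an invented helper (it shells out to Python's stdlib zipfile module, so it works even without a JDK on the box); being in a jar and being on the failing JVM's classpath are two different questions, and here the wrong SPARK_HOME meant the right jar was never on the classpath at all:

```shell
# Hypothetical helper: does a jar contain a class file matching a name fragment?
jar_has_class() {  # usage: jar_has_class <jar> <ClassNameFragment>
  python3 -m zipfile -l "$1" | grep -q "$2"
}

# e.g. (path from this thread):
# jar_has_class /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar \
#     SparkListenerInterface
```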
Re: hive on spark - why is it so hard?
Hey Stephen,

Can you send the full stack trace for the NoClassDefFoundError? For Hive 2.3.0 we only support Spark 2.0.0. Hive may work with more recent versions of Spark, but we only test with Spark 2.0.0.

--Sahil

On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague wrote:
> * i've installed hive 2.3 and spark 2.2
>
> * i've read this doc plenty of times -> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> * i run this query:
>
>   hive --hiveconf hive.root.logger=DEBUG,console -e 'set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> * i get this error:
>
>   Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/scheduler/SparkListenerInterface
>
> * this class is in: /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>
> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>
> * i have updated hive-site.xml to set spark.yarn.jars to it.
>
> * i see this in the console:
>
>   2017-09-26T13:34:15,505 INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main] spark.HiveSparkClientFactory: load spark property from hive configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*).
>
> * i see this on the console:
>
>   2017-09-26T14:04:45,678 INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main] client.SparkClientImpl: Running client driver with argv: /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file /tmp/spark-submit.6105784757200912217.properties --class org.apache.hive.spark.client.RemoteDriver /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf hive.spark.client.connect.timeout=1000 --conf hive.spark.client.server.connect.timeout=9 --conf hive.spark.client.channel.log.level=null --conf hive.spark.client.rpc.max.size=52428800 --conf hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256 --conf hive.spark.client.rpc.server.address=null
>
> * i even print out CLASSPATH in this script: /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit
>
>   and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is in it.
>
> so i ask... what am i missing?
>
> thanks,
> Stephen

--
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309
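Sahil's two questions (is SPARK_HOME set, and are older Sparks installed locally?) can be checked mechanically. This sketch assumes the typical install locations seen in this thread; `list_spark_installs` is an invented helper:

```shell
# Illustrative sweep for stray Spark installs (directory layout is assumed).
list_spark_installs() {  # usage: list_spark_installs <dir>...
  for d in "$@"; do
    ls -d "$d"/spark* 2>/dev/null || true   # ignore dirs with no Spark in them
  done
}

list_spark_installs /usr/lib /usr/local /opt
echo "SPARK_HOME=$SPARK_HOME"
```

Any entry that is not the Spark you mean to use, and any SPARK_HOME that does not point at it, is a candidate for the kind of 1.x-vs-2.x mixup diagnosed above.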
hive on spark - why is it so hard?
* i've installed hive 2.3 and spark 2.2

* i've read this doc plenty of times -> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

* i run this query:

  hive --hiveconf hive.root.logger=DEBUG,console -e 'set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'

* i get this error:

  Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/scheduler/SparkListenerInterface

* this class is in: /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar

* i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars

* i have updated hive-site.xml to set spark.yarn.jars to it.

* i see this in the console:

  2017-09-26T13:34:15,505 INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main] spark.HiveSparkClientFactory: load spark property from hive configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*).

* i see this on the console:

  2017-09-26T14:04:45,678 INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main] client.SparkClientImpl: Running client driver with argv: /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file /tmp/spark-submit.6105784757200912217.properties --class org.apache.hive.spark.client.RemoteDriver /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf hive.spark.client.connect.timeout=1000 --conf hive.spark.client.server.connect.timeout=9 --conf hive.spark.client.channel.log.level=null --conf hive.spark.client.rpc.max.size=52428800 --conf hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256 --conf hive.spark.client.rpc.server.address=null

* i even print out CLASSPATH in this script: /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

  and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is in it.

so i ask... what am i missing?

thanks,
Stephen
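The hive-site.xml wiring described above would look roughly like the fragment below. This is a sketch, not the poster's actual file: the HDFS URI is the one from the log, the execution-engine property is the one being set on the command line above, and the timeout property is the one Sahil suggests raising elsewhere in the thread (its value here is an arbitrary example, not a recommendation):

```xml
<!-- Sketch of a hive-site.xml fragment; adjust the HDFS URI for your cluster. -->
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*</value>
</property>
<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>30000ms</value>
</property>
```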