Re: Hive From Spark: Jdbc VS sparkContext
Hey,

Finally I improved the spark-hive SQL performance a lot. I had a problem with a topology_script.py that produced huge error traces in the logs and degraded Spark performance in Python mode; I corrected the python2 scripts to be python3-ready. I also had a problem with broadcast variables while joining tables; I just deactivated that functionality. As a result our users are now able to use spark-hive with very limited resources (2 executors with 4 cores) and get decent performance for analytics.

Compared to JDBC Presto, this has several advantages:
- integrated solution
- single security layer (hive/kerberos)
- direct partitioned lazy datasets versus complicated jdbc dataset management
- more robust for analytics with less memory (apparently)

However Presto still makes sense for sub-second analytics, OLTP-like queries and data discovery.

Le 05 nov. 2017 à 13:57, Nicolas Paris écrivait :
> Hi
>
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
>
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers.
> [...]
> In order to load huge datasets into spark, the proposed approach is to
> use presto distributed CTAS to build an ORC dataset, and access that
> dataset from the spark dataframe loader (instead of direct jdbc
> access that would break the driver memory).
> [...]

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Hive From Spark: Jdbc VS sparkContext
Yes, my thought exactly. Kindly let me know if you need any help porting it to pyspark.

On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris wrote:
> Le 05 nov. 2017 à 22:46, ayan guha écrivait :
> > Thank you for the clarification. That was my understanding too. However, how do we
> > provide the upper bound, as it changes for every call in real life? For example
> > it is not required for sqoop.
>
> True. AFAIK sqoop begins by doing a
> "select min(column_split), max(column_split) from () as query;"
> and then splits the result.
>
> I was thinking of doing the same with a wrapper around spark jdbc that would
> infer the number of partitions and the upper/lower bounds itself.

-- Best Regards, Ayan Guha
Re: Hive From Spark: Jdbc VS sparkContext
Le 05 nov. 2017 à 22:46, ayan guha écrivait :
> Thank you for the clarification. That was my understanding too. However, how do we
> provide the upper bound, as it changes for every call in real life? For example
> it is not required for sqoop.

True. AFAIK sqoop begins by doing a
"select min(column_split), max(column_split) from () as query;"
and then splits the result.

I was thinking of doing the same with a wrapper around spark jdbc that would
infer the number of partitions and the upper/lower bounds itself.
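The min/max inference described above can be sketched in plain Python. This is a hypothetical wrapper (`infer_bounds` is not part of any Spark or sqoop API), demonstrated here against an in-memory SQLite table standing in for the JDBC source:

```python
import sqlite3

def infer_bounds(conn, table, split_column):
    """Run a min/max query (as sqoop does) to infer partition bounds."""
    cur = conn.execute(
        f"SELECT MIN({split_column}), MAX({split_column}) FROM {table}"
    )
    lower, upper = cur.fetchone()
    return lower, upper

# Demo against an in-memory SQLite table standing in for the JDBC source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])
lower, upper = infer_bounds(conn, "people", "id")
print(lower, upper)  # 1 100
```

With Spark, the returned values would then be passed as the lowerBound/upperBound options of spark.read.jdbc, together with numPartitions, instead of asking the user to supply them on every call.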
Re: Hive From Spark: Jdbc VS sparkContext
Thank you for the clarification. That was my understanding too. However, how do we provide the upper bound, as it changes for every call in real life? For example, it is not required for sqoop.

On Mon, 6 Nov 2017 at 8:20 am, Nicolas Paris wrote:
> Le 05 nov. 2017 à 22:02, ayan guha écrivait :
> > Can you confirm if the JDBC DF Reader actually loads all data from the source to
> > driver memory and then distributes it to the executors?
>
> apparently yes, when not using a partition column
>
> > And this is true even when a
> > partition column is provided?
>
> No, in this case each worker sends a jdbc call according to the documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases

-- Best Regards, Ayan Guha
Re: Hive From Spark: Jdbc VS sparkContext
Le 05 nov. 2017 à 22:02, ayan guha écrivait :
> Can you confirm if the JDBC DF Reader actually loads all data from the source to
> driver memory and then distributes it to the executors?

apparently yes, when not using a partition column

> And this is true even when a
> partition column is provided?

No, in this case each worker sends a jdbc call according to the documentation:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
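For reference, when a partition column and bounds are provided, Spark splits the column range into stride-sized predicates, and each worker issues one query with its own WHERE clause. Below is a simplified, hypothetical re-implementation of that splitting logic (the real version lives in Spark's JDBCRelation; names and exact clause forms here are illustrative):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] into num_partitions stride-sized WHERE clauses,
    roughly mimicking Spark's JDBC partitioning. Each executor would run
    one query filtered by one of these predicates."""
    stride = (upper - lower) // num_partitions
    preds = []
    current = lower
    for i in range(num_partitions):
        lb = f"{column} >= {current}" if i > 0 else None
        current += stride
        ub = f"{column} < {current}" if i < num_partitions - 1 else None
        if lb and ub:
            preds.append(f"{lb} AND {ub}")
        elif lb:
            preds.append(lb)        # last partition: open-ended upper bound
        else:
            # first partition also picks up NULL split-column values
            preds.append(f"{ub} OR {column} IS NULL")
    return preds

for p in jdbc_partition_predicates("id", 0, 100, 4):
    print(p)
```

Rows outside [lowerBound, upperBound] are still read (the first and last predicates are open-ended); the bounds only decide how the range is striped across partitions.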
Re: Hive From Spark: Jdbc VS sparkContext
Hi,

Can you confirm if the JDBC DF Reader actually loads all data from the source to driver memory and then distributes it to the executors? And is this true even when a partition column is provided?

Best
Ayan

On Mon, Nov 6, 2017 at 3:00 AM, David Hodeffi <david.hode...@niceactimize.com> wrote:
> Testing Spark group e-mail

-- Best Regards, Ayan Guha
RE: Hive From Spark: Jdbc VS sparkContext
Testing Spark group e-mail
Re: Hive From Spark: Jdbc VS sparkContext
Le 05 nov. 2017 à 14:11, Gourav Sengupta écrivait :
> thanks a ton for your kind response. Have you used SPARK Session ? I think that
> hiveContext is a very old way of solving things in SPARK, and since then new
> algorithms have been introduced in SPARK.

I will try out sparkSession.

> It will be a lot of help, given how kind you have been by sharing your
> experience, if you could kindly share your code as well and provide details
> like SPARK, HADOOP, HIVE, and other environment version and details.

I am testing an HDP 2.6 distribution with:
- SPARK: 2.1.1
- HADOOP: 2.7.3
- HIVE: 1.2.1000
- PRESTO: 1.87

> After all, no one wants to use SPARK 1.x version to solve problems anymore,
> though I have seen a couple of companies who are stuck with these versions as
> they are using in-house deployments which they cannot upgrade because of
> incompatibility issues.

I didn't know hiveContext was the legacy spark way. I will give sparkSession a try and conclude. After all, I would prefer to provide our users a unique and uniform framework such as spark, instead of multiple complicated layers such as spark + whatever jdbc access.

> Regards,
> Gourav Sengupta
>
> On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris wrote:
> > Hi
> >
> > After some testing, I have been quite disappointed with the hiveContext way of
> > accessing hive tables.
> > [...]

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Hive From Spark: Jdbc VS sparkContext
Hi Nicolas,

Thanks a ton for your kind response. Have you used SPARK Session? I think that hiveContext is a very old way of solving things in SPARK, and since then new algorithms have been introduced in SPARK.

It will be a lot of help, given how kind you have been by sharing your experience, if you could kindly share your code as well and provide details like SPARK, HADOOP, HIVE, and other environment versions and details. After all, no one wants to use a SPARK 1.x version to solve problems anymore, though I have seen a couple of companies who are stuck with these versions as they are using in-house deployments which they cannot upgrade because of incompatibility issues.

Regards,
Gourav Sengupta

On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris wrote:
> Hi
>
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
>
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers.
> [...]
Re: Hive From Spark: Jdbc VS sparkContext
Hi

After some testing, I have been quite disappointed with the hiveContext way of accessing hive tables.

The main problem is resource allocation: I have tons of users and they get a limited subset of workers. Then this does not allow querying huge datasets, because too little memory is allocated (or maybe I am missing something).

If using Hive jdbc, Hive resources are shared by all my users and then queries are able to finish.

Then I have been testing other jdbc-based approaches and for now, "presto" looks like the most appropriate solution to access hive tables.

In order to load huge datasets into spark, the proposed approach is to use presto distributed CTAS to build an ORC dataset, and access that dataset from the spark dataframe loader (instead of direct jdbc access that would break the driver memory).

Le 15 oct. 2017 à 19:24, Gourav Sengupta écrivait :
> Hi Nicolas,
>
> without the hive thrift server, if you try to run a select * on a table which
> has around 10,000 partitions, SPARK will give you some surprises. PRESTO works
> fine in these scenarios, and I am sure the SPARK community will soon learn from
> their algorithms.
> [...]

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Hive From Spark: Jdbc VS sparkContext
Hi Nicolas,

Without the hive thrift server, if you try to run a select * on a table which has around 10,000 partitions, SPARK will give you some surprises. PRESTO works fine in these scenarios, and I am sure the SPARK community will soon learn from their algorithms.

Regards,
Gourav

On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris wrote:
> > I do not think that SPARK will automatically determine the partitions. Actually
> > it does not automatically determine the partitions. In case a table has a few
> > million records, it all goes through the driver.
>
> Hi Gourav
>
> Actually the spark jdbc driver is able to deal directly with partitions.
> Spark creates a jdbc connection for each partition.
>
> All details are explained in this post:
> http://www.gatorsmile.io/numpartitionsinjdbc/
>
> Also an example with the greenplum database:
> http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
Re: Hive From Spark: Jdbc VS sparkContext
> I do not think that SPARK will automatically determine the partitions. Actually
> it does not automatically determine the partitions. In case a table has a few
> million records, it all goes through the driver.

Hi Gourav

Actually the spark jdbc driver is able to deal directly with partitions.
Spark creates a jdbc connection for each partition.

All details are explained in this post:
http://www.gatorsmile.io/numpartitionsinjdbc/

Also an example with the greenplum database:
http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
Re: Hive From Spark: Jdbc VS sparkContext
Hi Gourav

> what if the table has partitions and sub-partitions?

Well, this also works with multiple orc files having the same schema:

val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")

Am I missing something?

> And you do not want to access the entire data?

This works for static datasets. When new data is coming in via batch processes, the spark application has to be reloaded to pick up the new files in the folder.

> On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris wrote:
> > Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait :
> > > I wonder the differences accessing HIVE tables in two different ways:
> > > - with jdbc access
> > > - with sparkContext
> > [...]

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Hive From Spark: Jdbc VS sparkContext
Hi Nicolas,

What if the table has partitions and sub-partitions? And you do not want to access the entire data?

Regards,
Gourav

On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris wrote:
> Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait :
> > I wonder the differences accessing HIVE tables in two different ways:
> > - with jdbc access
> > - with sparkContext
>
> Well there is also a third way to access the hive data from spark:
> - with direct file access (here ORC format)
>
> For example:
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_people")
> people.createOrReplaceTempView("people")
> sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()
>
> This method looks much faster than both:
> - jdbc access
> - sparkContext
>
> Any experience on that?
Re: Hive From Spark: Jdbc VS sparkContext
Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait :
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext

Well there is also a third way to access the hive data from spark:
- with direct file access (here ORC format)

For example:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_people")
people.createOrReplaceTempView("people")
sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()

This method looks much faster than both:
- jdbc access
- sparkContext

Any experience on that?
Re: Hive From Spark: Jdbc VS sparkContext
My take on this might sound a bit different. Here are a few points to consider:

1. Going through Hive JDBC means that the application is restricted by the number of queries that can be compiled. HS2 can only compile one SQL statement at a time, and if users have bad SQL, it can take a long time just to compile (not map reduce). This reduces the query throughput, i.e. the number of queries you can fire through JDBC.

2. Going through Hive JDBC does have the advantage that the HMS service is protected. The JIRA https://issues.apache.org/jira/browse/HIVE-13884 does protect HMS from crashing, because at the end of the day retrieving metadata about a Hive table that may have millions, or simply put 1000s, of partitions hits the JVM limit on the array size it can hold for the retrieved metadata. The JVM array size limit is hit and HMS crashes. So in effect this is good to have, to protect HMS and the relational database on its back end. Note: the Hive community does propose to move the database to HBase, which scales, but I don't think this will get implemented soon.

3. Going through the SparkContext, it directly interfaces with the Hive MetaStore. I have tried to lay out the code flow below. The bit I didn't have time to dive into is that I believe if the table is really large, i.e. say the partitions in the table number more than 32K (the size of a short), then some sort of slicing does occur (I didn't have time to dive in and find this piece of code, but from experience this does seem to occur).

Code flow:
- Spark uses the Hive external catalog: goo.gl/7CZcDw
- The HiveClient version of getPartitions: goo.gl/ZAEsqQ
- The HiveClientImpl getPartitions: goo.gl/msPrr5
- The Hive call is made at: goo.gl/TB4NFU
- ThriftHiveMetastore.java -> get_partitions_ps_with_auth

A value of -1 is sent within Spark all the way through to the Hive Metastore thrift call. So in effect, for large tables, up to 32K partitions are retrieved at a time. This has also led to a few HMS crashes, but I am yet to identify if this is really the cause.

Based on the 3 points above, I would prefer to use SparkContext. If the cause of the crashes is indeed the retrieval of a high number of partitions, then I may opt for the JDBC route.

Thanks
Kabeer.

On Fri, 13 Oct 2017 09:22:37 +0200, Nicolas Paris wrote:
> > In case a table has a few
> > million records, it all goes through the driver.
>
> This sounds clear in JDBC mode: the driver gets all the rows and then
> spreads the RDD over the executors.
> [...]

-- Sent using Dekko from my Ubuntu device
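The slicing behaviour described in point 3 can be sketched as follows. Everything here is hypothetical (`fetch_partitions_chunked` and the fake metastore are illustrative stand-ins; the real Hive thrift call is get_partitions_ps_with_auth, whose max_parts argument is a Java short), but it shows how retrieving partition names in 32K slices avoids materialising all the metadata in one giant array:

```python
def fetch_partitions_chunked(list_partition_names, max_parts=32767):
    """Retrieve partition names from a metastore-like API in slices of at
    most max_parts (the size of a Java short), instead of a single call
    with max_parts=-1 that materialises everything at once."""
    all_parts, offset = [], 0
    while True:
        chunk = list_partition_names(offset, max_parts)
        all_parts.extend(chunk)
        if len(chunk) < max_parts:   # short chunk => no more partitions
            return all_parts
        offset += max_parts

# Demo with a fake metastore holding 70,000 partitions.
fake_store = [f"ds={i}" for i in range(70000)]

def fake_list(offset, limit):
    return fake_store[offset:offset + limit]

parts = fetch_partitions_chunked(fake_list)
print(len(parts))  # 70000
```

Whether HMS actually pages like this internally is exactly the open question above; the sketch only illustrates why a 32K slice size would bound the per-call array allocation.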
Re: Hive From Spark: Jdbc VS sparkContext
> In case a table has a few
> million records, it all goes through the driver.

This sounds clear in JDBC mode: the driver gets all the rows and then spreads the RDD over the executors.

I'd say that most use cases deal with SQL to aggregate huge datasets and retrieve a small amount of rows to then be transformed for ML tasks. Then using JDBC offers the robustness of HIVE to produce a small aggregated dataset in spark, while SPARK SQL uses RDDs to produce the small dataset from the huge one.

It is not very clear how SPARK SQL deals with huge HIVE tables. Does it load everything into memory and crash, or does this never happen?
Re: Hive From Spark: Jdbc VS sparkContext
Hi,

I do not think that SPARK will automatically determine the partitions. Actually it does not automatically determine the partitions. In case a table has a few million records, it all goes through the driver. Of course, I have only tried JDBC connections with AURORA, Oracle and Postgres.

Regards,
Gourav Sengupta

On Tue, Oct 10, 2017 at 10:14 PM, weand <andreas.we...@gmail.com> wrote:
> Is Hive from Spark via JDBC working for you? In case it does, I would be
> interested in your setup :-)
>
> We can't get this working. See bug here, especially my last comment:
> https://issues.apache.org/jira/browse/SPARK-21063
>
> Regards
> Andreas
RE: Hive From Spark: Jdbc VS sparkContext
I am able to connect to Spark via JDBC - tested with Squirrel. I am referencing all the jars of the current Spark distribution under /usr/hdp/current/spark2-client/jars/*

Thanks,
Reema

-----Original Message-----
From: weand [mailto:andreas.we...@gmail.com]
Sent: Tuesday, October 10, 2017 5:14 PM
To: user@spark.apache.org
Subject: Re: Hive From Spark: Jdbc VS sparkContext

Is Hive from Spark via JDBC working for you? In case it does, I would be interested in your setup :-)

We can't get this working. See bug here, especially my last comment:
https://issues.apache.org/jira/browse/SPARK-21063

Regards
Andreas
Re: Hive From Spark: Jdbc VS sparkContext
Is Hive from Spark via JDBC working for you? In case it does, I would be interested in your setup :-)

We can't get this working. See bug here, especially my last comment:
https://issues.apache.org/jira/browse/SPARK-21063

Regards
Andreas

-- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
Re: Hive From Spark: Jdbc VS sparkContext
That is not correct, IMHO. If I am not wrong, Spark will still load data in the executors, by running some stats on the data itself to identify partitions.

On Tue, Oct 10, 2017 at 9:23 PM, 郭鹏飞 wrote:
> > On Oct 4, 2017, at 2:08 AM, Nicolas Paris wrote:
> >
> > Hi
> >
> > I wonder the differences accessing HIVE tables in two different ways:
> > - with jdbc access
> > - with sparkContext
> > [...]
>
> The jdbc will load data into the driver node; this may slow down the
> speed, and may OOM.

-- Best Regards, Ayan Guha
Re: Hive From Spark: Jdbc VS sparkContext
> On Oct 4, 2017, at 2:08 AM, Nicolas Paris wrote:
>
> Hi
>
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
>
> I would say that jdbc is better since it uses HIVE, which is based on
> map-reduce / TEZ and then works on disk.
> Using spark rdd can lead to memory errors on very huge datasets.
>
> Anybody knows or can point me to relevant documentation ?

The jdbc will load data into the driver node; this may slow down the speed, and may OOM.
Re: Hive From Spark: Jdbc VS sparkContext
Well, the obvious point is security. Ranger and Sentry can secure jdbc endpoints only. As for the performance aspect, I am equally curious.

On Wed, 4 Oct 2017 at 10:30 pm, Gourav Sengupta wrote:
> Hi,
>
> I am genuinely curious to see whether anyone responds to this question.
>
> It's very hard to shake off JAVA, OOP and JDBCs :)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Oct 3, 2017 at 7:08 PM, Nicolas Paris wrote:
> > Hi
> >
> > I wonder the differences accessing HIVE tables in two different ways:
> > - with jdbc access
> > - with sparkContext
> > [...]

-- Best Regards, Ayan Guha
Re: Hive From Spark: Jdbc VS sparkContext
Hi,

I am genuinely curious to see whether anyone responds to this question.

It's very hard to shake off JAVA, OOP and JDBCs :)

Regards,
Gourav Sengupta

On Tue, Oct 3, 2017 at 7:08 PM, Nicolas Paris wrote:
> Hi
>
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
>
> I would say that jdbc is better since it uses HIVE, which is based on
> map-reduce / TEZ and then works on disk.
> Using spark rdd can lead to memory errors on very huge datasets.
>
> Anybody knows or can point me to relevant documentation ?
Hive From Spark: Jdbc VS sparkContext
Hi

I wonder about the differences between accessing HIVE tables in two different ways:
- with jdbc access
- with sparkContext

I would say that jdbc is better since it uses HIVE, which is based on map-reduce / TEZ and then works on disk. Using spark rdd can lead to memory errors on very huge datasets.

Anybody knows or can point me to relevant documentation?
Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Hi CLOSE_WAIT! According to this <https://access.redhat.com/solutions/437133> link - CLOSE_WAIT - Indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed .So this essentially means that his is a state where socket is waiting for the application to execute close() . A socket can be in CLOSE_WAIT state indefinitely until the application closes it. Faulty scenarios would be like file descriptor leak, server not being execute close() on socket leading to pile up of close_wait sockets - The CLOSE_WAIT status means that the other side has initiated a connection close, but the application on the local side has not yet closed the socket Normally it should be LISTEN or ESTABLISHED. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On 17 September 2016 at 16:14, <anupama.gangad...@daimler.com> wrote: > Hi, > > > > Yes. I am able to connect to Hive from simple Java program running in the > cluster. When using spark-submit I faced the issue. > > The output of command is given below > > > > $> netstat -alnp |grep 10001 > > (Not all processes could be identified, non-owned process info > > will not be shown, you would have to be root to see it all.) 
> > tcp 1 0 53.244.194.223:25612 53.244.194.221:10001 CLOSE_WAIT -
> > Thanks
> > Anupama
> > *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com] *Sent:* Saturday, September 17, 2016 12:36 AM *To:* Gangadhar, Anupama (623) *Cc:* user @spark *Subject:* Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
> > Is your Hive Thrift Server up and running on port jdbc:hive2://10001?
> > Do the following
> > netstat -alnp |grep 10001
> > and see whether it is actually running
> > HTH
> > Dr Mich Talebzadeh
> > On 16 September 2016 at 19:53, <anupama.gangad...@daimler.com> wrote:
> > Hi,
> > I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine.
> > Am I missing any configuration parameters?
> > java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
> >   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
> >   at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
> >   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
> >   at java.sql.DriverManager.getConnection(DriverManager.java:571)
> >   at java.sql.DriverManager.getConnection(DriverManager.java:215)
> >   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
> >   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
> >   at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
> >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
> >   at org.apache.spark.r
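The CLOSE_WAIT diagnosis above amounts to counting socket states in netstat output. A small helper in that spirit, to spot a CLOSE_WAIT pile-up (a sketch; the parsing assumes `netstat -an`/`ss`-style lines where the TCP state appears as its own whitespace-separated field):

```python
from collections import Counter

# States that can appear in netstat/ss output for TCP sockets.
TCP_STATES = {"LISTEN", "ESTABLISHED", "CLOSE_WAIT", "TIME_WAIT", "FIN_WAIT1",
              "FIN_WAIT2", "SYN_SENT", "SYN_RECV", "LAST_ACK", "CLOSING"}

def socket_states(netstat_output):
    """Count TCP states seen in netstat/ss output; a large CLOSE_WAIT count
    suggests the local application never calls close() (fd leak)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        for field in line.split():
            if field in TCP_STATES:
                counts[field] += 1
    return counts
```

Run against `netstat -an` output on the client host: a single CLOSE_WAIT as in the thread is harmless, but a steadily growing count is the faulty scenario described in the Red Hat article.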
RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Hi, Yes. I am able to connect to Hive from a simple Java program running in the cluster. When using spark-submit I faced the issue. The output of the command is given below:
$> netstat -alnp |grep 10001
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
tcp 1 0 53.244.194.223:25612 53.244.194.221:10001 CLOSE_WAIT -
Thanks Anupama
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Saturday, September 17, 2016 12:36 AM To: Gangadhar, Anupama (623) Cc: user @spark Subject: Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Is your Hive Thrift Server up and running on port jdbc:hive2://10001? Do the following netstat -alnp |grep 10001 and see whether it is actually running HTH Dr Mich Talebzadeh
On 16 September 2016 at 19:53, <anupama.gangad...@daimler.com> wrote: Hi, I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine. Am I missing any configuration parameters?
java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
  at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
  at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
  at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
  at java.sql.DriverManager.getConnection(DriverManager.java:571)
  at java.sql.DriverManager.getConnection(DriverManager.java:215)
  at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
  at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
  at org.apache.spark.scheduler.Task.run(Task.scala:70)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by:
org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182) at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258) at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInform
RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Hi, @Deepak I have used a separate user keytab (not the hadoop services keytab) and am able to connect to Hive via a simple Java program. I am able to connect to Hive from spark-shell as well. However, when I submit a Spark job using this same keytab, I see the issue. Does the cache have a role to play here? In the cluster, the transport mode is http and SSL is disabled. Thanks Anupama
From: Deepak Sharma [mailto:deepakmc...@gmail.com] Sent: Saturday, September 17, 2016 8:35 AM To: Gangadhar, Anupama (623) Cc: spark users Subject: Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Hi Anupama, To me it looks like an issue with the SPN with which you are trying to connect to hive2, i.e. hive@hostname. Are you able to connect to hive from spark-shell? Try getting the ticket using any other user keytab, but not the hadoop services keytab, and then try running the spark submit. Thanks Deepak
On 17 Sep 2016 12:23 am, <anupama.gangad...@daimler.com> wrote: Hi, I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine. Am I missing any configuration parameters?
java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
  at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
  at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
  at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
  at java.sql.DriverManager.getConnection(DriverManager.java:571)
  at java.sql.DriverManager.getConnection(DriverManager.java:215)
  at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
  at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
  at org.apache.spark.scheduler.Task.run(Task.scala:70)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by:
org.apache.thrift.transport.TTransportException
  at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
  at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
  at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
  at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
  ... 21 more
In spark conf directory hive-site.xm
Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Hi Anupama, To me it looks like an issue with the SPN with which you are trying to connect to hive2, i.e. hive@hostname. Are you able to connect to hive from spark-shell? Try getting the ticket using any other user keytab, but not the hadoop services keytab, and then try running the spark submit. Thanks Deepak
On 17 Sep 2016 12:23 am, <anupama.gangad...@daimler.com> wrote:
> Hi,
> I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine.
> Am I missing any configuration parameters?
> java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
>   at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
>   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>   at java.sql.DriverManager.getConnection(DriverManager.java:571)
>   at java.sql.DriverManager.getConnection(DriverManager.java:215)
>   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>   at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.thrift.transport.TTransportException
>   at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>   at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
>   at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
>   at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
>   ... 21 more
> In spark conf directory hive-site.xml has the following properties
> hive.metastore.kerberos.keytab.file >
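Deepak's suggestion, authenticating with the end user's keytab rather than the hive service keytab, can be sketched as a spark-submit invocation. spark-submit's --principal/--keytab options (which I believe are available for YARN deployments since around Spark 1.4) let the application log in and renew tickets as that user on the cluster. The principal and paths below are hypothetical:

```python
def kerberized_submit(app_jar, principal, keytab, master="yarn-cluster"):
    """Assemble a spark-submit argv that logs in from a *user* keytab on YARN,
    so executors get valid delegation tokens for Hive/HDFS."""
    return ["spark-submit",
            "--master", master,
            "--principal", principal,   # end-user principal, e.g. alice@EXAMPLE.COM
            "--keytab", keytab,         # user keytab, NOT hive.service.keytab
            app_jar]

# Typical use (hypothetical names):
#   import subprocess
#   subprocess.call(kerberized_submit("my-hive-job.jar",
#                                     "alice@EXAMPLE.COM",
#                                     "/home/alice/alice.keytab"))
```

This matches the pattern in the thread: a plain Java program works because the client's own ticket cache is visible, while a yarn-cluster job fails unless Spark is told how to re-authenticate on the remote containers.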
Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Is your Hive Thrift Server up and running on port jdbc:hive2://10001? Do the following
netstat -alnp |grep 10001
and see whether it is actually running. HTH Dr Mich Talebzadeh
On 16 September 2016 at 19:53, <anupama.gangad...@daimler.com> wrote:
> Hi,
> I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine.
> Am I missing any configuration parameters?
> java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
>   at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
>   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>   at java.sql.DriverManager.getConnection(DriverManager.java:571)
>   at java.sql.DriverManager.getConnection(DriverManager.java:215)
>   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>   at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
>   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.thrift.transport.TTransportException
>   at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>   at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
>   at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
>   at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInf
Error trying to connect to Hive from Spark (Yarn-Cluster Mode)
Hi, I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine. Am I missing any configuration parameters?
java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
  at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
  at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
  at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
  at java.sql.DriverManager.getConnection(DriverManager.java:571)
  at java.sql.DriverManager.getConnection(DriverManager.java:215)
  at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
  at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
  at org.apache.spark.scheduler.Task.run(Task.scala:70)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
  at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
  at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
  at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
  at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
  ... 21 more
In the spark conf directory, hive-site.xml has the following properties (XML markup restored; realms and the metastore host were lost in the archive and are left blank as received):
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/security/keytabs/hive.service.keytab</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@</value>
</property>
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://:9083</value>
</property>
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/hive.service.keytab</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@</value>
</property>
<property>
  <name>hive.server2.authentication.spnego.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
<property>
  <name>hive.server2.authentication.spnego.principal</name>
  <value>HTTP/_HOST@</value>
</property>
--Thank you If you are not the addressee, please inform us immediately that you have received this e-mail by mistake, and delete it.
We thank you for your support.
Unable To access Hive From Spark
Hi All, I am trying to access hive from Spark but am getting the exception: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
Code :-
String logFile = "hdfs://hdp23ha/logs"; // Should be some file on your system
System.setProperty("HADOOP_USER_NAME", "hadoop");
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://hdp23ha");
sc.hadoopConfiguration().set("hive.metastore.warehouse.dir", "/apps/hive/warehouse");
sc.hadoopConfiguration().set("hive.exec.local.scratchdir", "D://");
sc.hadoopConfiguration().set("dfs.nameservices", "hdp23ha");
sc.hadoopConfiguration().set("hive.exec.scratchdir", "/tmp/hive/");
sc.hadoopConfiguration().setInt("hive.exec.scratchdir.permission", 777);
sc.hadoopConfiguration().set("dfs.ha.namenodes.hdp23ha", "nn1,nn2");
sc.hadoopConfiguration().set("dfs.namenode.rpc-address.hdp23ha.nn1", "ambarimaster:8020");
sc.hadoopConfiguration().set("dfs.namenode.rpc-address.hdp23ha.nn2", "hdp231:8020");
sc.hadoopConfiguration().set("dfs.client.failover.proxy.provider.hdp23ha",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
JavaRDD<String> logData = sc.textFile(logFile).cache();
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS Logs (date STRING, msg STRING) STORED AS ORC");
JavaRDD<Logs> logsRDD = logData.map(new Function<String, com.upwork.sparketl.core.bean.Logs>() {
    public Logs call(String arg0) throws Exception {
        String array[] = arg0.split(",");
        Logs logs = new Logs(array[0], array[1]);
        return logs;
    }
});
// sc is an existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame schemaPeople = sqlContext.createDataFrame(logsRDD, Logs.class);
schemaPeople.registerTempTable("logs");
DataFrame results = sqlContext.sql("SELECT * FROM logs");
results.write().format("orc").saveAsTable("Logs");
Any help would be of great help. Thanks
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-To-access-Hive-From-Spark-tp26788.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
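The error quoted above is a permissions check on the HDFS scratch directory. The commonly applied fix is `hdfs dfs -chmod -R 777 /tmp/hive` on the cluster (or `winutils.exe chmod -R 777 \tmp\hive` on a Windows client). A small sketch of the kind of check involved; the 733 mask is an assumption based on Hive's default scratch-dir permission, not taken from this thread:

```python
def perm_to_octal(perm):
    """Convert an 'rwxrwxrwx'-style string (as shown in the error message,
    e.g. 'rw-rw-rw-') to its octal mode."""
    bits = {"r": 4, "w": 2, "x": 1, "-": 0}
    digits = [sum(bits[c] for c in perm[i:i + 3]) for i in (0, 3, 6)]
    return digits[0] * 64 + digits[1] * 8 + digits[2]

def scratch_dir_writable(perm):
    """True when owner/group/other each have at least the bits in 0o733
    (write+execute for group and other, full for owner)."""
    return perm_to_octal(perm) & 0o733 == 0o733
```

With the permissions from the error, `rw-rw-rw-` is 666: the execute bits are missing everywhere, so the directory fails the check even though it is nominally "writable", which is why chmod to 777 resolves it.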
create table in hive from spark-sql
Probably a noob question, but I am trying to create a hive table using spark-sql. Here is what I am trying to do:
hc = HiveContext(sc)
hdf = hc.parquetFile(output_path)
data_types = hdf.dtypes
schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) + ")"
hc.sql("CREATE TABLE IF NOT EXISTS example.foo " + schema)
There is already a database called "example" in hive, but I see an error: An error occurred while calling o35.sql. : org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://some_path/foo)
Also, I was wondering how to use the saveAsTable(..) construct: hdf.saveAsTable(tablename) tries to store into the default db? How do I specify the database name ("example" in this case) while trying to store this table? Thanks -- Mohit "When you want success as badly as you want the air, then you will get it. There is no other secret of success." -Socrates
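The schema-string construction from the message can be packaged as a small helper. This is a sketch using only plain Python (the table name `example.foo` and the column dtypes mirror the message; whether `saveAsTable` accepts a database-qualified name varies across Spark versions, so the `USE example` workaround is mentioned as an alternative):

```python
def ddl_from_dtypes(qualified_table, dtypes):
    """Build CREATE TABLE DDL from (column, type) pairs, in the shape
    DataFrame.dtypes returns them."""
    cols = ", ".join("{0} {1}".format(name, typ) for name, typ in dtypes)
    return "CREATE TABLE IF NOT EXISTS {0} ({1})".format(qualified_table, cols)

# With a HiveContext (hypothetical session objects):
#   hc.sql(ddl_from_dtypes("example.foo", hdf.dtypes))
# For the saveAsTable default-database problem, either qualify the name,
#   hdf.saveAsTable("example.foo")
# or switch the current database first:
#   hc.sql("USE example")
#   hdf.saveAsTable("foo")
```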
Re: No suitable driver found error, Create table in hive from spark sql
Hi Dhimant, I believe it will work if you change your spark-shell invocation to pass --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar instead of putting the jar in --jars. -Todd
On Wed, Feb 18, 2015 at 10:41 PM, Dhimant <dhimant84.jays...@gmail.com> wrote: Found a solution from one of the posts found on the internet. I updated spark/bin/compute-classpath.sh and added the database connector jar into the classpath. CLASSPATH=$CLASSPATH:/data/mysql-connector-java-5.1.14-bin.jar
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-suitable-driver-found-error-Create-table-in-hive-from-spark-sql-tp21714p21715.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
No suitable driver found error, Create table in hive from spark sql
No suitable driver found error, Create table in hive from spark sql. I am trying to execute the following example. SPARKGIT: spark/examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala
My setup: hadoop 1.6, spark 1.2, hive 1.0, mysql server (installed via yum install mysql55w mysql55w-server).
I can create tables in hive from the hive command prompt:

hive> select * from person_parquet;
OK
Barack Obama M
Bill Clinton M
Hillary Clinton F
Time taken: 1.945 seconds, Fetched: 3 row(s)

I am starting spark shell via the following command:

./spark-1.2.0-bin-hadoop2.4/bin/spark-shell --master spark://sparkmaster.company.com:7077 --jars /data/mysql-connector-java-5.1.14-bin.jar

scala> Class.forName("com.mysql.jdbc.Driver")
res0: Class[_] = class com.mysql.jdbc.Driver

scala> Class.forName("com.mysql.jdbc.Driver").newInstance
res1: Any = com.mysql.jdbc.Driver@2dec8e27

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@32ecf100

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
15/02/18 22:23:01 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
15/02/18 22:23:02 INFO parse.ParseDriver: Parse Completed
15/02/18 22:23:02 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/18 22:23:02 INFO metastore.ObjectStore: ObjectStore, initialize called
15/02/18 22:23:02 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/18 22:23:02 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/18 22:23:02 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/18 22:23:02 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/18 22:23:02
ERROR Datastore.Schema: Failed initialising database. No suitable driver found for jdbc:mysql://sparkmaster.company.com:3306/hive org.datanucleus.exceptions.NucleusDataStoreException: No suitable driver found for jdbc:mysql://sparkmaster.company.com:3306/hive at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) at org.datanucleus.store.rdbms.RDBMSStoreManager.init(RDBMSStoreManager.java:298) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
    at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133
Re: No suitable driver found error, Create table in hive from spark sql
Found the solution in a post on the internet: I updated spark/bin/compute-classpath.sh and added the database connector jar to the classpath:

CLASSPATH=$CLASSPATH:/data/mysql-connector-java-5.1.14-bin.jar

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-suitable-driver-found-error-Create-table-in-hive-from-spark-sql-tp21714p21715.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
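The failure mode above can be reproduced outside Spark with plain JDBC. A minimal sketch (the URL comes from the error message; the class and method names here are illustrative, not part of the original thread): java.sql.DriverManager can only resolve a jdbc:mysql:// URL when a registered driver on the classpath accepts it, which is exactly what the missing connector jar breaks inside DataNucleus.

```java
import java.sql.DriverManager;
import java.sql.SQLException;

public class DriverCheck {
    // Returns true only if some registered JDBC driver accepts the URL.
    // Without mysql-connector-java on the classpath, the mysql URL fails
    // with SQLException "No suitable driver" -- the same error DataNucleus
    // surfaces while opening the Hive metastore connection.
    static boolean driverAvailable(String url) {
        try {
            DriverManager.getDriver(url);
            return true;
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With no MySQL connector on the classpath this prints false.
        System.out.println(driverAvailable("jdbc:mysql://sparkmaster.company.com:3306/hive"));
    }
}
```

Adding the connector jar to the classpath (as done above, or via --jars / the driver classpath) is what makes this check flip to true.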
RE: Hive From Spark
Hi Du,

I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of SPARK-2420, and it's already resolved in Spark 1.1.0. It looks like SPARK-2420 will be released in Spark 1.2.0 according to the current JIRA status.

I'm tracking branch-1.1 instead of the master and haven't seen the results merged. I'm still seeing guava 14.0.1, so I don't think SPARK-2848 has been merged yet. It would be great to have someone confirm or clarify the expectation.

From: l...@yahoo-inc.com.INVALID
To: van...@cloudera.com; alee...@hotmail.com
CC: user@spark.apache.org
Subject: Re: Hive From Spark
Date: Sat, 23 Aug 2014 00:08:47 +

I thought the fix had been pushed to the apache master, ref. commit "[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my previous email was based on my own build of the apache master, which turned out not to be working yet. Marcelo: please correct me if I got that commit wrong.

Thanks,
Du

On 8/22/14, 11:41 AM, Marcelo Vanzin van...@cloudera.com wrote:

SPARK-2420 is fixed. I don't think it will be in 1.1, though - might be too risky at this point. I'm not familiar with spark-sql.

On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee alee...@hotmail.com wrote:

Hopefully there could be some progress on SPARK-2420. It looks like shading may be the voted solution over downgrading. Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark 1.1.2?

By the way, regarding bin/spark-sql: is this more of a debugging tool for Spark jobs integrating with Hive? How do people use spark-sql? I'm trying to understand the rationale and motivation behind this script, any idea?

Date: Thu, 21 Aug 2014 16:31:08 -0700
Subject: Re: Hive From Spark
From: van...@cloudera.com
To: l...@yahoo-inc.com.invalid
CC: user@spark.apache.org; u...@spark.incubator.apache.org; pwend...@gmail.com

Hi Du,

I don't believe the Guava change has made it to the 1.1 branch.
The Guava doc says hashInt was added in 12.0, so what's probably happening is that you have an old version of Guava in your classpath before the Spark jars. (Hadoop ships with Guava 11, so that may be the source of your problem.)

On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid wrote:

Hi,

This guava dependency conflict problem should have been fixed as of yesterday according to https://issues.apache.org/jira/browse/SPARK-2420

However, I just got

java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;

from the following code snippet and "mvn3 test" on Mac. I built the latest version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local maven repo. In my pom file I explicitly excluded guava from almost all possible dependencies, such as spark-hive_2.10-1.1.0-SNAPSHOT and hadoop-client. This snippet is abstracted from a larger project, so the pom.xml includes many dependencies although not all are required by this snippet. The pom.xml is attached. Does anybody know how to fix it?
Thanks,
Du

---
package com.myself.test

import org.scalatest._
import org.apache.hadoop.io.{NullWritable, BytesWritable}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

class MyRecord(name: String) extends Serializable {
  def getWritable(): BytesWritable = {
    new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
  }

  final override def equals(that: Any): Boolean = {
    if (!that.isInstanceOf[MyRecord]) false
    else {
      val other = that.asInstanceOf[MyRecord]
      this.getWritable == other.getWritable
    }
  }
}

class MyRecordTestSuite extends FunSuite {
  // construct a MyRecord by Consumer.schema
  val rec: MyRecord = new MyRecord("James Bond")

  test("generated SequenceFile should be readable from spark") {
    val path = "./testdata/"
    val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with Hive")
    conf.set("spark.driver.host", "localhost")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(Seq(rec))
    rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
      .saveAsSequenceFile(path)
    val bytes = sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable]).first._2
    assert(rec.getWritable() == bytes)
    sc.stop()
    System.clearProperty("spark.driver.port")
  }
}

From: Andrew Lee alee...@hotmail.com
Reply-To: user@spark.apache.org
Date: Monday, July 21, 2014 at 10:27 AM
To: user@spark.apache.org, u...@spark.incubator.apache.org
Subject: RE: Hive From Spark

Hi All,

Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't work due to the following 2 libraries which are not consistent with Hive
Re: Hive From Spark
I thought the fix had been pushed to the apache master, ref. commit "[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my previous email was based on my own build of the apache master, which turned out not to be working yet. Marcelo: please correct me if I got that commit wrong.

Thanks,
Du

On 8/22/14, 11:41 AM, Marcelo Vanzin van...@cloudera.com wrote:

SPARK-2420 is fixed. I don't think it will be in 1.1, though - might be too risky at this point. I'm not familiar with spark-sql.

On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee alee...@hotmail.com wrote:

Hopefully there could be some progress on SPARK-2420. It looks like shading may be the voted solution over downgrading. Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark 1.1.2?

By the way, regarding bin/spark-sql: is this more of a debugging tool for Spark jobs integrating with Hive? How do people use spark-sql? I'm trying to understand the rationale and motivation behind this script, any idea?
Re: Hive From Spark
Hi,

This guava dependency conflict problem should have been fixed as of yesterday according to https://issues.apache.org/jira/browse/SPARK-2420

However, I just got

java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;

from the following code snippet and "mvn3 test" on Mac. I built the latest version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local maven repo. In my pom file I explicitly excluded guava from almost all possible dependencies, such as spark-hive_2.10-1.1.0-SNAPSHOT and hadoop-client. This snippet is abstracted from a larger project, so the pom.xml includes many dependencies although not all are required by this snippet. The pom.xml is attached. Does anybody know how to fix it?

Thanks,
Du

---
package com.myself.test

import org.scalatest._
import org.apache.hadoop.io.{NullWritable, BytesWritable}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

class MyRecord(name: String) extends Serializable {
  def getWritable(): BytesWritable = {
    new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
  }

  final override def equals(that: Any): Boolean = {
    if (!that.isInstanceOf[MyRecord]) false
    else {
      val other = that.asInstanceOf[MyRecord]
      this.getWritable == other.getWritable
    }
  }
}

class MyRecordTestSuite extends FunSuite {
  // construct a MyRecord by Consumer.schema
  val rec: MyRecord = new MyRecord("James Bond")

  test("generated SequenceFile should be readable from spark") {
    val path = "./testdata/"
    val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with Hive")
    conf.set("spark.driver.host", "localhost")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(Seq(rec))
    rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
      .saveAsSequenceFile(path)
    val bytes = sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable]).first._2
    assert(rec.getWritable() == bytes)
    sc.stop()
    System.clearProperty("spark.driver.port")
  }
}
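The per-dependency Guava exclusions Du describes would look roughly like this in a pom.xml. This is only a sketch under the assumption of the artifact names he mentions; the attached pom itself is not reproduced in the archive:

```xml
<!-- Exclude Guava from a dependency so only one version reaches the classpath.
     Repeat the <exclusions> block for hadoop-client and other transitive
     sources of Guava. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.10</artifactId>
  <version>1.1.0-SNAPSHOT</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Note that exclusions only control what Maven puts on the test classpath; at runtime on a cluster, whatever Guava ships with Hadoop can still shadow the excluded one.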
From: Andrew Lee alee...@hotmail.com
Reply-To: user@spark.apache.org
Date: Monday, July 21, 2014 at 10:27 AM
To: user@spark.apache.org, u...@spark.incubator.apache.org
Subject: RE: Hive From Spark

Hi All,

Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't work due to the following 2 libraries, which are not consistent with Hive 0.12 and Hadoop as well (Hive libs align with Hadoop libs, and as a common practice they should be consistent to be inter-operable). These are under discussion in the 2 JIRA tickets:

https://issues.apache.org/jira/browse/HIVE-7387
https://issues.apache.org/jira/browse/SPARK-2420

When I ran the command by tweaking the classpath and build for Spark 1.0.1-rc3, I was able to create a table through HiveContext; however, when I fetch the data, it breaks due to incompatible API calls in Guava. This is critical since it needs to map the columns to the RDD schema.

Hive and Hadoop are using an older version of the guava libraries (11.0.1) whereas Spark Hive is using guava 14.0.1+. The community isn't willing to downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12. Be aware of the protobuf version as well in Hive 0.12 (it uses protobuf 2.4).
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala> // Queries are expressed in HiveQL
scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102
Re: Hive From Spark
Hi Du,

I don't believe the Guava change has made it to the 1.1 branch. The Guava doc says hashInt was added in 12.0, so what's probably happening is that you have an old version of Guava in your classpath before the Spark jars. (Hadoop ships with Guava 11, so that may be the source of your problem.)

On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid wrote:

Hi,

This guava dependency conflict problem should have been fixed as of yesterday according to https://issues.apache.org/jira/browse/SPARK-2420

However, I just got

java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;

from the following code snippet and "mvn3 test" on Mac. I built the latest version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local maven repo. In my pom file I explicitly excluded guava from almost all possible dependencies, such as spark-hive_2.10-1.1.0-SNAPSHOT and hadoop-client. This snippet is abstracted from a larger project, so the pom.xml includes many dependencies although not all are required by this snippet. The pom.xml is attached. Does anybody know how to fix it?

Thanks,
Du
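The classpath-ordering problem Marcelo describes can be checked directly from any JVM, including a spark-shell driver: ask where a class was actually loaded from. A minimal sketch (the class and method names are mine, not from the thread):

```java
import java.security.CodeSource;

public class WhichJar {
    // Returns the jar (or directory) a class was loaded from. In a Spark
    // driver, passing "com.google.common.hash.HashFunction" reveals whether
    // Guava resolves to Hadoop's guava-11 jar or Spark's guava-14 jar.
    static String locate(String className) throws ClassNotFoundException {
        Class<?> c = Class.forName(className);
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes from the JDK's bootstrap loader have no CodeSource.
        return src == null ? "bootstrap/JDK" : src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // A JDK class resolves to the bootstrap loader; an application or
        // dependency class resolves to its jar's file path.
        System.out.println(locate("java.lang.String"));
    }
}
```

If the printed location for HashFunction were a Hadoop lib directory rather than the Spark assembly, that would confirm the "old Guava wins" diagnosis.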
RE: Hive From Spark
Hi Sean,

Thanks for clarifying. I re-read SPARK-2420 and now have a better understanding.

From a user perspective, what would you recommend for building Spark with Hive 0.12 / 0.13+ libraries moving forward and deploying to a production cluster that runs on an older version of Hadoop (e.g. 2.2 or 2.4)?

My concern is that there's going to be a lag in technology adoption, and since Spark is moving fast, its libraries may always be newer. Protobuf is one good example of shading. From a business point of view, if there is no benefit to upgrading a library, the chance that this will happen with a higher priority is low, due to stability concerns and the cost of re-running the entire test suite. Just by observation, there are still a lot of people running Hadoop 2.2 instead of 2.4 or 2.5, and releases and upgrades depend on the big players such as Cloudera, Hortonworks, etc. for their distros. Not to mention the process of upgrading.

Is there any benefit to using Guava 14 in Spark? I believe there is usually some competitive reason why Spark chose Guava 14; however, I'm not sure if anyone raised that in the conversation, so I don't know if that is necessary.

Looking forward to seeing Hive on Spark work soon. Please let me know if there's any help or feedback I can provide. Thanks Sean.

From: so...@cloudera.com
Date: Mon, 21 Jul 2014 18:36:10 +0100
Subject: Re: Hive From Spark
To: user@spark.apache.org

I haven't seen anyone actively 'unwilling' -- I hope not. See the discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I sketch what a downgrade means. I think it just hasn't gotten a looking-over.

Contrary to what I thought earlier, the conflict does in fact cause problems in theory, and you show it causes a problem in practice. Not to mention it causes issues for Hive-on-Spark now.

On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee alee...@hotmail.com wrote:

Hive and Hadoop are using an older version of the guava libraries (11.0.1) whereas Spark Hive is using guava 14.0.1+.
The community isn't willing to downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12.
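For reference, the "shading" option discussed in this thread is what SPARK-2848 eventually did: relocate Guava's packages inside the uber-jar so Hadoop's Guava 11 and Spark's Guava 14 can coexist. With the Maven shade plugin the relevant configuration looks roughly like this (a sketch, not Spark's actual build file; the shadedPattern is an assumption):

```xml
<!-- Rewrite com.google.common.* bytecode references inside the shaded jar
     so Spark's Guava no longer collides with the one Hadoop provides. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.spark-project.guava</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

After relocation, Spark's own classes call the renamed copy, while Hadoop and Hive keep resolving com.google.common to their Guava 11 without conflict.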
RE: Hive From Spark
$$iwC.<init>(<console>:19)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC.<init>(<console>:30)
    at $iwC.<init>(<console>:32)
    at <init>(<console>:34)
    at .<init>(<console>:38)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

From: hao.ch...@intel.com
To: user@spark.apache.org; u...@spark.incubator.apache.org
Subject: RE: Hive From Spark
Date: Mon, 21 Jul 2014 01:14:19 +

JiaJia,

I've checked out the latest 1.0 branch and then done the following steps:

SPARK_HIVE=true sbt/sbt clean assembly
cd examples
../bin/run-example sql.hive.HiveFromSpark

It works well on my local machine. Your log output shows "Invalid method name: 'get_table'", which looks like an incompatible jar version or something wrong between the Hive metastore service and client. Can you double-check the jar versions of the Hive metastore service or thrift?

-----Original Message-----
From: JiajiaJing [mailto:jj.jing0...@gmail.com]
Sent: Saturday, July 19, 2014 7:29 AM
To: u...@spark.incubator.apache.org
Subject: RE: Hive From Spark

Hi Cheng Hao,

Thank you very much for your reply.

Basically, the program runs on Spark 1.0.0 and Hive 0.12.0. Some setup of the environment is done by running SPARK_HIVE=true sbt/sbt assembly/assembly, including the jar on all the workers, and copying hive-site.xml to spark's conf dir. The program is then run as:

./bin/run-example org.apache.spark.examples.sql.hive.HiveFromSpark

It's good to know that this example runs well on your machine. Could you please give me some insight about what you have done as well?

Thank you very much!

Jiajia

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
Re: Hive From Spark
I haven't seen anyone actively 'unwilling' -- I hope not. See the discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I sketch what a downgrade means. I think it just hasn't gotten a looking-over.

Contrary to what I thought earlier, the conflict does in fact cause problems in theory, and you show it causes a problem in practice. Not to mention it causes issues for Hive-on-Spark now.

On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee alee...@hotmail.com wrote:

Hive and Hadoop are using an older version of the guava libraries (11.0.1) whereas Spark Hive is using guava 14.0.1+. The community isn't willing to downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12.
RE: Hive From Spark
JiaJia,

I've checked out the latest 1.0 branch and then done the following steps:

SPARK_HIVE=true sbt/sbt clean assembly
cd examples
../bin/run-example sql.hive.HiveFromSpark

It works well on my local machine. Your log output shows "Invalid method name: 'get_table'", which looks like an incompatible jar version or something wrong between the Hive metastore service and client. Can you double-check the jar versions of the Hive metastore service or thrift?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
RE: Hive From Spark
Hi Cheng Hao,

Thank you very much for your reply.

Basically, the program runs on Spark 1.0.0 and Hive 0.12.0. Some setup of the environment is done by running SPARK_HIVE=true sbt/sbt assembly/assembly, including the jar on all the workers, and copying hive-site.xml to spark's conf dir. The program is then run as:

./bin/run-example org.apache.spark.examples.sql.hive.HiveFromSpark

It's good to know that this example runs well on your machine. Could you please give me some insight about what you have done as well?

Thank you very much!

Jiajia

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
Hive From Spark
Hello Spark Users,

I am new to Spark SQL and am now trying to get the HiveFromSpark example working first. However, I got the following error when running the HiveFromSpark.scala program. May I get some help on this please?

ERROR MESSAGE:

org.apache.thrift.TApplicationException: Invalid method name: 'get_table'
    at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at $Proxy9.getTable(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:186)
    at
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:160)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:247)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at HiveFromSpark$.main(HiveFromSpark.scala:38)
    at HiveFromSpark.main(HiveFromSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thank you very much!

JJing

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110.html