Dear Felix, Rishikesh, and list,
Thank you very much for your previous help. So far I have tried two ways to
trigger Spark SQL: one is to use R with the sparklyr and SparkR libraries; the
other is to use the SparkR shell that ships with Spark. I am not connecting to
a remote Spark cluster, only to a local one. Both attempts failed, with or
without hive-site.xml. I suspect the content of the hive-site.xml I found
online is not appropriate for this case, because the Spark session cannot be
initialized after adding it. My questions are:
1. Is there an example of hive-site.xml content for this case?
2. I used the sql() function to call Spark SQL (a sketch of the pattern I am
following is right below); is this the right way to do it?
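To make question 2 concrete, the pattern I am trying to follow is roughly the
following (based on my reading of the SparkR documentation; the SPARK_HOME
path is just my local install, and I may well be misusing something here):

library(SparkR)

# Point SparkR at the local Spark installation (local path on my machine).
Sys.setenv(SPARK_HOME = "/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")

# Start the session; enableHiveSupport should let sql() go through the Hive metastore.
sparkR.session(sparkHome = Sys.getenv("SPARK_HOME"), enableHiveSupport = TRUE)

# One statement per sql() call.
sql("CREATE DATABASE IF NOT EXISTS learnsql")
sql("USE learnsql")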
###################################
##Here is the content of my hive-site.xml:##
###################################
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.76.100:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123</value>
<description>password to use against metastore database</description>
</property>
</configuration>
################################
##Here is what happened in R:##
################################
> library(sparklyr) # load sparklyr package
> sc=spark_connect(master="local",spark_home="/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
> # connect sparklyr with spark
> sql('create database learnsql')
Error in sql("create database learnsql") : could not find function "sql"
> library(SparkR)
Attaching package: ‘SparkR’
The following object is masked from ‘package:sparklyr’:
collect
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var, window
The following objects are masked from ‘package:base’:
as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind,
sample, startsWith, subset, summary, transform, union
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
Spark not found in SPARK_HOME:
Spark package found in SPARK_HOME:
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7
Launching java with spark-submit command
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7/bin/spark-submit
sparkr-shell
/var/folders/d8/7j6xswf92c3gmhwy_lrk63pm0000gn/T//Rtmpz22kK9/backend_port103d4cfcfd2c
19/06/08 11:14:57 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Error in handleErrors(returnStatus, conn) :
... (hundreds of lines of output and errors omitted here) ...
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
###################################
##Here is what happened in the SparkR shell:##
####################################
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:145)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:144)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:80)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:79)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
  at scala.collection.Iterator$class.foreach(Iterator.sca
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
Thank you very much.
YA
> On Jun 8, 2019, at 1:44 AM, Rishikesh Gawade <[email protected]> wrote:
>
> Hi.
> 1. Yes, you can connect to Spark via R. If you are connecting to a remote
> Spark cluster, then you'll need EITHER a Spark binary along with hive-site.xml
> in its config directory on the machine running R, OR a Livy server installed on
> the cluster. You can then go on to use sparklyr, which, although it has almost
> the same functions as SparkR, is recommended over the latter.
> For the first method mentioned above, use
> sc <- sparklyr::spark_connect(master = "yarn-client", spark_home =
> Sys.getenv("SPARK_HOME"), conf = spark_config())
> For the second method, use
> sc <- sparklyr::spark_connect( master = "livyserverIP:port", method = "livy",
> conf = livy_config(conf = spark_config(), username = "foo", password = "bar"))
>
> 2. The reason that you're not getting the desired result could be that
> hive-site.xml is missing. To be able to connect to Hive from
> spark-shell/spark-submit/SparkR/sparklyr and perform SQL operations, you need
> to have hive-site.xml in the $SPARK_HOME/conf directory. This hive-site.xml
> should contain one and only one configuration property, namely
> 'hive.metastore.uris', as sketched below.
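> A minimal hive-site.xml in that spirit would look roughly like the following
> (the thrift host and port are placeholders for your own Hive metastore
> service, not values taken from your setup):
> <configuration>
>   <property>
>     <name>hive.metastore.uris</name>
>     <!-- placeholder: point this at the machine running the Hive metastore service -->
>     <value>thrift://your-metastore-host:9083</value>
>   </property>
> </configuration>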
>
> 3. In the case of the spark-sql shell, it should work after putting the
> aforementioned hive-site.xml in Spark's conf directory. If it still doesn't
> work, then please check the syntax of your statements (see the sketch below).
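> Judging from the parse error in your mail below, Spark SQL's parser rejects
> the NOT NULL column constraints and the integer(5)/integer(10) types, so a
> stripped-down CREATE TABLE along these lines should at least parse (a sketch,
> not tested against your setup; phone and pager are kept as strings here):
> CREATE TABLE employee_tbl (
>   emp_id VARCHAR(10),
>   emp_name CHAR(10),
>   emp_st_addr CHAR(10),
>   emp_city CHAR(10),
>   emp_st CHAR(10),
>   emp_zip INT,
>   emp_phone STRING,
>   emp_pager STRING
> );
> (Note also that the INSERT statement in your mail supplies seven values for an
> eight-column table, which will fail even once the table exists.)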
>
> Regards,
> Rishikesh Gawade
>
>
> On Thu, Jun 6, 2019, 12:18 PM ya <[email protected]> wrote:
> Dear list,
>
> I am trying to use Spark SQL from R. I have the following questions; could
> you give me some advice, please? Thank you very much.
>
> 1. I connect R and Spark using the SparkR library; perhaps some members here
> are also R users? Do I understand correctly that Spark SQL can be connected to
> and triggered via SparkR and used from within R (not in Spark's SparkR
> shell)?
>
> 2. I loaded the SparkR library in R and tried to create a new SQL database
> and a table, but I could not get the database and the table I want. The code
> looks like this:
>
> library(SparkR)
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
> sql("create database learnsql; use learnsql")
> sql("
> create table employee_tbl
> (emp_id varchar(10) not null,
> emp_name char(10) not null,
> emp_st_addr char(10) not null,
> emp_city char(10) not null,
> emp_st char(10) not null,
> emp_zip integer(5) not null,
> emp_phone integer(10) null,
> emp_pager integer(10) null);
> insert into employee_tbl values ('0001','john','yanlanjie
> 1','gz','jiaoqiaojun','510006','1353');
> select*from employee_tbl;
> ")
>
> I ran the following code in the spark-sql shell. I get the database learnsql;
> however, I still can't get the table.
>
> spark-sql> create database learnsql;show databases;
> 19/06/06 14:42:36 INFO HiveMetaStore: 0: create_database:
> Database(name:learnsql, description:,
> locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
> 19/06/06 14:42:36 INFO audit: ugi=ya ip=unknown-ip-addr
> cmd=create_database: Database(name:learnsql, description:,
> locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
> Error in query: org.apache.hadoop.hive.metastore.api.AlreadyExistsException:
> Database learnsql already exists;
>
> spark-sql> create table employee_tbl
> > (emp_id varchar(10) not null,
> > emp_name char(10) not null,
> > emp_st_addr char(10) not null,
> > emp_city char(10) not null,
> > emp_st char(10) not null,
> > emp_zip integer(5) not null,
> > emp_phone integer(10) null,
> > emp_pager integer(10) null);
> Error in query:
> no viable alternative at input 'create table employee_tbl\n(emp_id
> varchar(10) not'(line 2, pos 20)
>
> == SQL ==
> create table employee_tbl
> (emp_id varchar(10) not null,
> --------------------^^^
> emp_name char(10) not null,
> emp_st_addr char(10) not null,
> emp_city char(10) not null,
> emp_st char(10) not null,
> emp_zip integer(5) not null,
> emp_phone integer(10) null,
> emp_pager integer(10) null)
>
> spark-sql> insert into employee_tbl values ('0001','john','yanlanjie
> 1','gz','jiaoqiaojun','510006','1353');
> 19/06/06 14:43:43 INFO HiveMetaStore: 0: get_table : db=default
> tbl=employee_tbl
> 19/06/06 14:43:43 INFO audit: ugi=ya ip=unknown-ip-addr cmd=get_table
> : db=default tbl=employee_tbl
> Error in query: Table or view not found: employee_tbl; line 1 pos 0
>
>
> Does Spark SQL have a different syntax? What did I miss?
>
> Thank you very much.
>
> Best regards,
>
> YA
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>