Dear Felix, Rishikesh, and list,
Thank you very much for your previous help. So far I have tried two ways to
trigger Spark SQL: one is to use R with the sparklyr and SparkR libraries; the
other is to use the SparkR shell that ships with Spark. I am not connecting to
a remote Spark cluster, only to a local one. Both attempts failed, with or
without hive-site.xml. I suspect the content of the hive-site.xml I found
online is not appropriate for this case, because the Spark session cannot be
initialized after adding it. My questions are:
1. Is there an example of hive-site.xml content for this case?
2. I used the sql() function to call Spark SQL (a sketch of the pattern I am
following is right below); is this the right way to do it?
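To make question 2 concrete, the pattern I am trying to follow is roughly the
following (based on my reading of the SparkR documentation; the SPARK_HOME
path is just my local install, and I may well be misusing something here):

library(SparkR)

# Point SparkR at the local Spark installation (local path on my machine).
Sys.setenv(SPARK_HOME = "/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")

# Start the session; enableHiveSupport should let sql() go through the Hive metastore.
sparkR.session(sparkHome = Sys.getenv("SPARK_HOME"), enableHiveSupport = TRUE)

# One statement per sql() call.
sql("CREATE DATABASE IF NOT EXISTS learnsql")
sql("USE learnsql")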
###################################
##Here is the content of my hive-site.xml:##
###################################
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.76.100:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123</value>
<description>password to use against metastore database</description>
</property>
</configuration>
################################
##Here is what happened in R:##
################################
> library(sparklyr) # load sparklyr package
> sc=spark_connect(master="local",spark_home="/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
> # connect sparklyr with spark
> sql('create database learnsql')
Error in sql("create database learnsql") : could not find function "sql"
> library(SparkR)
Attaching package: ‘SparkR’
The following object is masked from ‘package:sparklyr’:
collect
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var, window
The following objects are masked from ‘package:base’:
as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind,
sample, startsWith, subset, summary, transform, union
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
Spark not found in SPARK_HOME:
Spark package found in SPARK_HOME:
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7
Launching java with spark-submit command
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7/bin/spark-submit
sparkr-shell
/var/folders/d8/7j6xswf92c3gmhwy_lrk63pm0000gn/T//Rtmpz22kK9/backend_port103d4cfcfd2c
19/06/08 11:14:57 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Error in handleErrors(returnStatus, conn) :
... (hundreds of lines of output and errors omitted here) ...
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
###################################
##Here is what happened in the SparkR shell:##
####################################
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:145)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:144)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:80)
  at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:79)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
  at scala.collection.Iterator$class.foreach(Iterator.sca
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
Thank you very much.
YA
> On Jun 8, 2019, at 1:44 AM, Rishikesh Gawade <[email protected]> wrote:
>
> Hi.
> 1. Yes, you can connect to Spark via R. If you are connecting to a remote
> Spark cluster, then you'll need EITHER a Spark binary along with hive-site.xml
> in its config directory on the machine running R, OR a Livy server installed on
> the cluster. You can then go on to use sparklyr, which, although it has almost
> the same functions as SparkR, is recommended over the latter.
> For the first method mentioned above, use
> sc <- sparklyr::spark_connect(master = "yarn-client", spark_home =
> Sys.getenv("SPARK_HOME"), conf = spark_config())
> For the second method, use
> sc <- sparklyr::spark_connect( master = "livyserverIP:port", method = "livy",
> conf = livy_config(conf = spark_config(), username = "foo", password = "bar"))
>
> 2. The reason that you're not getting the desired result could be that
> hive-site.xml is missing. To be able to connect to Hive from
> spark-shell/spark-submit/SparkR/sparklyr and perform SQL operations, you need
> to have hive-site.xml in the $SPARK_HOME/conf directory. This hive-site.xml
> should contain one and only one configuration property, namely
> 'hive.metastore.uris', as sketched below.
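> A minimal hive-site.xml in that spirit would look roughly like the following
> (the thrift host and port are placeholders for your own Hive metastore
> service, not values taken from your setup):
> <configuration>
>   <property>
>     <name>hive.metastore.uris</name>
>     <!-- placeholder: point this at the machine running the Hive metastore service -->
>     <value>thrift://your-metastore-host:9083</value>
>   </property>
> </configuration>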
>
> 3. In the case of the spark-sql shell, it should work after putting the
> aforementioned hive-site.xml in Spark's conf directory. If it still doesn't
> work, then please check the syntax of your statements (see the sketch below).
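> Judging from the parse error in your mail below, Spark SQL's parser rejects
> the NOT NULL column constraints and the integer(5)/integer(10) types, so a
> stripped-down CREATE TABLE along these lines should at least parse (a sketch,
> not tested against your setup; phone and pager are kept as strings here):
> CREATE TABLE employee_tbl (
>   emp_id VARCHAR(10),
>   emp_name CHAR(10),
>   emp_st_addr CHAR(10),
>   emp_city CHAR(10),
>   emp_st CHAR(10),
>   emp_zip INT,
>   emp_phone STRING,
>   emp_pager STRING
> );
> (Note also that the INSERT statement in your mail supplies seven values for an
> eight-column table, which will fail even once the table exists.)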
>
> Regards,
> Rishikesh Gawade
>
>
> On Thu, Jun 6, 2019, 12:18 PM ya <[email protected]> wrote:
> Dear list,
>
> I am trying to use Spark SQL from R. I have the following questions; could
> you give me some advice, please? Thank you very much.
>
> 1. I connect R and Spark using the SparkR library; perhaps some members here
> are also R users? Do I understand correctly that Spark SQL can be connected to
> and triggered via SparkR and used from within R (not in Spark's SparkR
> shell)?
>
> 2. I loaded the SparkR library in R and tried to create a new SQL database
> and a table, but I could not get the database and the table I want. The code
> looks like this:
>
> library(SparkR)
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
> sql("create database learnsql; use learnsql")
> sql("
> create table employee_tbl
> (emp_id varchar(10) not null,
> emp_name char(10) not null,
> emp_st_addr char(10) not null,
> emp_city char(10) not null,
> emp_st char(10) not null,
> emp_zip integer(5) not null,
> emp_phone integer(10) null,
> emp_pager integer(10) null);
> insert into employee_tbl values ('0001','john','yanlanjie
> 1','gz','jiaoqiaojun','510006','1353');
> select*from employee_tbl;
> ")
>
> I ran the following code in the spark-sql shell. I get the database learnsql;
> however, I still can't get the table.
>
> spark-sql> create database learnsql;show databases;
> 19/06/06 14:42:36 INFO HiveMetaStore: 0: create_database:
> Database(name:learnsql, description:,
> locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
> 19/06/06 14:42:36 INFO audit: ugi=ya ip=unknown-ip-addr
> cmd=create_database: Database(name:learnsql, description:,
> locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
> Error in query: org.apache.hadoop.hive.metastore.api.AlreadyExistsException:
> Database learnsql already exists;
>
> spark-sql> create table employee_tbl
> > (emp_id varchar(10) not null,
> > emp_name char(10) not null,
> > emp_st_addr char(10) not null,
> > emp_city char(10) not null,
> > emp_st char(10) not null,
> > emp_zip integer(5) not null,
> > emp_phone integer(10) null,
> > emp_pager integer(10) null);
> Error in query:
> no viable alternative at input 'create table employee_tbl\n(emp_id
> varchar(10) not'(line 2, pos 20)
>
> == SQL ==
> create table employee_tbl
> (emp_id varchar(10) not null,
> --------------------^^^
> emp_name char(10) not null,
> emp_st_addr char(10) not null,
> emp_city char(10) not null,
> emp_st char(10) not null,
> emp_zip integer(5) not null,
> emp_phone integer(10) null,
> emp_pager integer(10) null)
>
> spark-sql> insert into employee_tbl values ('0001','john','yanlanjie
> 1','gz','jiaoqiaojun','510006','1353');
> 19/06/06 14:43:43 INFO HiveMetaStore: 0: get_table : db=default
> tbl=employee_tbl
> 19/06/06 14:43:43 INFO audit: ugi=ya ip=unknown-ip-addr cmd=get_table
> : db=default tbl=employee_tbl
> Error in query: Table or view not found: employee_tbl; line 1 pos 0
>
>
> Does Spark SQL have a different syntax? What did I miss?
>
> Thank you very much.
>
> Best regards,
>
> YA
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>