connecting spark with mysql

2019-06-19 Thread ya
Hi everyone,

I tried to manipulate MySQL tables from Spark. I do not want to move these 
tables from MySQL into Spark, as they can easily get very big; ideally the data 
stays in the database where it is stored. For me, Spark is only used to speed up 
the read and write process (I am more a data analyst than an application 
developer), so I did not install Hadoop. People here have helped me a lot, but I 
still cannot connect MySQL to Spark. Possible reasons include, for instance, the 
Java version, the location of the Java files, the location of the connector 
files, the MySQL version, the location of the environment variables, the use of 
JDBC or ODBC, and so on. My questions are:

1. Do we need to install Hadoop and Java before installing Spark?

2. Which versions of these packages are known to be stable for a successful 
installation and connection, in anyone's experience? (The solutions online may 
have worked on older versions of these packages, but they do not seem to work in 
my case. I am on a Mac, by the way.)

3. So far, the only approach I have tried successfully is to use the sqldf 
package with SparkR to connect to MySQL. But does that mean Spark is actually 
doing the work (speeding up the processing) when I run SQL queries through the 
sqldf package in SparkR?
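
For reference, a minimal sketch of reading a MySQL table directly over JDBC from SparkR, so the data stays in MySQL until a query actually pulls rows; the host, database, table name, credentials, and connector version below are placeholders, and it assumes the MySQL Connector/J jar is made available to the session (here via sparkPackages):

library(SparkR)
# Start a local Spark session and fetch the MySQL JDBC driver from Maven
# (the connector version is an assumption; use whatever matches your MySQL).
sparkR.session(master = "local[*]",
               sparkPackages = "mysql:mysql-connector-java:5.1.47")

# Register one MySQL table as a Spark DataFrame over JDBC; rows are read
# from MySQL only when a query runs against it.
df <- read.jdbc(url = "jdbc:mysql://localhost:3306/mydb",
                tableName = "mytable",
                user = "myuser", password = "mypassword",
                driver = "com.mysql.jdbc.Driver")

head(df)   # pulls a handful of rows to check the connection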

I hope I described my questions clearly. Thank you very much for the help.

Best regards,

YA




Spark SQL in R?

2019-06-07 Thread ya
Dear Felix, Rishikesh, and list,

Thank you very much for your previous help. So far I have tried two ways to 
trigger Spark SQL: one is to use R with the sparklyr and SparkR libraries; the 
other is to use the SparkR shell that ships with Spark. I am not connecting to a 
remote Spark cluster, only a local one. Both attempts failed, with or without 
hive-site.xml. I suspect the hive-site.xml content I found online is not 
appropriate for this case, as the Spark session cannot be initialized after 
adding it. My questions are:

1. Is there an example of appropriate hive-site.xml content for this case?

2. I used the sql() function to call Spark SQL; is this the right way to do it?
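
For question 2, this is a minimal sketch of the calling pattern being attempted, with sql() only used after the session is up (the path and the enableHiveSupport flag are assumptions):

library(SparkR)
Sys.setenv(SPARK_HOME = "/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
# sql() works only after sparkR.session() has succeeded.
sparkR.session(sparkHome = Sys.getenv("SPARK_HOME"), enableHiveSupport = TRUE)
showDF(sql("SHOW DATABASES"))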

###
##Here is the content in the hive-site.xml:##
###



<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.76.100:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123</value>
    <description>password to use against metastore database</description>
  </property>

</configuration>





##Here is what happened in R:##


> library(sparklyr) # load sparklyr package
> sc=spark_connect(master="local",spark_home="/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7") # connect sparklyr with spark
> sql('create database learnsql')
Error in sql("create database learnsql") : could not find function "sql"
> library(SparkR)

Attaching package: ‘SparkR’

The following object is masked from ‘package:sparklyr’:

collect

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind,
sample, startsWith, subset, summary, transform, union

> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7') 
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
Spark not found in SPARK_HOME: 
Spark package found in SPARK_HOME: 
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7
Launching java with spark-submit command 
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7/bin/spark-submit   
sparkr-shell 
/var/folders/d8/7j6xswf92c3gmhwy_lrk63pmgn/T//Rtmpz22kK9/backend_port103d4cfcfd2c
 
19/06/08 11:14:57 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Error in handleErrors(returnStatus, conn) : 

…… hundreds of lines of log output and errors here ……

> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
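
A side note on the R transcript above: Sys.getenv() expects the name of an environment variable rather than a path, so Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7') returns an empty string, which is why the session first reports "Spark not found in SPARK_HOME:". A minimal corrected sketch:

Sys.setenv(SPARK_HOME = "/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
# Pass the variable name, not the path, to Sys.getenv().
sparkR.session(sparkHome = Sys.getenv("SPARK_HOME"))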



###
##Here is what happened in SparkR shell:##


Error in handleErrors(returnStatus, conn) : 
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:145)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:144)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:80)
at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:79)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.Iterator$class.foreach(Iterator.sca
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized



Thank you very much.

YA







> On 8 Jun 2019, at 1:44 AM, Rishikesh Gawade wrote:
> 
> Hi.
> 1. Yes, you can connect to Spark via R. If you are connecting to a remote 
> Spark cluster then you'll need EITHER a Spark binary along with hive-site.xml 
> in its config directory on the machine ru

sparksql in sparkR?

2019-06-05 Thread ya
Dear list,

I am trying to use Spark SQL from within R. I have the following questions; 
could you give me some advice, please? Thank you very much.

1. I connect R and Spark using the SparkR library; presumably some members here 
are also R users? Do I understand correctly that Spark SQL can be connected to 
and triggered via SparkR and used from within R (not just in Spark's SparkR shell)?

2. I loaded the SparkR library in R and tried to create a new SQL database and 
table, but I could not get the database and table I want. The code looks like this:

library(SparkR)
Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7') 
sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
sql("create database learnsql; use learnsql")
sql("
create table employee_tbl
(emp_id varchar(10) not null,
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null);
insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
select*from employee_tbl;
")

I ran the following code in the spark-sql shell. I get the database learnsql; 
however, I still cannot get the table.

spark-sql> create database learnsql;show databases;
19/06/06 14:42:36 INFO HiveMetaStore: 0: create_database: 
Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
19/06/06 14:42:36 INFO audit: ugi=ya  ip=unknown-ip-addr  cmd=create_database: 
Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
Error in query: org.apache.hadoop.hive.metastore.api.AlreadyExistsException: 
Database learnsql already exists;

spark-sql> create table employee_tbl
 > (emp_id varchar(10) not null,
 > emp_name char(10) not null,
 > emp_st_addr char(10) not null,
 > emp_city char(10) not null,
 > emp_st char(10) not null,
 > emp_zip integer(5) not null,
 > emp_phone integer(10) null,
 > emp_pager integer(10) null);
Error in query: 
no viable alternative at input 'create table employee_tbl\n(emp_id varchar(10) 
not'(line 2, pos 20)

== SQL ==
create table employee_tbl
(emp_id varchar(10) not null,
^^^
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null)

spark-sql> insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
19/06/06 14:43:43 INFO HiveMetaStore: 0: get_table : db=default tbl=employee_tbl
19/06/06 14:43:43 INFO audit: ugi=ya  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=employee_tbl 
Error in query: Table or view not found: employee_tbl; line 1 pos 0


Does Spark SQL have a different SQL grammar? What did I miss?
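
For what it's worth, the parse error above points at the NOT NULL constraints: Spark SQL's CREATE TABLE syntax (at least in 2.4) does not accept NOT NULL column constraints or precision on integer types, and sql() executes one statement per call. A sketch of roughly equivalent DDL that Spark SQL should accept, assuming a SparkR session with Hive support is already initialized (the STRING/INT column types are an approximation of the original schema):

sql("CREATE DATABASE IF NOT EXISTS learnsql")
sql("USE learnsql")
sql("CREATE TABLE IF NOT EXISTS employee_tbl (
       emp_id      STRING,
       emp_name    STRING,
       emp_st_addr STRING,
       emp_city    STRING,
       emp_st      STRING,
       emp_zip     INT,
       emp_phone   STRING,
       emp_pager   STRING)")
# The original INSERT listed seven values for eight columns; NULL fills emp_pager here.
sql("INSERT INTO employee_tbl VALUES
       ('0001','john','yanlanjie 1','gz','jiaoqiaojun',510006,'1353',NULL)")
head(sql("SELECT * FROM employee_tbl"))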

Thank you very much.

Best regards,

YA







installation of spark

2019-06-04 Thread ya
Dear list,


I am very new to Spark, and I am having trouble installing it on my Mac. I have 
the following questions; please give me some guidance. Thank you very much.


1. What software, and how much of it, should I install before installing Spark? 
I have been searching online, and people discuss their experiences on this topic 
with different opinions: some say there is no need to install Hadoop before 
installing Spark, while others say Hadoop has to be installed first. Some people 
say Scala has to be installed, whereas others say Scala is included in Spark and 
is installed automatically along with it. So I am confused about what to install 
to get started.


2. Is there a simple way to configure this software, for instance an 
all-in-one configuration file? It takes forever for me to configure things 
before I can really use Spark for data analysis.


I hope my questions make sense. Thank you very much.


Best regards,


YA

dummy coding in sparklyr

2019-02-27 Thread ya
Dear list,

I am trying to run some regression models on a big data set using sparklyr. 
Some of the explanatory variables (Xs) in my model are categorical, so they have 
to be converted into dummy codes before the analysis. I understand that in Spark 
such columns need to be indexed as strings and then passed through 
ft_one_hot_encoder to get the dummy coding. There are some discussions online, 
but I could not figure out how to write the code properly; could you give me 
some suggestions, please? Thank you very much.

The code looks as below:

> sc_mtcars%>%ft_string_indexer("gear","gear1")%>%ft_one_hot_encoder("gear1","gear2")%>%ml_linear_regression(hp~gear1+wt)
Formula: hp ~ gear1 + wt

Coefficients:
(Intercept)       gear1          wt 
  -78.38285    36.41416    62.17596 

As you can see, it seems ft_one_hot_encoder("gear1","gear2") didn't work; 
otherwise there should be two coefficients for gear2. Any idea what went wrong?
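
One possible explanation, offered as a guess: the regression formula references gear1 (the numeric index produced by ft_string_indexer) rather than gear2 (the one-hot-encoded vector), so the encoder's output never enters the model. A minimal sketch of the same pipeline pointed at gear2 instead (untested here; the local connection and copying the built-in mtcars data over are assumptions):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sc_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

fit <- sc_mtcars %>%
  ft_string_indexer("gear", "gear1") %>%    # index the categorical column
  ft_one_hot_encoder("gear1", "gear2") %>%  # expand the index into dummy columns
  ml_linear_regression(hp ~ gear2 + wt)     # point the formula at the encoded column

summary(fit)

Note also that sparklyr's formula interface can dummy-code a string column on its own, so converting gear to character and writing hp ~ gear + wt may work without the explicit ft_ steps; that is another assumption worth testing.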

One more thing: there are some earlier posts online showing regression results 
with significance-test information (standard errors and p-values). Is there any 
way to extract this information with the latest release of sparklyr?

Thank you very much.

Best regards,

YA. 
