Mich, I completely agree with you. I built another Spark SQL application which reads data from MySQL and SQL Server and writes it into Hive (Parquet + Snappy format). I have this problem only when I read directly from the remote SAS system. The interesting part is that I am using the same driver to read the data in both the pure Java app and the Spark app. It works fine in the Java app, so I cannot blame the SAS driver here. Trying to understand where the problem could be. Thanks for sharing this with me.
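One avenue I am looking at is registering a custom JdbcDialect so that Spark stops wrapping column names in double quotes for this connection. This is only an untested sketch: it assumes the Spark 1.6 `JdbcDialects` API, and the `jdbc:sasiom` URL prefix is my guess at what the SAS driver's connection URL looks like -- adjust it to whatever URL you actually use.

```java
// Hypothetical sketch: a custom JdbcDialect that leaves identifiers bare.
// Assumes Spark 1.6.x on the classpath; "jdbc:sasiom" is a guessed URL prefix.
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

public class SasDialect extends JdbcDialect {
    @Override
    public boolean canHandle(String url) {
        // Claim only SAS connection URLs so other JDBC sources are unaffected.
        return url.startsWith("jdbc:sasiom");
    }

    @Override
    public String quoteIdentifier(String colName) {
        // Spark's default wraps colName in double quotes; the SAS driver then
        // treats "SR_NO" as a string literal. Return the name bare instead.
        return colName;
    }
}
```

It would need to be registered with `JdbcDialects.registerDialect(new SasDialect());` before the JDBC DataFrame is created.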
On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I personally use Scala to do something similar. For example, here I extract
> data from an Oracle table and store it in an ORC table in Hive. This is
> compiled via sbt and run with spark-submit.
>
> It is similar to your code but in Scala. Note that I do not enclose my
> column names in double quotes.
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
>
> object ETL_scratchpad_dummy {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().
>       setAppName("ETL_scratchpad_dummy").
>       set("spark.driver.allowMultipleContexts", "true")
>     val sc = new SparkContext(conf)
>     // Create sqlContext based on HiveContext
>     val sqlContext = new HiveContext(sc)
>     import sqlContext.implicits._
>     val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>     println("\nStarted at")
>     sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>     HiveContext.sql("use oraclehadoop")
>     var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>     var _username : String = "scratchpad"
>     var _password : String = "xxxxxxxx"
>
>     // Get data from Oracle table scratchpad.dummy
>     val d = HiveContext.load("jdbc",
>       Map("url" -> _ORACLEserver,
>         "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>         "user" -> _username,
>         "password" -> _password))
>
>     d.registerTempTable("tmp")
>     //
>     // Need to create and populate target ORC table oraclehadoop.dummy
>     //
>     HiveContext.sql("use oraclehadoop")
>     //
>     // Drop and create table dummy
>     //
>     HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>     var sqltext : String = ""
>     sqltext = """
>     CREATE TABLE oraclehadoop.dummy (
>        ID INT
>      , CLUSTERED INT
>      , SCATTERED INT
>      , RANDOMISED INT
>      , RANDOM_STRING VARCHAR(50)
>      , SMALL_VC VARCHAR(10)
>      , PADDING VARCHAR(10)
>     )
>     CLUSTERED BY (ID) INTO 256 BUCKETS
>     STORED AS ORC
>     TBLPROPERTIES (
>       "orc.create.index"="true",
>       "orc.bloom.filter.columns"="ID",
>       "orc.bloom.filter.fpp"="0.05",
>       "orc.compress"="SNAPPY",
>       "orc.stripe.size"="16777216",
>       "orc.row.index.stride"="10000" )
>     """
>     HiveContext.sql(sqltext)
>     //
>     // Put data in Hive table. Clean up is already done
>     //
>     sqltext = """
>     INSERT INTO TABLE oraclehadoop.dummy
>     SELECT
>        ID
>      , CLUSTERED
>      , SCATTERED
>      , RANDOMISED
>      , RANDOM_STRING
>      , SMALL_VC
>      , PADDING
>     FROM tmp
>     """
>     HiveContext.sql(sqltext)
>     println("\nFinished at")
>     sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>     sys.exit()
>   }
> }
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thanks for the response. If you look at my programs, I am not writing my
>> queries with column names in a pair of "". The driver in my Spark program
>> is generating such a query, with column names in "", which I do not want.
>> On the other hand, I am using the same driver in my pure Java program
>> (attached); there, the same driver generates a proper SQL query without "".
>>
>> Pure Java log:
>>
>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>
>> Spark SQL log:
>>
>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>
>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>
>> Please find the complete program and full logs attached in the thread below.
>> Thank you.
>>
>> Regards,
>> Ajay
>>
>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Assuming I understood your query: in Spark SQL (that is, when you log in
>>> to Spark SQL like spark-sql --master spark://<HOST_NAME>:7077) you do not
>>> need double quotes around column names for the SQL to work:
>>>
>>> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
>>> hello from Mich
>>>
>>> Anything between a pair of "" will be interpreted as text, NOT a column
>>> name.
>>>
>>> In Spark SQL you do not need double quotes. So simply:
>>>
>>> spark-sql> select prod_id, cust_id from sales limit 2;
>>> 17      28017
>>> 18      10419
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>>>
>>>> Hi again, has anyone in this group tried to access a SAS dataset
>>>> through Spark SQL?
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Ajay
>>>>
>>>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>>>
>>>>> Hi Spark Users,
>>>>>
>>>>> I hope everyone here is doing great.
>>>>>
>>>>> I am trying to read data from SAS through Spark SQL and write it into
>>>>> HDFS. Initially, I started with a pure Java program; please find the
>>>>> program and logs in the attached file sas_pure_java.txt. My program ran
>>>>> successfully and returned the data from SAS. Please note the
>>>>> highlighted part in the log.
>>>>>
>>>>> My SAS dataset has 4 rows.
>>>>>
>>>>> The program ran successfully, so my output is:
>>>>>
>>>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>>
>>>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>>>
>>>>> 1,'2016-01-01','2016-01-31'
>>>>> 2,'2016-02-01','2016-02-29'
>>>>> 3,'2016-03-01','2016-03-31'
>>>>> 4,'2016-04-01','2016-04-30'
>>>>>
>>>>> Please find the full logs attached to this email in file sas_pure_java.txt.
>>>>>
>>>>> _______________________
>>>>>
>>>>> Now I am trying to do the same via Spark SQL. Please find my program
>>>>> and logs attached to this email in file sas_spark_sql.txt.
>>>>>
>>>>> The connection to the SAS dataset is established successfully, but
>>>>> please note the highlighted log below.
>>>>>
>>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>>
>>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>>>
>>>>> Please find the full logs attached to this email in file sas_spark_sql.txt.
>>>>>
>>>>> I am using the same driver in both the pure Java and Spark SQL
>>>>> programs, but the query generated in Spark SQL has quotes around the
>>>>> column names (highlighted above). So my resulting output for that
>>>>> query looks like this:
>>>>>
>>>>> +-----+--------+------+
>>>>> |  _c0|     _c1|   _c2|
>>>>> +-----+--------+------+
>>>>> |SR_NO|start_dt|end_dt|
>>>>> |SR_NO|start_dt|end_dt|
>>>>> |SR_NO|start_dt|end_dt|
>>>>> |SR_NO|start_dt|end_dt|
>>>>> +-----+--------+------+
>>>>>
>>>>> Since both programs use the same driver, com.sas.rio.MVADriver, the
>>>>> expected output should be the same as the pure Java program's output,
>>>>> but something else is happening behind the scenes.
>>>>>
>>>>> Any insights on this issue? Thanks for your time.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Ajay
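To make the contrast between the two logged queries concrete, here is a small standalone sketch (a hypothetical helper, not the SAS driver's actual code) of how the same column list renders with and without identifier quoting. SAS reads each double-quoted name as a string literal, which is why every returned row in the thread above is just the column names themselves.

```java
// Hypothetical helper illustrating the two query shapes seen in the logs:
// bare identifiers from the plain-Java path versus double-quoted
// identifiers from Spark's JDBC source.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class QuoteDemo {
    public static String buildSelect(List<String> cols, String table, boolean quoted) {
        String list = cols.stream()
                .map(c -> quoted ? "\"" + c + "\"" : c)
                .collect(Collectors.joining(","));
        return "SELECT " + list + " FROM " + table;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("SR_NO", "start_dt", "end_dt");
        // Pure Java path: identifiers reach SAS bare.
        System.out.println(buildSelect(cols, "sasLib.run_control", false));
        // SELECT SR_NO,start_dt,end_dt FROM sasLib.run_control
        // Spark SQL path: identifiers arrive double-quoted, so SAS returns
        // the literal strings instead of the column values.
        System.out.println(buildSelect(cols, "sasLib.run_control", true));
        // SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control
    }
}
```

The quoting itself is the only difference between the two statements; the table and column list are identical.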