Mich, I completely agree with you. I built another Spark SQL application which reads data from MySQL and SQL Server and writes it into Hive (Parquet + Snappy format). I have this problem only when I read directly from the remote SAS system. The interesting part is that I am using the same driver to read the data in both the pure Java app and the Spark app. It works fine in the Java app, so I cannot blame the SAS driver here. Trying to understand where the problem could be. Thanks for sharing this with me.
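One avenue I am looking at is registering a custom JdbcDialect so that Spark stops wrapping column names in double quotes for this connection. This is only an untested sketch: it assumes the Spark 1.6 `JdbcDialects` API, and the `jdbc:sasiom` URL prefix is my guess at what the SAS driver's connection URL looks like -- adjust it to whatever URL you actually use.

```java
// Hypothetical sketch: a custom JdbcDialect that leaves identifiers bare.
// Assumes Spark 1.6.x on the classpath; "jdbc:sasiom" is a guessed URL prefix.
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

public class SasDialect extends JdbcDialect {
    @Override
    public boolean canHandle(String url) {
        // Claim only SAS connection URLs so other JDBC sources are unaffected.
        return url.startsWith("jdbc:sasiom");
    }

    @Override
    public String quoteIdentifier(String colName) {
        // Spark's default wraps colName in double quotes; the SAS driver then
        // treats "SR_NO" as a string literal. Return the name bare instead.
        return colName;
    }
}
```

It would need to be registered with `JdbcDialects.registerDialect(new SasDialect());` before the JDBC DataFrame is created.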
On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I personally use Scala to do something similar. For example, here I extract
> data from an Oracle table and store it in an ORC table in Hive. This is
> compiled via sbt and run with spark-submit.
>
> It is similar to your code but in Scala. Note that I do not enclose my
> column names in double quotes.
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
>
> object ETL_scratchpad_dummy {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().
>       setAppName("ETL_scratchpad_dummy").
>       set("spark.driver.allowMultipleContexts", "true")
>     val sc = new SparkContext(conf)
>     // Create sqlContext based on HiveContext
>     val sqlContext = new HiveContext(sc)
>     import sqlContext.implicits._
>     val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>     println("\nStarted at")
>     sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>     HiveContext.sql("use oraclehadoop")
>     var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>     var _username : String = "scratchpad"
>     var _password : String = "xxxxxxxx"
>
>     // Get data from Oracle table scratchpad.dummy
>     val d = HiveContext.load("jdbc",
>       Map("url" -> _ORACLEserver,
>         "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>         "user" -> _username,
>         "password" -> _password))
>
>     d.registerTempTable("tmp")
>     //
>     // Need to create and populate target ORC table oraclehadoop.dummy
>     //
>     HiveContext.sql("use oraclehadoop")
>     //
>     // Drop and create table dummy
>     //
>     HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>     var sqltext : String = ""
>     sqltext = """
>     CREATE TABLE oraclehadoop.dummy (
>        ID INT
>      , CLUSTERED INT
>      , SCATTERED INT
>      , RANDOMISED INT
>      , RANDOM_STRING VARCHAR(50)
>      , SMALL_VC VARCHAR(10)
>      , PADDING VARCHAR(10)
>     )
>     CLUSTERED BY (ID) INTO 256 BUCKETS
>     STORED AS ORC
>     TBLPROPERTIES (
>       "orc.create.index"="true",
>       "orc.bloom.filter.columns"="ID",
>       "orc.bloom.filter.fpp"="0.05",
>       "orc.compress"="SNAPPY",
>       "orc.stripe.size"="16777216",
>       "orc.row.index.stride"="10000" )
>     """
>     HiveContext.sql(sqltext)
>     //
>     // Put data in Hive table. Clean up is already done
>     //
>     sqltext = """
>     INSERT INTO TABLE oraclehadoop.dummy
>     SELECT
>        ID
>      , CLUSTERED
>      , SCATTERED
>      , RANDOMISED
>      , RANDOM_STRING
>      , SMALL_VC
>      , PADDING
>     FROM tmp
>     """
>     HiveContext.sql(sqltext)
>     println("\nFinished at")
>     sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>     sys.exit()
>   }
> }
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thanks for the response. If you look at my programs, I am not writing my
>> queries with column names in a pair of "". The driver in my Spark program
>> is generating such a query, with column names in "", which I do not want.
>> On the other hand, I am using the same driver in my pure Java program
>> (attached); there, the same driver generates a proper SQL query without "".
>>
>> Pure Java log:
>>
>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>
>> Spark SQL log:
>>
>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>
>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>
>> Please find the complete program and full logs attached in the thread below.
>> Thank you.
>>
>> Regards,
>> Ajay
>>
>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Assuming I understood your query: in Spark SQL (that is, when you log in
>>> to Spark SQL like spark-sql --master spark://<HOST_NAME>:7077) you do not
>>> need double quotes around column names for the SQL to work:
>>>
>>> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
>>> hello from Mich
>>>
>>> Anything between a pair of "" will be interpreted as text, NOT a column
>>> name.
>>>
>>> In Spark SQL you do not need double quotes. So simply:
>>>
>>> spark-sql> select prod_id, cust_id from sales limit 2;
>>> 17      28017
>>> 18      10419
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>>>
>>>> Hi again, has anyone in this group tried to access a SAS dataset
>>>> through Spark SQL?
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Ajay
>>>>
>>>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>>>
>>>>> Hi Spark Users,
>>>>>
>>>>> I hope everyone here is doing great.
>>>>>
>>>>> I am trying to read data from SAS through Spark SQL and write it into
>>>>> HDFS. Initially, I started with a pure Java program; please find the
>>>>> program and logs in the attached file sas_pure_java.txt. My program ran
>>>>> successfully and returned the data from SAS. Please note the
>>>>> highlighted part in the log.
>>>>>
>>>>> My SAS dataset has 4 rows.
>>>>>
>>>>> The program ran successfully, so my output is:
>>>>>
>>>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>>
>>>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>>>
>>>>> 1,'2016-01-01','2016-01-31'
>>>>> 2,'2016-02-01','2016-02-29'
>>>>> 3,'2016-03-01','2016-03-31'
>>>>> 4,'2016-04-01','2016-04-30'
>>>>>
>>>>> Please find the full logs attached to this email in file sas_pure_java.txt.
>>>>>
>>>>> _______________________
>>>>>
>>>>> Now I am trying to do the same via Spark SQL. Please find my program
>>>>> and logs attached to this email in file sas_spark_sql.txt.
>>>>>
>>>>> The connection to the SAS dataset is established successfully, but
>>>>> please note the highlighted log below.
>>>>>
>>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>>
>>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>>>
>>>>> Please find the full logs attached to this email in file sas_spark_sql.txt.
>>>>>
>>>>> I am using the same driver in both the pure Java and Spark SQL
>>>>> programs, but the query generated in Spark SQL has quotes around the
>>>>> column names (highlighted above). So my resulting output for that
>>>>> query looks like this:
>>>>>
>>>>> +-----+--------+------+
>>>>> |  _c0|     _c1|   _c2|
>>>>> +-----+--------+------+
>>>>> |SR_NO|start_dt|end_dt|
>>>>> |SR_NO|start_dt|end_dt|
>>>>> |SR_NO|start_dt|end_dt|
>>>>> |SR_NO|start_dt|end_dt|
>>>>> +-----+--------+------+
>>>>>
>>>>> Since both programs use the same driver, com.sas.rio.MVADriver, the
>>>>> expected output should be the same as the pure Java program's output,
>>>>> but something else is happening behind the scenes.
>>>>>
>>>>> Any insights on this issue? Thanks for your time.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Ajay
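To make the contrast between the two logged queries concrete, here is a small standalone sketch (a hypothetical helper, not the SAS driver's actual code) of how the same column list renders with and without identifier quoting. SAS reads each double-quoted name as a string literal, which is why every returned row in the thread above is just the column names themselves.

```java
// Hypothetical helper illustrating the two query shapes seen in the logs:
// bare identifiers from the plain-Java path versus double-quoted
// identifiers from Spark's JDBC source.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class QuoteDemo {
    public static String buildSelect(List<String> cols, String table, boolean quoted) {
        String list = cols.stream()
                .map(c -> quoted ? "\"" + c + "\"" : c)
                .collect(Collectors.joining(","));
        return "SELECT " + list + " FROM " + table;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("SR_NO", "start_dt", "end_dt");
        // Pure Java path: identifiers reach SAS bare.
        System.out.println(buildSelect(cols, "sasLib.run_control", false));
        // SELECT SR_NO,start_dt,end_dt FROM sasLib.run_control
        // Spark SQL path: identifiers arrive double-quoted, so SAS returns
        // the literal strings instead of the column values.
        System.out.println(buildSelect(cols, "sasLib.run_control", true));
        // SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control
    }
}
```

The quoting itself is the only difference between the two statements; the table and column list are identical.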