Hi Mohit,

Thanks for your time. Please find my responses below.
Did you try the same with another database? I do load the data from MySQL and SQL Server the same way (through Spark SQL JDBC), and that works perfectly alright.

As a workaround you can write the select statement yourself instead of just providing the table name? Yes, I did that too. It did not make any difference.

Thank you,
Ajay

On Sunday, June 12, 2016, Mohit Jaggi <mohitja...@gmail.com> wrote:

> Looks like a bug in the code generating the SQL query… why it would be
> specific to SAS, I can't guess. Did you try the same with another database?
> As a workaround you can write the select statement yourself instead of just
> providing the table name.
>
> On Jun 11, 2016, at 6:27 PM, Ajay Chander <itsche...@gmail.com> wrote:
>
> I tried implementing the same functionality through Scala as well, but no
> luck so far. Just wondering if anyone here has tried using Spark SQL to
> read a SAS dataset? Thank you
>
> Regards,
> Ajay
>
> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Mich, I completely agree with you. I built another Spark SQL application
>> which reads data from MySQL and SQL Server and writes the data into
>> Hive (parquet + snappy format). I have this problem only when I read
>> directly from the remote SAS system. The interesting part is that I am
>> using the same driver to read data through the pure Java app and the
>> Spark app. It works fine in the Java app, so I cannot blame the SAS
>> driver here. Trying to understand where the problem could be. Thanks for
>> sharing this with me.
>>
>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> I personally use Scala to do something similar. For example, here I
>>> extract data from an Oracle table and store it in an ORC table in Hive.
>>> This is compiled via sbt and run with spark-submit.
>>>
>>> It is similar to your code but in Scala. Note that I do not enclose my
>>> column names in double quotes.
>>>
>>> import org.apache.spark.SparkContext
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.hive.HiveContext
>>> import org.apache.spark.sql.types._
>>> import org.apache.spark.sql.SQLContext
>>> import org.apache.spark.sql.functions._
>>>
>>> object ETL_scratchpad_dummy {
>>>   def main(args: Array[String]) {
>>>     val conf = new SparkConf().
>>>       setAppName("ETL_scratchpad_dummy").
>>>       set("spark.driver.allowMultipleContexts", "true")
>>>     val sc = new SparkContext(conf)
>>>     // Create sqlContext based on HiveContext
>>>     val sqlContext = new HiveContext(sc)
>>>     import sqlContext.implicits._
>>>     val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>     println("\nStarted at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>>>     HiveContext.sql("use oraclehadoop")
>>>     var _ORACLEserver: String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>>>     var _username: String = "scratchpad"
>>>     var _password: String = "xxxxxxxx"
>>>
>>>     // Get data from Oracle table scratchpad.dummy
>>>     val d = HiveContext.load("jdbc",
>>>       Map("url" -> _ORACLEserver,
>>>         "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>         "user" -> _username,
>>>         "password" -> _password))
>>>
>>>     d.registerTempTable("tmp")
>>>     //
>>>     // Need to create and populate target ORC table oraclehadoop.dummy
>>>     //
>>>     HiveContext.sql("use oraclehadoop")
>>>     //
>>>     // Drop and create table dummy
>>>     //
>>>     HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>>>     var sqltext: String = ""
>>>     sqltext = """
>>>     CREATE TABLE oraclehadoop.dummy (
>>>        ID INT
>>>      , CLUSTERED INT
>>>      , SCATTERED INT
>>>      , RANDOMISED INT
>>>      , RANDOM_STRING VARCHAR(50)
>>>      , SMALL_VC VARCHAR(10)
>>>      , PADDING VARCHAR(10)
>>>     )
>>>     CLUSTERED BY (ID) INTO 256 BUCKETS
>>>     STORED AS ORC
>>>     TBLPROPERTIES (
>>>       "orc.create.index"="true",
>>>       "orc.bloom.filter.columns"="ID",
>>>       "orc.bloom.filter.fpp"="0.05",
>>>       "orc.compress"="SNAPPY",
>>>       "orc.stripe.size"="16777216",
>>>       "orc.row.index.stride"="10000" )
>>>     """
>>>     HiveContext.sql(sqltext)
>>>     //
>>>     // Put data in Hive table. Clean up is already done
>>>     //
>>>     sqltext = """
>>>     INSERT INTO TABLE oraclehadoop.dummy
>>>     SELECT
>>>        ID
>>>      , CLUSTERED
>>>      , SCATTERED
>>>      , RANDOMISED
>>>      , RANDOM_STRING
>>>      , SMALL_VC
>>>      , PADDING
>>>     FROM tmp
>>>     """
>>>     HiveContext.sql(sqltext)
>>>     println("\nFinished at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>>>     sys.exit()
>>>   }
>>> }
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> Thanks for the response. If you look at my programs, I am not writing
>>>> my queries to include column names in a pair of "". The driver in my
>>>> Spark program is generating such a query with the column names in "",
>>>> which I do not want. On the other hand, I am using the same driver in
>>>> my pure Java program (attached); in that program the same driver is
>>>> generating a proper SQL query without "".
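[Editor's note: the contrast in the two log excerpts that follow is that the pure Java path sends exactly the SQL string the programmer wrote, while the Spark JDBC path composes its own SELECT list from the resolved schema and, in the version logged here, wraps each column name in double quotes. A minimal sketch of that composition; the class and helper names are made up for illustration, this is not Spark's actual code:]

```java
import java.util.List;
import java.util.stream.Collectors;

public class QuotedProjection {
    // Roughly how a JDBC data source can build its projection: join the
    // resolved column names, optionally wrapping each in ANSI double quotes.
    static String buildSelect(List<String> columns, String table, boolean quoteIdentifiers) {
        String cols = columns.stream()
                .map(c -> quoteIdentifiers ? "\"" + c + "\"" : c)
                .collect(Collectors.joining(","));
        return "SELECT " + cols + " FROM " + table;
    }

    public static void main(String[] args) {
        List<String> cols = List.of("SR_NO", "start_dt", "end_dt");
        // Quoted identifiers, matching the Spark-side log:
        System.out.println(buildSelect(cols, "sasLib.run_control", true));
        // prints: SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control
        // Bare identifiers, matching the pure Java program's query:
        System.out.println(buildSelect(cols, "sasLib.run_control", false));
        // prints: SELECT SR_NO,start_dt,end_dt FROM sasLib.run_control
    }
}
```

Whether the quoting step runs is the only difference between the two queries in the logs below; the driver underneath is the same in both cases.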
>>>>
>>>> Pure Java log:
>>>>
>>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>>>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>
>>>> Spark SQL log:
>>>>
>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared
>>>> statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>
>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result
>>>> set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>>
>>>> Please find the complete program and full logs attached in the thread
>>>> below. Thank you.
>>>>
>>>> Regards,
>>>> Ajay
>>>>
>>>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> Assuming I understood your query: in Spark SQL (that is, when you log
>>>>> in to Spark SQL like spark-sql --master spark://<HOST_NAME>:7077) you
>>>>> do not need double quotes around column names for the SQL to work.
>>>>>
>>>>> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
>>>>> hello from Mich
>>>>>
>>>>> Anything between a pair of "" will be interpreted as text, NOT a
>>>>> column name.
>>>>>
>>>>> In Spark SQL you do not need double quotes. So simply:
>>>>>
>>>>> spark-sql> select prod_id, cust_id from sales limit 2;
>>>>> 17      28017
>>>>> 18      10419
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>>>>>
>>>>>> Hi again, has anyone in this group tried to access a SAS dataset
>>>>>> through Spark SQL?
>>>>>> Thank you
>>>>>>
>>>>>> Regards,
>>>>>> Ajay
>>>>>>
>>>>>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Spark Users,
>>>>>>>
>>>>>>> I hope everyone here is doing great.
>>>>>>>
>>>>>>> I am trying to read data from SAS through Spark SQL and write it
>>>>>>> into HDFS. Initially, I started with a pure Java program; please
>>>>>>> find the program and logs in the attached file sas_pure_java.txt.
>>>>>>> The program ran successfully and it returned the data from SAS.
>>>>>>> Please note the highlighted part in the log.
>>>>>>>
>>>>>>> My SAS dataset has 4 rows.
>>>>>>>
>>>>>>> The program ran successfully, so my output is:
>>>>>>>
>>>>>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>>>>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created
>>>>>>> result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>>>>
>>>>>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next);
>>>>>>> time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>>>>>
>>>>>>> 1,'2016-01-01','2016-01-31'
>>>>>>> 2,'2016-02-01','2016-02-29'
>>>>>>> 3,'2016-03-01','2016-03-31'
>>>>>>> 4,'2016-04-01','2016-04-30'
>>>>>>>
>>>>>>> Please find the full logs attached to this email in the file
>>>>>>> sas_pure_java.txt.
>>>>>>>
>>>>>>> _______________________
>>>>>>>
>>>>>>> Now I am trying to do the same via Spark SQL. Please find my
>>>>>>> program and logs attached to this email in the file
>>>>>>> sas_spark_sql.txt.
>>>>>>>
>>>>>>> The connection to the SAS dataset is established successfully. But
>>>>>>> please note the highlighted log below.
>>>>>>>
>>>>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>>>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared
>>>>>>> statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>>>>
>>>>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>>>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created
>>>>>>> result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>>>>>
>>>>>>> Please find the full logs attached to this email in the file
>>>>>>> sas_spark_sql.txt.
>>>>>>>
>>>>>>> I am using the same driver in both the pure Java and Spark SQL
>>>>>>> programs, but the query generated in Spark SQL has quotes around
>>>>>>> the column names (highlighted above). So my resulting output for
>>>>>>> that query looks like this:
>>>>>>>
>>>>>>> +-----+--------+------+
>>>>>>> |  _c0|     _c1|   _c2|
>>>>>>> +-----+--------+------+
>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>> +-----+--------+------+
>>>>>>>
>>>>>>> Since both programs are using the same driver,
>>>>>>> com.sas.rio.MVADriver, the expected output should be the same as my
>>>>>>> pure Java program's output. But something else is happening behind
>>>>>>> the scenes.
>>>>>>>
>>>>>>> Any insights on this issue? Thanks for your time.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Ajay
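[Editor's note: the four-identical-rows table is consistent with the SAS side treating ANSI double-quoted tokens as string literals rather than identifiers: evaluating SELECT "SR_NO",... then yields the literal text once per row of the four-row table. A toy model of the two interpretations; this is illustrative only, not the SAS driver's actual behavior or code:]

```java
import java.util.ArrayList;
import java.util.List;

public class LiteralVsIdentifier {
    // Toy model: evaluate one SELECT item against in-memory rows.
    // An ANSI-compliant engine treats "X" as the identifier X;
    // an engine that treats "X" as a string literal echoes the text X.
    static List<String> evalItem(String item, List<List<String>> rows,
                                 List<String> columnNames, boolean ansiQuoting) {
        boolean quoted = item.startsWith("\"") && item.endsWith("\"");
        String name = quoted ? item.substring(1, item.length() - 1) : item;
        List<String> out = new ArrayList<>();
        for (List<String> row : rows) {
            if (quoted && !ansiQuoting) {
                out.add(name);                                    // literal: echo the text
            } else {
                out.add(row.get(columnNames.indexOf(name)));      // identifier: fetch the value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> cols = List.of("SR_NO", "start_dt", "end_dt");
        List<List<String>> rows = List.of(
            List.of("1", "2016-01-01", "2016-01-31"),
            List.of("2", "2016-02-01", "2016-02-29"),
            List.of("3", "2016-03-01", "2016-03-31"),
            List.of("4", "2016-04-01", "2016-04-30"));
        // ANSI interpretation: the four SR_NO values.
        System.out.println(evalItem("\"SR_NO\"", rows, cols, true));
        // prints: [1, 2, 3, 4]
        // Literal interpretation: the header text repeated per row,
        // matching the table in the log above.
        System.out.println(evalItem("\"SR_NO\"", rows, cols, false));
        // prints: [SR_NO, SR_NO, SR_NO, SR_NO]
    }
}
```

If that reading is right, the lever is stopping Spark from quoting the projection. Ajay reports above that writing the SELECT himself (via the dbtable option) did not help, which fits, since Spark still wraps a user-supplied subquery in its own quoted projection; depending on the Spark version, registering a custom org.apache.spark.sql.jdbc.JdbcDialect whose quoteIdentifier returns the column name unquoted for SAS JDBC URLs would be the remaining option.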