Hi Ajay
Looking at the Spark code, I can see you used a HiveContext.
Can you try using a SQLContext instead of a HiveContext there?
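
For example, something along these lines (a rough, untested sketch; the connection URL and table name are placeholders based on your earlier mails, adjust them to your setup):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("SasReadTest"))
val sqlContext = new SQLContext(sc)

// Placeholders: replace the URL with your actual SAS connection string
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sasiom://<sas-host>:<port>",
  "driver"  -> "com.sas.rio.MVADriver",
  "dbtable" -> "sasLib.run_control"
)).load()
df.show()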

Thanks
Deepak

On Mon, Jun 13, 2016 at 10:15 PM, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Mohit,
>
> Thanks for your time. Please find my response below.
>
> Did you try the same with another database?
> I do load data from MySQL and SQL Server the same way (through Spark
> SQL JDBC), and that works perfectly fine.
>
> As a workaround you can write the select statement yourself instead of
> just providing the table name?
> Yes, I tried that too. It did not make any difference.
>
> Thank you,
> Ajay
>
> On Sunday, June 12, 2016, Mohit Jaggi <mohitja...@gmail.com> wrote:
>
>> Looks like a bug in the code generating the SQL query… why it would be
>> specific to SAS, I can’t guess. Did you try the same with another database?
>> As a workaround you can write the select statement yourself instead of just
>> providing the table name.
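>>
>> For example (an untested sketch; sasUrl stands in for the SAS connection
>> string you already use, and the table and columns come from your logs):
>>
>> val df = sqlContext.read.format("jdbc").options(Map(
>>   "url"     -> sasUrl,
>>   "driver"  -> "com.sas.rio.MVADriver",
>>   "dbtable" -> "(SELECT a.sr_no, a.start_dt, a.end_dt FROM sasLib.run_control a) t"
>> )).load()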
>>
>> On Jun 11, 2016, at 6:27 PM, Ajay Chander <itsche...@gmail.com> wrote:
>>
>> I tried implementing the same functionality through Scala as well, but no
>> luck so far. Just wondering if anyone here has tried using Spark SQL to read
>> a SAS dataset? Thank you.
>>
>> Regards,
>> Ajay
>>
>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Mich, I completely agree with you. I built another Spark SQL application
>>> which reads data from MySQL and SQL Server and writes the data into
>>> Hive (Parquet + Snappy format). I have this problem only when I read directly
>>> from the remote SAS system. The interesting part is that I am using the same
>>> driver to read data through a pure Java app and the Spark app. It works fine
>>> in the Java app, so I cannot blame the SAS driver here. Trying to understand
>>> where the problem could be. Thanks for sharing this with me.
>>>
>>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> I personally use Scala to do something similar. For example, here I
>>>> extract data from an Oracle table and store it in an ORC table in Hive.
>>>> This is compiled via sbt and run with spark-submit.
>>>>
>>>> It is similar to your code but in Scala. Note that I do not enclose my
>>>> column names in double quotes.
>>>>
>>>> import org.apache.spark.SparkContext
>>>> import org.apache.spark.SparkConf
>>>> import org.apache.spark.sql.Row
>>>> import org.apache.spark.sql.hive.HiveContext
>>>> import org.apache.spark.sql.types._
>>>> import org.apache.spark.sql.SQLContext
>>>> import org.apache.spark.sql.functions._
>>>>
>>>> object ETL_scratchpad_dummy {
>>>>   def main(args: Array[String]) {
>>>>   val conf = new SparkConf().
>>>>                setAppName("ETL_scratchpad_dummy").
>>>>                set("spark.driver.allowMultipleContexts", "true")
>>>>   val sc = new SparkContext(conf)
>>>>   // Create sqlContext based on HiveContext
>>>>   val sqlContext = new HiveContext(sc)
>>>>   import sqlContext.implicits._
>>>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>   println ("\nStarted at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>>>>   HiveContext.sql("use oraclehadoop")
>>>>   var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>>>>   var _username : String = "scratchpad"
>>>>   var _password : String = "xxxxxxxx"
>>>>
>>>>   // Get data from Oracle table scratchpad.dummy
>>>>   val d = HiveContext.load("jdbc",
>>>>   Map("url" -> _ORACLEserver,
>>>>   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>>   "user" -> _username,
>>>>   "password" -> _password))
>>>>
>>>>    d.registerTempTable("tmp")
>>>>   //
>>>>   // Need to create and populate target ORC table oraclehadoop.dummy
>>>>   //
>>>>   HiveContext.sql("use oraclehadoop")
>>>>   //
>>>>   // Drop and create table dummy
>>>>   //
>>>>   HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>>>>   var sqltext : String = ""
>>>>   sqltext = """
>>>>   CREATE TABLE oraclehadoop.dummy (
>>>>      ID INT
>>>>    , CLUSTERED INT
>>>>    , SCATTERED INT
>>>>    , RANDOMISED INT
>>>>    , RANDOM_STRING VARCHAR(50)
>>>>    , SMALL_VC VARCHAR(10)
>>>>    , PADDING  VARCHAR(10)
>>>>   )
>>>>   CLUSTERED BY (ID) INTO 256 BUCKETS
>>>>   STORED AS ORC
>>>>   TBLPROPERTIES (
>>>>   "orc.create.index"="true",
>>>>   "orc.bloom.filter.columns"="ID",
>>>>   "orc.bloom.filter.fpp"="0.05",
>>>>   "orc.compress"="SNAPPY",
>>>>   "orc.stripe.size"="16777216",
>>>>   "orc.row.index.stride"="10000" )
>>>>   """
>>>>    HiveContext.sql(sqltext)
>>>>   //
>>>>   // Put data in Hive table. Clean up is already done
>>>>   //
>>>>   sqltext = """
>>>>   INSERT INTO TABLE oraclehadoop.dummy
>>>>   SELECT
>>>>           ID
>>>>         , CLUSTERED
>>>>         , SCATTERED
>>>>         , RANDOMISED
>>>>         , RANDOM_STRING
>>>>         , SMALL_VC
>>>>         , PADDING
>>>>   FROM tmp
>>>>   """
>>>>    HiveContext.sql(sqltext)
>>>>   println ("\nFinished at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>>>>   sys.exit()
>>>>  }
>>>> }
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> Thanks for the response. If you look at my programs, I am not writing
>>>>> my queries to include column names in a pair of "". The driver in my Spark
>>>>> program is generating a query with the column names in "", which I do not
>>>>> want. On the other hand, I am using the same driver in my pure Java program
>>>>> (attached), and there the same driver is generating a proper SQL query
>>>>> without "".
>>>>>
>>>>> Pure Java log:
>>>>>
>>>>> 2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>>>>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>>
>>>>> Spark SQL log:
>>>>>
>>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
>>>>> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>>
>>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
>>>>> 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>>> Please find the complete program and full logs attached in the thread
>>>>> below. Thank you.
>>>>>
>>>>> Regards,
>>>>> Ajay
>>>>>
>>>>>
>>>>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Assuming I understood your query: in Spark SQL (that is, you log in to
>>>>>> Spark SQL like spark-sql --master spark://<HOST_NAME>:7077) you do
>>>>>> not need double quotes around column names for the SQL to work.
>>>>>>
>>>>>> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
>>>>>> hello from Mich
>>>>>>
>>>>>> Anything between a pair of "" will be interpreted as text, NOT a column
>>>>>> name.
>>>>>>
>>>>>> In Spark SQL you do not need double quotes. So simply
>>>>>>
>>>>>> spark-sql> select prod_id, cust_id from sales limit 2;
>>>>>> 17      28017
>>>>>> 18      10419
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi again, has anyone in this group tried to access a SAS dataset through
>>>>>>> Spark SQL? Thank you.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ajay
>>>>>>>
>>>>>>>
>>>>>>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Spark Users,
>>>>>>>>
>>>>>>>> I hope everyone here is doing great.
>>>>>>>>
>>>>>>>> I am trying to read data from SAS through Spark SQL and write it into
>>>>>>>> HDFS. Initially, I started with a pure Java program; please find the
>>>>>>>> program and logs in the attached file sas_pure_java.txt. My
>>>>>>>> program ran successfully and it returned the data from SAS. Please
>>>>>>>> note the highlighted part in the log.
>>>>>>>>
>>>>>>>> My SAS dataset has 4 rows.
>>>>>>>>
>>>>>>>> The program ran successfully, so my output is:
>>>>>>>>
>>>>>>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>>>>>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created
>>>>>>>> result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>>>>>>
>>>>>>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next);
>>>>>>>> time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>>>>>>
>>>>>>>> 1,'2016-01-01','2016-01-31'
>>>>>>>>
>>>>>>>> 2,'2016-02-01','2016-02-29'
>>>>>>>>
>>>>>>>> 3,'2016-03-01','2016-03-31'
>>>>>>>>
>>>>>>>> 4,'2016-04-01','2016-04-30'
>>>>>>>>
>>>>>>>>
>>>>>>>> Please find the full logs attached to this email in the file
>>>>>>>> sas_pure_java.txt.
>>>>>>>>
>>>>>>>> _______________________
>>>>>>>>
>>>>>>>>
>>>>>>>> Now I am trying to do the same via Spark SQL. Please find my
>>>>>>>> program and logs attached to this email in the file sas_spark_sql.txt.
>>>>>>>>
>>>>>>>> Connection to SAS dataset is established successfully. But please
>>>>>>>> note the highlighted log below.
>>>>>>>>
>>>>>>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>>>>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared
>>>>>>>> statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>>>>>>
>>>>>>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>>>>>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created
>>>>>>>> result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>>>>>>> Please find the full logs attached to this email in the file
>>>>>>>> sas_spark_sql.txt.
>>>>>>>>
>>>>>>>> I am using the same driver in both the pure Java and Spark SQL programs,
>>>>>>>> but the query generated in Spark SQL has quotes around the column
>>>>>>>> names (highlighted above).
>>>>>>>> So my resulting output for that query is like this:
>>>>>>>>
>>>>>>>> +-----+--------+------+
>>>>>>>> |  _c0|     _c1|   _c2|
>>>>>>>> +-----+--------+------+
>>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>>> |SR_NO|start_dt|end_dt|
>>>>>>>> +-----+--------+------+
>>>>>>>>
>>>>>>>> Since both programs are using the same driver, com.sas.rio.MVADriver,
>>>>>>>> the expected output should be the same as my pure Java program's output.
>>>>>>>> But something else is happening behind the scenes.
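>>>>>>>>
>>>>>>>> One thing I am considering trying (a rough, untested sketch) is
>>>>>>>> registering a custom JdbcDialect so that Spark stops putting double
>>>>>>>> quotes around identifiers for this connection. The URL prefix check
>>>>>>>> below is a guess and would need to match my actual SAS JDBC URL:
>>>>>>>>
>>>>>>>> import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
>>>>>>>>
>>>>>>>> // Leave column names unquoted for the SAS connection, so the generated
>>>>>>>> // SELECT matches what the pure Java program sends.
>>>>>>>> object SasDialect extends JdbcDialect {
>>>>>>>>   override def canHandle(url: String): Boolean = url.startsWith("jdbc:sas")
>>>>>>>>   override def quoteIdentifier(colName: String): String = colName
>>>>>>>> }
>>>>>>>>
>>>>>>>> JdbcDialects.registerDialect(SasDialect)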
>>>>>>>>
>>>>>>>> Any insights on this issue would be appreciated. Thanks for your time.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Ajay
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
