I have created it using Spark SQL. Then I want to retrieve it from Hive, and
that's where the issue is. I can still retrieve it from Spark with no
problems. Why is Hive not giving me the data?



On Sun, Feb 11, 2018 at 3:06 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Simple question: have you created the table through Spark SQL or through Hive?
>
> I recall similar issues a while back.
>
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> //val sqlContext = new HiveContext(sc)
> println ("\nStarted at"); spark.sql("SELECT FROM_unixtime(unix_timestamp(),
> 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
> //
> // Need to create and populate target ORC table sales in database test in
> Hive
> //
> HiveContext.sql("use test")
> //
> // Drop and create table
> //
> HiveContext.sql("DROP TABLE IF EXISTS test.dummy2")
> var sqltext = ""
> sqltext = """
> CREATE TABLE test.dummy2
>  (
>      ID INT
>    , CLUSTERED INT
>    , SCATTERED INT
>    , RANDOMISED INT
>    , RANDOM_STRING VARCHAR(50)
>    , SMALL_VC VARCHAR(10)
>    , PADDING  VARCHAR(10)
> )
> --CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="10000" )
> """
> HiveContext.sql(sqltext)
> //
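>
> Once a table exists through Hive-compatible DDL like the above, a minimal
> sketch (assuming a DataFrame named df whose columns line up with
> test.dummy2) of populating it so that both Spark and Hive read the same data:
>
> df.write.mode("append").insertInto("test.dummy2")
>
> insertInto writes into the table's existing location and format, so Hive
> picks up the rows without any Spark-specific table properties.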
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 February 2018 at 07:54, ☼ R Nair (रविशंकर नायर) <
> ravishankar.n...@gmail.com> wrote:
>
>> Hi,
>> Here you go:
>>
>> hive> show create table mine;
>> OK
>> CREATE TABLE `mine`(
>>   `policyid` int,
>>   `statecode` string,
>>   `socialid` string,
>>   `county` string,
>>   `eq_site_limit` decimal(10,2),
>>   `hu_site_limit` decimal(10,2),
>>   `fl_site_limit` decimal(10,2),
>>   `fr_site_limit` decimal(10,2),
>>   `tiv_2014` decimal(10,2),
>>   `tiv_2015` decimal(10,2),
>>   `eq_site_deductible` int,
>>   `hu_site_deductible` int,
>>   `fl_site_deductible` int,
>>   `fr_site_deductible` int,
>>   `latitude` decimal(6,6),
>>   `longitude` decimal(6,6),
>>   `line` string,
>>   `construction` string,
>>   `point_granularity` int)
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>> WITH SERDEPROPERTIES (
>>   'path'='hdfs://localhost:8020/user/hive/warehouse/mine')
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>> LOCATION
>>   'file:/Users/ravishankarnair/spark-warehouse/mine'
>> TBLPROPERTIES (
>>   'spark.sql.sources.provider'='parquet',
>>   'spark.sql.sources.schema.numParts'='1',
>>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[
>>     {\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},
>>     {\"name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"statecode\",\"scale\":0}},
>>     {\"name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"scale\":0}},
>>     {\"name\":\"county\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\":0}},
>>     {\"name\":\"eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
>>     {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"scale\":2}},
>>     {\"name\":\"fl_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},
>>     {\"name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"scale\":2}},
>>     {\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},
>>     {\"name\":\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},
>>     {\"name\":\"eq_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"scale\":0}},
>>     {\"name\":\"hu_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},
>>     {\"name\":\"fl_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}},
>>     {\"name\":\"fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},
>>     {\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},
>>     {\"name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},
>>     {\"name\":\"line\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",\"scale\":0}},
>>     {\"name\":\"construction\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"construction\",\"scale\":0}},
>>     {\"name\":\"point_granularity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"point_granularity\",\"scale\":0}}]}',
>>   'transient_lastDdlTime'='1518335598')
>> Time taken: 0.13 seconds, Fetched: 35 row(s)
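>>
>> Worth noting in the output above: LOCATION points at file:/Users/... on
>> the local filesystem, while the 'path' serde property points at
>> hdfs://localhost:8020/... Hive reads from LOCATION and Spark resolves the
>> datasource 'path', which would explain Hive returning no rows. A quick
>> check from the Spark side (a sketch, assuming the session is named spark):
>>
>> spark.conf.get("spark.sql.warehouse.dir")
>> spark.sql("DESCRIBE FORMATTED default.mine").show(100, false)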
>>
>> On Sun, Feb 11, 2018 at 2:36 AM, Shmuel Blitz <
>> shmuel.bl...@similarweb.com> wrote:
>>
>>> Please run the following command, and paste the result:
>>> SHOW CREATE TABLE <<TABLE-NAME>>
>>>
>>> On Sun, Feb 11, 2018 at 7:56 AM, ☼ R Nair (रविशंकर नायर) <
>>> ravishankar.n...@gmail.com> wrote:
>>>
>>>> No, no luck.
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Feb 11, 2018 at 12:48 AM, Deepak Sharma <deepakmc...@gmail.com>
>>>> wrote:
>>>>
>>>>> In the Hive CLI:
>>>>> msck repair table <table_name>;
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Feb 11, 2018 11:14, "☼ R Nair (रविशंकर नायर)" <
>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>
>>>>>> No. Can you please explain the command? Let me try now.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma <
>>>>>> deepakmc...@gmail.com> wrote:
>>>>>>
>>>>>>> I am not sure about the exact issue, but I see you are partitioning
>>>>>>> while writing from Spark.
>>>>>>> Did you try msck repair on the table before reading it in Hive?
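>>>>>>>
>>>>>>> As a sketch (table name assumed) in the Hive CLI:
>>>>>>>
>>>>>>> hive> MSCK REPAIR TABLE default.mine;
>>>>>>>
>>>>>>> Note that this only synchronizes partition directories found on HDFS
>>>>>>> with the metastore, so it can only help if the table is partitioned.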
>>>>>>>
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" <
>>>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>>>
>>>>>>>> All,
>>>>>>>>
>>>>>>>> Thanks for the inputs. I am still not successful. I think we need
>>>>>>>> to resolve this, as it is a very common requirement.
>>>>>>>>
>>>>>>>> Please go through my complete code:
>>>>>>>>
>>>>>>>> STEP 1: Started the Spark shell as spark-shell --master yarn
>>>>>>>>
>>>>>>>> STEP 2: The following code is given as input to the Spark shell:
>>>>>>>>
>>>>>>>> import org.apache.spark.sql.Row
>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>> val warehouseLocation = "/user/hive/warehouse"
>>>>>>>>
>>>>>>>> val spark = SparkSession.builder().
>>>>>>>>   appName("Spark Hive Example").
>>>>>>>>   config("spark.sql.warehouse.dir", warehouseLocation).
>>>>>>>>   enableHiveSupport().
>>>>>>>>   getOrCreate()
>>>>>>>>
>>>>>>>> import org.apache.spark.sql._
>>>>>>>> val passion_df = spark.read.
>>>>>>>>   format("jdbc").
>>>>>>>>   option("url", "jdbc:mysql://localhost:3307/policies").
>>>>>>>>   option("driver", "com.mysql.jdbc.Driver").
>>>>>>>>   option("user", "root").
>>>>>>>>   option("password", "root").
>>>>>>>>   option("dbtable", "insurancedetails").
>>>>>>>>   option("partitionColumn", "policyid").
>>>>>>>>   option("lowerBound", "1").
>>>>>>>>   option("upperBound", "100000").
>>>>>>>>   option("numPartitions", "4").
>>>>>>>>   load()
>>>>>>>> //Made sure that passion_df is created, as passion_df.show(5) shows
>>>>>>>> me correct data.
>>>>>>>> passion_df.write.saveAsTable("default.mine") //Default parquet
>>>>>>>>
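>>>>>>>> As an aside, a hedged variant of the write (the explicit path and the
>>>>>>>> warehouse URI below are assumptions, not verified settings) that pins
>>>>>>>> the data files to HDFS rather than the local spark-warehouse
>>>>>>>> directory:
>>>>>>>>
>>>>>>>> passion_df.write.
>>>>>>>>   mode("overwrite").
>>>>>>>>   option("path", "hdfs://localhost:8020/user/hive/warehouse/mine").
>>>>>>>>   saveAsTable("default.mine")
>>>>>>>>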
>>>>>>>> STEP 3: Went to HIVE. Started HIVE prompt.
>>>>>>>>
>>>>>>>> hive> show tables;
>>>>>>>> OK
>>>>>>>> callcentervoicelogs
>>>>>>>> mine
>>>>>>>> Time taken: 0.035 seconds, Fetched: 2 row(s)
>>>>>>>> //As you can see HIVE is showing the table "mine" in default schema.
>>>>>>>>
>>>>>>>> STEP 4: HERE IS THE PROBLEM.
>>>>>>>>
>>>>>>>> hive> select * from mine;
>>>>>>>> OK
>>>>>>>> Time taken: 0.354 seconds
>>>>>>>> hive>
>>>>>>>> //Where is the data ???
>>>>>>>>
>>>>>>>> STEP 5:
>>>>>>>>
>>>>>>>> See the below command on HIVE
>>>>>>>>
>>>>>>>> describe formatted mine;
>>>>>>>> OK
>>>>>>>> # col_name             data_type           comment
>>>>>>>>
>>>>>>>> policyid             int
>>>>>>>> statecode           string
>>>>>>>> socialid             string
>>>>>>>> county               string
>>>>>>>> eq_site_limit       decimal(10,2)
>>>>>>>> hu_site_limit       decimal(10,2)
>>>>>>>> fl_site_limit       decimal(10,2)
>>>>>>>> fr_site_limit       decimal(10,2)
>>>>>>>> tiv_2014             decimal(10,2)
>>>>>>>> tiv_2015             decimal(10,2)
>>>>>>>> eq_site_deductible   int
>>>>>>>> hu_site_deductible   int
>>>>>>>> fl_site_deductible   int
>>>>>>>> fr_site_deductible   int
>>>>>>>> latitude             decimal(6,6)
>>>>>>>> longitude           decimal(6,6)
>>>>>>>> line                 string
>>>>>>>> construction         string
>>>>>>>> point_granularity   int
>>>>>>>>
>>>>>>>> # Detailed Table Information
>>>>>>>> Database:           default
>>>>>>>> Owner:               ravishankarnair
>>>>>>>> CreateTime:         Sun Feb 11 00:26:40 EST 2018
>>>>>>>> LastAccessTime:     UNKNOWN
>>>>>>>> Protect Mode:       None
>>>>>>>> Retention:           0
>>>>>>>> Location:           file:/Users/ravishankarnair/spark-warehouse/mine
>>>>>>>> Table Type:         MANAGED_TABLE
>>>>>>>> Table Parameters:
>>>>>>>> spark.sql.sources.provider parquet
>>>>>>>> spark.sql.sources.schema.numParts 1
>>>>>>>> spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[
>>>>>>>>   {\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},
>>>>>>>>   {\"name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"statecode\",\"scale\":0}},
>>>>>>>>   {\"name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"scale\":0}},
>>>>>>>>   {\"name\":\"county\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\":0}},
>>>>>>>>   {\"name\":\"eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
>>>>>>>>   {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"scale\":2}},
>>>>>>>>   {\"name\":\"fl_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},
>>>>>>>>   {\"name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"scale\":2}},
>>>>>>>>   {\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},
>>>>>>>>   {\"name\":\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},
>>>>>>>>   {\"name\":\"eq_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"scale\":0}},
>>>>>>>>   {\"name\":\"hu_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},
>>>>>>>>   {\"name\":\"fl_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}},
>>>>>>>>   {\"name\":\"fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},
>>>>>>>>   {\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},
>>>>>>>>   {\"name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},
>>>>>>>>   {\"name\":\"line\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",\"scale\":0}},
>>>>>>>>   {\"name\":\"construction\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"construction\",\"scale\":0}},
>>>>>>>>   {\"name\":\"point_granularity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"point_granularity\",\"scale\":0}}]}
>>>>>>>> transient_lastDdlTime 1518326800
>>>>>>>>
>>>>>>>> # Storage Information
>>>>>>>> SerDe Library:       org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>>>>>>> InputFormat:         org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>>>>>>> OutputFormat:        org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>>>>>>>> Compressed:         No
>>>>>>>> Num Buckets:         -1
>>>>>>>> Bucket Columns:     []
>>>>>>>> Sort Columns:       []
>>>>>>>> Storage Desc Params:
>>>>>>>> path                 hdfs://localhost:8020/user/hive/warehouse/mine
>>>>>>>> serialization.format 1
>>>>>>>> Time taken: 0.077 seconds, Fetched: 48 row(s)
>>>>>>>>
>>>>>>>> Now I seek your advice and support. What's the issue? Am I doing
>>>>>>>> something wrong, or is it a bug? I am using Spark 2.2.1, Hive 1.2.1,
>>>>>>>> Hadoop 2.7.3. All classpaths and configurations are set properly.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Ravion
>>>>>>>>
>>>>>>>> On Fri, Feb 9, 2018 at 1:29 PM, Nicholas Hakobian <
>>>>>>>> nicholas.hakob...@rallyhealth.com> wrote:
>>>>>>>>
>>>>>>>>> It's possible that the format of your table is not compatible with
>>>>>>>>> your version of Hive, so Spark saved it in a way that only Spark can
>>>>>>>>> read. When this happens it prints out a very visible warning letting
>>>>>>>>> you know.
>>>>>>>>>
>>>>>>>>> We've seen it most frequently when trying to save a Parquet file
>>>>>>>>> with a column in date format into a Hive table. In older versions of
>>>>>>>>> Hive, the Parquet reader/writer did not support date types (among a
>>>>>>>>> couple of others).
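>>>>>>>>>
>>>>>>>>> If that turns out to be the cause, a minimal workaround sketch (the
>>>>>>>>> column name created_date is hypothetical) is to cast the offending
>>>>>>>>> column to a type the older Hive reader understands before saving:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.functions.col
>>>>>>>>> df.withColumn("created_date", col("created_date").cast("timestamp")).
>>>>>>>>>   write.saveAsTable("default.mine_compat")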
>>>>>>>>>
>>>>>>>>> Nicholas Szandor Hakobian, Ph.D.
>>>>>>>>> Staff Data Scientist
>>>>>>>>> Rally Health
>>>>>>>>> nicholas.hakob...@rallyhealth.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 9, 2018 at 9:59 AM, Prakash Joshi <
>>>>>>>>> prakashcjos...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ravi,
>>>>>>>>>>
>>>>>>>>>> Can you send the result of
>>>>>>>>>> Show create table your_table_name
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Prakash
>>>>>>>>>>
>>>>>>>>>> On Feb 9, 2018 8:20 PM, "☼ R Nair (रविशंकर नायर)" <
>>>>>>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> All,
>>>>>>>>>>>
>>>>>>>>>>> I have been on this issue continuously for three days and am not
>>>>>>>>>>> getting any clue.
>>>>>>>>>>>
>>>>>>>>>>> Environment: Spark 2.2.x, all configurations are correct.
>>>>>>>>>>> hive-site.xml is in spark's conf.
>>>>>>>>>>>
>>>>>>>>>>> 1) Step 1: I created a data frame DF1 reading a csv file.
>>>>>>>>>>>
>>>>>>>>>>> 2) Did manipulations on DF1. The resulting frame is passion_df.
>>>>>>>>>>>
>>>>>>>>>>> 3) passion_df.write.format("orc").saveAsTable("sampledb.passion")
>>>>>>>>>>>
>>>>>>>>>>> 4) The metastore shows the Hive table; when I do "show tables"
>>>>>>>>>>> in Hive, I can see the table name.
>>>>>>>>>>>
>>>>>>>>>>> 5) I can't select in Hive, though I can select from Spark with
>>>>>>>>>>> spark.sql("select * from sampledb.passion").
>>>>>>>>>>>
>>>>>>>>>>> What's going on here? Please help. Why am I not seeing the data
>>>>>>>>>>> from the Hive prompt?
>>>>>>>>>>> The "describe formatted" command on the table in Hive shows the
>>>>>>>>>>> data is in the default warehouse location (/user/hive/warehouse),
>>>>>>>>>>> since I set it.
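>>>>>>>>>>>
>>>>>>>>>>> One quick check (the warehouse layout below is the usual default,
>>>>>>>>>>> so treat the path as an assumption) is whether data files actually
>>>>>>>>>>> exist where Hive is looking, straight from the Hive CLI:
>>>>>>>>>>>
>>>>>>>>>>> hive> dfs -ls /user/hive/warehouse/sampledb.db/passion;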
>>>>>>>>>>>
>>>>>>>>>>> I am not getting any definite answer anywhere. Many suggestions
>>>>>>>>>>> and answers are given on Stack Overflow et al. Nothing really works.
>>>>>>>>>>>
>>>>>>>>>>> So asking experts here for some light on this, thanks
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ravion
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>> --
>>> Shmuel Blitz
>>> Big Data Developer
>>> Email: shmuel.bl...@similarweb.com
>>> www.similarweb.com
>>>

