Simple question: have you created the table through Spark SQL or Hive?

I recall similar issues a while back.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println("\nStarted at")
hiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
//
// Need to create and populate target ORC table test.dummy2 in Hive
//
hiveContext.sql("use test")
//
// Drop and create table
//
hiveContext.sql("DROP TABLE IF EXISTS test.dummy2")
val sqltext = """
CREATE TABLE test.dummy2
 (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
--CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="10000" )
"""
hiveContext.sql(sqltext)
//
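To populate it from a DataFrame while keeping the table readable by Hive,
a minimal sketch (assuming Spark 2.x and a hypothetical DataFrame df whose
columns match test.dummy2) is to register a temp view and insert through
the Hive context rather than calling saveAsTable:

// df is a hypothetical DataFrame whose columns match test.dummy2
df.createOrReplaceTempView("tmp")
hiveContext.sql("INSERT INTO TABLE test.dummy2 SELECT * FROM tmp")

Because the table itself was created with Hive DDL, Hive sees a plain ORC
table and can query the inserted rows directly.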

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





On 11 February 2018 at 07:54, ☼ R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com> wrote:

> Hi,
> Here you go:
>
> hive> show create table mine;
> OK
> CREATE TABLE `mine`(
>   `policyid` int,
>   `statecode` string,
>   `socialid` string,
>   `county` string,
>   `eq_site_limit` decimal(10,2),
>   `hu_site_limit` decimal(10,2),
>   `fl_site_limit` decimal(10,2),
>   `fr_site_limit` decimal(10,2),
>   `tiv_2014` decimal(10,2),
>   `tiv_2015` decimal(10,2),
>   `eq_site_deductible` int,
>   `hu_site_deductible` int,
>   `fl_site_deductible` int,
>   `fr_site_deductible` int,
>   `latitude` decimal(6,6),
>   `longitude` decimal(6,6),
>   `line` string,
>   `construction` string,
>   `point_granularity` int)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='hdfs://localhost:8020/user/hive/warehouse/mine')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/Users/ravishankarnair/spark-warehouse/mine'
> TBLPROPERTIES (
>   'spark.sql.sources.provider'='parquet',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[
>     {\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},
>     {\"name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"statecode\",\"scale\":0}},
>     {\"name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"scale\":0}},
>     {\"name\":\"county\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\":0}},
>     {\"name\":\"eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
>     {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"scale\":2}},
>     {\"name\":\"fl_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},
>     {\"name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"scale\":2}},
>     {\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},
>     {\"name\":\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},
>     {\"name\":\"eq_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"scale\":0}},
>     {\"name\":\"hu_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},
>     {\"name\":\"fl_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}},
>     {\"name\":\"fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},
>     {\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},
>     {\"name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},
>     {\"name\":\"line\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",\"scale\":0}},
>     {\"name\":\"construction\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"construction\",\"scale\":0}},
>     {\"name\":\"point_granularity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"point_granularity\",\"scale\":0}}]}',
>   'transient_lastDdlTime'='1518335598')
> Time taken: 0.13 seconds, Fetched: 35 row(s)
>
> On Sun, Feb 11, 2018 at 2:36 AM, Shmuel Blitz <shmuel.bl...@similarweb.com> wrote:
>
>> Please run the following command, and paste the result:
>> SHOW CREATE TABLE <<TABLE-NAME>>
>>
>> On Sun, Feb 11, 2018 at 7:56 AM, ☼ R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com> wrote:
>>
>>> No, No luck.
>>>
>>> Thanks
>>>
>>> On Sun, Feb 11, 2018 at 12:48 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>
>>>> In hive cli:
>>>> msck repair table <table_name>;
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Feb 11, 2018 11:14, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote:
>>>>
>>>>> No, can you please explain the command? Let me try now.
>>>>>
>>>>> Best,
>>>>>
>>>>> On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>
>>>>>> I am not sure about the exact issue, but I see you are partitioning
>>>>>> while writing from Spark.
>>>>>> Did you try msck repair on the table before reading it in Hive?
>>>>>>
>>>>>> Thanks
>>>>>> Deepak
>>>>>>
>>>>>> On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote:
>>>>>>
>>>>>>> All,
>>>>>>>
>>>>>>> Thanks for the inputs. Again, I was not successful. I think we need
>>>>>>> to resolve this, as this is a very common requirement.
>>>>>>>
>>>>>>> Please go through my complete code:
>>>>>>>
>>>>>>> STEP 1:  Started Spark shell as spark-shell --master yarn
>>>>>>>
>>>>>>> STEP 2: The following code is given as input to the Spark shell:
>>>>>>>
>>>>>>> import org.apache.spark.sql.Row
>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>> val warehouseLocation ="/user/hive/warehouse"
>>>>>>>
>>>>>>> val spark = SparkSession.builder().appName("Spark Hive
>>>>>>> Example").config("spark.sql.warehouse.dir",
>>>>>>> warehouseLocation).enableHiveSupport().getOrCreate()
>>>>>>>
>>>>>>> import org.apache.spark.sql._
>>>>>>> val passion_df = spark.read.
>>>>>>>   format("jdbc").
>>>>>>>   option("url", "jdbc:mysql://localhost:3307/policies").
>>>>>>>   option("driver", "com.mysql.jdbc.Driver").
>>>>>>>   option("user", "root").
>>>>>>>   option("password", "root").
>>>>>>>   option("dbtable", "insurancedetails").
>>>>>>>   option("partitionColumn", "policyid").
>>>>>>>   option("lowerBound", "1").
>>>>>>>   option("upperBound", "100000").
>>>>>>>   option("numPartitions", "4").
>>>>>>>   load()
>>>>>>> //Made sure that passion_df is created, as passion_df.show(5) shows
>>>>>>> me correct data.
>>>>>>> passion_df.write.saveAsTable("default.mine") //Default parquet
>>>>>>>
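>>>>>>> (A hedged alternative at this step, assuming Spark 2.2+: writing
>>>>>>> through the Hive serde instead of Spark's native Parquet datasource
>>>>>>> usually yields a table Hive itself can read.)
>>>>>>>
>>>>>>> // hypothetical alternative to the plain saveAsTable above
>>>>>>> passion_df.write.format("hive").mode("overwrite").saveAsTable("default.mine")
>>>>>>>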
>>>>>>> STEP 3: Went to HIVE. Started HIVE prompt.
>>>>>>>
>>>>>>> hive> show tables;
>>>>>>> OK
>>>>>>> callcentervoicelogs
>>>>>>> mine
>>>>>>> Time taken: 0.035 seconds, Fetched: 2 row(s)
>>>>>>> //As you can see HIVE is showing the table "mine" in default schema.
>>>>>>>
>>>>>>> STEP 4: HERE IS THE PROBLEM.
>>>>>>>
>>>>>>> hive> select * from mine;
>>>>>>> OK
>>>>>>> Time taken: 0.354 seconds
>>>>>>> hive>
>>>>>>> //Where is the data ???
>>>>>>>
>>>>>>> STEP 5:
>>>>>>>
>>>>>>> See the below command on HIVE
>>>>>>>
>>>>>>> describe formatted mine;
>>>>>>> OK
>>>>>>> # col_name             data_type           comment
>>>>>>>
>>>>>>> policyid             int
>>>>>>> statecode           string
>>>>>>> socialid             string
>>>>>>> county               string
>>>>>>> eq_site_limit       decimal(10,2)
>>>>>>> hu_site_limit       decimal(10,2)
>>>>>>> fl_site_limit       decimal(10,2)
>>>>>>> fr_site_limit       decimal(10,2)
>>>>>>> tiv_2014             decimal(10,2)
>>>>>>> tiv_2015             decimal(10,2)
>>>>>>> eq_site_deductible   int
>>>>>>> hu_site_deductible   int
>>>>>>> fl_site_deductible   int
>>>>>>> fr_site_deductible   int
>>>>>>> latitude             decimal(6,6)
>>>>>>> longitude           decimal(6,6)
>>>>>>> line                 string
>>>>>>> construction         string
>>>>>>> point_granularity   int
>>>>>>>
>>>>>>> # Detailed Table Information
>>>>>>> Database:           default
>>>>>>> Owner:               ravishankarnair
>>>>>>> CreateTime:         Sun Feb 11 00:26:40 EST 2018
>>>>>>> LastAccessTime:     UNKNOWN
>>>>>>> Protect Mode:       None
>>>>>>> Retention:           0
>>>>>>> Location:           file:/Users/ravishankarnair/spark-warehouse/mine
>>>>>>>
>>>>>>> Table Type:         MANAGED_TABLE
>>>>>>> Table Parameters:
>>>>>>> spark.sql.sources.provider parquet
>>>>>>> spark.sql.sources.schema.numParts 1
>>>>>>> spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[
>>>>>>>   {\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},
>>>>>>>   {\"name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"statecode\",\"scale\":0}},
>>>>>>>   {\"name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"scale\":0}},
>>>>>>>   {\"name\":\"county\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\":0}},
>>>>>>>   {\"name\":\"eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
>>>>>>>   {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"scale\":2}},
>>>>>>>   {\"name\":\"fl_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},
>>>>>>>   {\"name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"scale\":2}},
>>>>>>>   {\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},
>>>>>>>   {\"name\":\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},
>>>>>>>   {\"name\":\"eq_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"scale\":0}},
>>>>>>>   {\"name\":\"hu_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},
>>>>>>>   {\"name\":\"fl_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}},
>>>>>>>   {\"name\":\"fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},
>>>>>>>   {\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},
>>>>>>>   {\"name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},
>>>>>>>   {\"name\":\"line\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",\"scale\":0}},
>>>>>>>   {\"name\":\"construction\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"construction\",\"scale\":0}},
>>>>>>>   {\"name\":\"point_granularity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"point_granularity\",\"scale\":0}}]}
>>>>>>> transient_lastDdlTime 1518326800
>>>>>>>
>>>>>>> # Storage Information
>>>>>>> SerDe Library:       org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>>>>>> InputFormat:         org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>>>>>> OutputFormat:        org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>>>>>>> Compressed:         No
>>>>>>> Num Buckets:         -1
>>>>>>> Bucket Columns:     []
>>>>>>> Sort Columns:       []
>>>>>>> Storage Desc Params:
>>>>>>> path                 hdfs://localhost:8020/user/hive/warehouse/mine
>>>>>>> serialization.format 1
>>>>>>> Time taken: 0.077 seconds, Fetched: 48 row(s)
>>>>>>>
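>>>>>>> (A hedged observation on the output above: the table's Location:
>>>>>>> points at a local directory, file:/Users/ravishankarnair/spark-warehouse/mine,
>>>>>>> while the serde "path" parameter points at
>>>>>>> hdfs://localhost:8020/user/hive/warehouse/mine. A minimal sketch to
>>>>>>> check which location actually holds the Parquet files, from the
>>>>>>> Spark shell:)
>>>>>>>
>>>>>>> // list the HDFS warehouse directory the serde "path" points at
>>>>>>> import org.apache.hadoop.fs.{FileSystem, Path}
>>>>>>> val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
>>>>>>> fs.listStatus(new Path("/user/hive/warehouse/mine")).foreach(s => println(s.getPath))
>>>>>>>
>>>>>>> (If Hive's Location: and the actual files disagree, Hive will likely
>>>>>>> return zero rows, which matches the symptom above.)
>>>>>>>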
>>>>>>> Now I seek your advice and support. What's the issue? Am I doing
>>>>>>> something wrong, or is it a bug? I am using Spark 2.2.1, Hive 1.2.1,
>>>>>>> Hadoop 2.7.3. All classpath and configuration settings are correct.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Ravion
>>>>>>>
>>>>>>> On Fri, Feb 9, 2018 at 1:29 PM, Nicholas Hakobian <nicholas.hakob...@rallyhealth.com> wrote:
>>>>>>>
>>>>>>>> It's possible that the format of your table is not compatible with
>>>>>>>> your version of Hive, so Spark saved it in a way that only Spark can
>>>>>>>> read. When this happens, it prints a very visible warning letting
>>>>>>>> you know.
>>>>>>>>
>>>>>>>> We've seen it most frequently when trying to save a Parquet file
>>>>>>>> with a column in date format into a Hive table. In older versions of
>>>>>>>> Hive, the Parquet reader/writer did not support date formats (among
>>>>>>>> a couple of others).
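>>>>>>>>
>>>>>>>> (A hedged sketch of the usual workaround for the related decimal
>>>>>>>> case, assuming it applies here: the table has many decimal(10,2)
>>>>>>>> columns, and older Hive Parquet readers generally understand only
>>>>>>>> the legacy decimal encoding. Setting this before the write tells
>>>>>>>> Spark to use it:)
>>>>>>>>
>>>>>>>> // write Parquet decimals in the legacy, Hive-readable layout
>>>>>>>> spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")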
>>>>>>>>
>>>>>>>> Nicholas Szandor Hakobian, Ph.D.
>>>>>>>> Staff Data Scientist
>>>>>>>> Rally Health
>>>>>>>> nicholas.hakob...@rallyhealth.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 9, 2018 at 9:59 AM, Prakash Joshi <prakashcjos...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ravi,
>>>>>>>>>
>>>>>>>>> Can you send the result of
>>>>>>>>> Show create table your_table_name
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Prakash
>>>>>>>>>
>>>>>>>>> On Feb 9, 2018 8:20 PM, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> All,
>>>>>>>>>>
>>>>>>>>>> I have been on this issue for three days straight and am not
>>>>>>>>>> getting any clue.
>>>>>>>>>>
>>>>>>>>>> Environment: Spark 2.2.x, all configurations are correct.
>>>>>>>>>> hive-site.xml is in spark's conf.
>>>>>>>>>>
>>>>>>>>>> 1) I created a data frame DF1 by reading a CSV file.
>>>>>>>>>>
>>>>>>>>>> 2) Did some manipulations on DF1; the resulting frame is passion_df.
>>>>>>>>>>
>>>>>>>>>> 3) passion_df.write.format("orc").saveAsTable("sampledb.passion")
>>>>>>>>>>
>>>>>>>>>> 4) The metastore shows the Hive table; when I do "show tables"
>>>>>>>>>> in HIVE, I can see the table name.
>>>>>>>>>>
>>>>>>>>>> 5) I can't select from the table in HIVE, though I can select from
>>>>>>>>>> SPARK as spark.sql("select * from sampledb.passion")
>>>>>>>>>>
>>>>>>>>>> What's going on here? Please help. Why am I not seeing data from
>>>>>>>>>> the HIVE prompt?
>>>>>>>>>> The "describe formatted" command on the table in HIVE shows the
>>>>>>>>>> data is in the default warehouse location (/user/hive/warehouse),
>>>>>>>>>> since I set it.
>>>>>>>>>>
>>>>>>>>>> I am not getting any definite answer anywhere. Many suggestions
>>>>>>>>>> and answers are given on Stack Overflow et al., but nothing really
>>>>>>>>>> works.
>>>>>>>>>>
>>>>>>>>>> So asking experts here for some light on this, thanks
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Ravion
>>>>>>>>>>
>> --
>> Shmuel Blitz
>> Big Data Developer
>> Email: shmuel.bl...@similarweb.com
>> www.similarweb.com
>>
