In hive cli: msck repair table <table_name>;

Thanks
Deepak
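For concreteness, MSCK REPAIR scans a partitioned table's directory tree and registers any partition directories that exist on storage but are missing from the metastore. A sketch against a hypothetical partitioned table (table name and partition layout are assumptions, not from this thread):

```sql
-- Hypothetical table `sales`, partitioned by dt. MSCK REPAIR TABLE adds
-- partition directories (e.g. .../sales/dt=2018-02-10/) that exist on
-- disk but are not yet known to the Hive metastore.
MSCK REPAIR TABLE sales;

-- Verify the partitions were picked up:
SHOW PARTITIONS sales;
```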
On Feb 11, 2018 11:14, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote:

> No, can you please explain the command? Let me try now.
>
> Best,
>
> On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> I am not sure about the exact issue, but I see you are partitioning while
>> writing from Spark.
>> Did you try msck repair on the table before reading it in Hive?
>>
>> Thanks
>> Deepak
>>
>> On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote:
>>
>>> All,
>>>
>>> Thanks for the inputs. Again I am not successful. I think we need to
>>> resolve this, as this is a very common requirement.
>>>
>>> Please go through my complete code:
>>>
>>> STEP 1: Started the Spark shell as spark-shell --master yarn
>>>
>>> STEP 2: The following code is given as input to the Spark shell:
>>>
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.SparkSession
>>>
>>> val warehouseLocation = "/user/hive/warehouse"
>>>
>>> val spark = SparkSession.builder().appName("Spark Hive Example")
>>>   .config("spark.sql.warehouse.dir", warehouseLocation)
>>>   .enableHiveSupport()
>>>   .getOrCreate()
>>>
>>> import org.apache.spark.sql._
>>>
>>> var passion_df = spark.read.
>>>   format("jdbc").
>>>   option("url", "jdbc:mysql://localhost:3307/policies").
>>>   option("driver", "com.mysql.jdbc.Driver").
>>>   option("user", "root").
>>>   option("password", "root").
>>>   option("dbtable", "insurancedetails").
>>>   option("partitionColumn", "policyid").
>>>   option("lowerBound", "1").
>>>   option("upperBound", "100000").
>>>   option("numPartitions", "4").
>>>   load()
>>>
>>> // Made sure that passion_df is created, as passion_df.show(5) shows me
>>> // correct data.
>>>
>>> passion_df.write.saveAsTable("default.mine") // Default parquet
>>>
>>> STEP 3: Went to Hive. Started the Hive prompt.
>>>
>>> hive> show tables;
>>> OK
>>> callcentervoicelogs
>>> mine
>>> Time taken: 0.035 seconds, Fetched: 2 row(s)
>>> // As you can see, Hive is showing the table "mine" in the default schema.
>>>
>>> STEP 4: HERE IS THE PROBLEM.
>>>
>>> hive> select * from mine;
>>> OK
>>> Time taken: 0.354 seconds
>>> hive>
>>> // Where is the data???
>>>
>>> STEP 5:
>>>
>>> See the below command on Hive:
>>>
>>> describe formatted mine;
>>> OK
>>> # col_name              data_type       comment
>>>
>>> policyid                int
>>> statecode               string
>>> socialid                string
>>> county                  string
>>> eq_site_limit           decimal(10,2)
>>> hu_site_limit           decimal(10,2)
>>> fl_site_limit           decimal(10,2)
>>> fr_site_limit           decimal(10,2)
>>> tiv_2014                decimal(10,2)
>>> tiv_2015                decimal(10,2)
>>> eq_site_deductible      int
>>> hu_site_deductible      int
>>> fl_site_deductible      int
>>> fr_site_deductible      int
>>> latitude                decimal(6,6)
>>> longitude               decimal(6,6)
>>> line                    string
>>> construction            string
>>> point_granularity       int
>>>
>>> # Detailed Table Information
>>> Database:               default
>>> Owner:                  ravishankarnair
>>> CreateTime:             Sun Feb 11 00:26:40 EST 2018
>>> LastAccessTime:         UNKNOWN
>>> Protect Mode:           None
>>> Retention:              0
>>> Location:               file:/Users/ravishankarnair/spark-warehouse/mine
>>> Table Type:             MANAGED_TABLE
>>> Table Parameters:
>>>   spark.sql.sources.provider          parquet
>>>   spark.sql.sources.schema.numParts   1
>>>   spark.sql.sources.schema.part.0
>>>     {\"type\":\"struct\",\"fields\":[
>>>     {\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},
>>>     {\"name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"statecode\",\"scale\":0}},
>>>     {\"name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"scale\":0}},
>>>     {\"name\":\"county\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\":0}},
>>>     {\"name\":\"eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
>>>     {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"scale\":2}},
>>>     {\"name\":\"fl_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},
>>>     {\"name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"scale\":2}},
>>>     {\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},
>>>     {\"name\":\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},
>>>     {\"name\":\"eq_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\",\"scale\":0}},
>>>     {\"name\":\"hu_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},
>>>     {\"name\":\"fl_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}},
>>>     {\"name\":\"fr_site_deductible\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_site_deductible\",\"scale\":0}},
>>>     {\"name\":\"latitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"latitude\",\"scale\":6}},
>>>     {\"name\":\"longitude\",\"type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"longitude\",\"scale\":6}},
>>>     {\"name\":\"line\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",\"scale\":0}},
>>>     {\"name\":\"construction\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"name\":\"construction\",\"scale\":0}},
>>>     {\"name\":\"point_granularity\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":\"point_granularity\",\"scale\":0}}]}
>>>   transient_lastDdlTime               1518326800
>>>
>>> # Storage Information
>>> SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>> InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>> OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>>> Compressed:             No
>>> Num Buckets:            -1
>>> Bucket Columns:         []
>>> Sort Columns:           []
>>> Storage Desc Params:
>>>   path                  hdfs://localhost:8020/user/hive/warehouse/mine
>>>   serialization.format  1
>>> Time taken: 0.077 seconds, Fetched: 48 row(s)
>>>
>>> Now I seek your advice and support. What's the issue? Am I doing
>>> something wrong, or is it a bug? I am using Spark 2.2.1, Hive 1.2.1,
>>> Hadoop 2.7.3. All classpaths and configurations are set properly.
>>>
>>> Best,
>>>
>>> Ravion
>>>
>>> On Fri, Feb 9, 2018 at 1:29 PM, Nicholas Hakobian <nicholas.hakob...@rallyhealth.com> wrote:
>>>
>>>> It's possible that the format of your table is not compatible with your
>>>> version of Hive, so Spark saved it in a way such that only Spark can
>>>> read it. When this happens it prints out a very visible warning letting
>>>> you know this has happened.
>>>>
>>>> We've seen it most frequently when trying to save a parquet file with a
>>>> column in date format into a Hive table. In older versions of Hive, its
>>>> Parquet reader/writer did not support date formats (among a couple of
>>>> others).
>>>>
>>>> Nicholas Szandor Hakobian, Ph.D.
>>>> Staff Data Scientist
>>>> Rally Health
>>>> nicholas.hakob...@rallyhealth.com
>>>>
>>>> On Fri, Feb 9, 2018 at 9:59 AM, Prakash Joshi <prakashcjos...@gmail.com> wrote:
>>>>
>>>>> Ravi,
>>>>>
>>>>> Can you send the result of:
>>>>> show create table your_table_name
>>>>>
>>>>> Thanks
>>>>> Prakash
>>>>>
>>>>> On Feb 9, 2018 8:20 PM, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> It has been three days continuously that I am on this issue. Not
>>>>>> getting any clue.
>>>>>>
>>>>>> Environment: Spark 2.2.x, all configurations are correct.
>>>>>> hive-site.xml is in Spark's conf directory.
>>>>>>
>>>>>> 1) I created a data frame DF1 reading a csv file.
>>>>>>
>>>>>> 2) Did manipulations on DF1. The resulting frame is passion_df.
>>>>>>
>>>>>> 3) passion_df.write.format("orc").saveAsTable("sampledb.passion")
>>>>>>
>>>>>> 4) The metastore shows the Hive table; when I do "show tables" in
>>>>>> Hive, I can see the table name.
>>>>>>
>>>>>> 5) I can't select in Hive, though I can select from Spark as
>>>>>> spark.sql("select * from sampledb.passion")
>>>>>>
>>>>>> What's going on here? Please help. Why am I not seeing data from the
>>>>>> Hive prompt? The "describe formatted" command on the table in Hive
>>>>>> shows the data is in the default warehouse location
>>>>>> (/user/hive/warehouse), since I set it.
>>>>>>
>>>>>> I am not getting any definite answer anywhere. Many suggestions and
>>>>>> answers are given on Stack Overflow et al. Nothing really works.
>>>>>>
>>>>>> So asking experts here for some light on this, thanks.
>>>>>>
>>>>>> Best,
>>>>>> Ravion
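A workaround often suggested for this symptom is to materialize the data as a Hive-native table rather than a Spark datasource table, so older Hive versions can read it with their own serde. A sketch (editorial, not from the thread: the new table name is hypothetical, and it assumes the statement is run through spark.sql(...) in a session built with enableHiveSupport()):

```sql
-- Hive-style CTAS, run via spark.sql(...). Writing STORED AS ORC uses
-- Hive's own ORC serde instead of Spark's datasource format, so the
-- resulting table is readable from the Hive CLI as well as from Spark.
-- Table name `passion_hive` is hypothetical.
CREATE TABLE sampledb.passion_hive
STORED AS ORC
AS SELECT * FROM sampledb.passion;
```

Separately, note that the describe output earlier in the thread shows Location as file:/Users/ravishankarnair/spark-warehouse/mine (a local path) while the path storage parameter points at hdfs://localhost:8020/user/hive/warehouse/mine; since Hive reads a managed table's data from its Location, a mismatch like that would by itself be consistent with a query returning zero rows.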