Hi Ravi,

Can you please post the entire code?

Regards,
Gourav

On Fri, Feb 9, 2018 at 3:39 PM, Patrick Alwell <palw...@hortonworks.com>
wrote:

> Might sound silly, but are you using a Hive context?
>
> What errors do the Hive queries return?
>
>
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
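>
> A quick sanity check (a small sketch, assuming Spark 2.x defaults): the
> catalog implementation setting tells you whether the session is actually
> Hive-backed.
>
>     # prints "hive" when Hive support is enabled, "in-memory" otherwise
>     print(spark.conf.get("spark.sql.catalogImplementation"))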
>
>
>
> On the second part of your question: you are creating a temp view and then
> creating another table from that temp view. It doesn't seem like you are
> reading the table from the Spark or Hive warehouse.
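>
> For example, a minimal sketch of reading the persisted table back through
> the catalog (the table name is assumed from your later message):
>
>     # read the table Spark registered in the metastore, not the temp view
>     df = spark.table("sampledb.passion")
>     df.show(10)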
>
>
>
> This works fine for me, albeit I was using Spark Thrift to communicate
> with my directory of choice.
>
>
>
> from pyspark import SparkContext
> from pyspark.sql import SparkSession, Row, types
> from pyspark.sql.types import *
> from pyspark.sql import functions as f
> from decimal import *
> from datetime import datetime
>
> # instantiate our sparkSession and context
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> sc = spark.sparkContext
>
> # Generating customer orc table files
> # load raw data as an RDD
> customer_data = sc.textFile("/data/tpch/customer.tbl")
>
> # map the data into an RDD split on pipe delimiters
> customer_split = customer_data.map(lambda l: l.split("|"))
>
> # map the split data with the Row method; this is where we specify column
> # names and types (the default type is string, UTF-8)
> # there are issues with converting string to date; those have been addressed
> # in the tables with dates: see notes below
> # note: long() is Python 2 only; on Python 3 use int()
> customer_row = customer_split.map(lambda r: Row(
>     custkey=long(r[0]),
>     name=r[1],
>     address=r[2],
>     nationkey=long(r[3]),
>     phone=r[4],
>     acctbal=Decimal(r[5]),
>     mktsegment=r[6],
>     comment=r[7]
> ))
>
> # we can have Spark infer the schema, or apply a strict schema and say
> # whether or not we want null values; in this case we don't want null values
> # for keys, and we want explicit data types to match the TPC-H data model
> # note: DecimalType() defaults to precision 10, scale 0; widen it
> # (e.g. DecimalType(12, 2)) if the column needs to keep its cents
> customer_schema = types.StructType([
>     types.StructField('custkey', types.LongType(), False)
>     ,types.StructField('name', types.StringType())
>     ,types.StructField('address', types.StringType())
>     ,types.StructField('nationkey', types.LongType(), False)
>     ,types.StructField('phone', types.StringType())
>     ,types.StructField('acctbal', types.DecimalType())
>     ,types.StructField('mktsegment', types.StringType())
>     ,types.StructField('comment', types.StringType())])
>
> # create a dataframe object via the sparkSession's createDataFrame method,
> # which takes two arguments here (rows, schema)
> customer_df = spark.createDataFrame(customer_row, customer_schema)
>
> # we can now write an ORC file by referencing the dataframe we created
> customer_df.write.orc("/data/tpch/customer.orc")
>
> # read that same file back, but into a separate dataframe object
> customer_df_orc = spark.read.orc("/data/tpch/customer.orc")
>
> # reference the newly created dataframe and create a tempView for QA purposes
> customer_df_orc.createOrReplaceTempView("customer")
>
> # use the sparkSession's sql method to issue SQL statements against the temp view
> spark.sql("SELECT * FROM customer LIMIT 10").show()
>
>
>
> From: "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com>
> Date: Friday, February 9, 2018 at 7:03 AM
> To: "user @spark/'user @spark'/spark users/user@spark" <user@spark.apache.org>
> Subject: Re: Spark Dataframe and HIVE
>
>
>
> An update (sorry, I missed this):
>
>
>
> When I do
>
>
>
> passion_df.createOrReplaceTempView("sampleview")
>
>
>
> spark.sql("create table sample table as select * from sample view")
>
>
>
> Now I can see the table and can query it as well.
>
>
>
> So why does this work from Spark, while the other method discussed below
> does not?
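>
> For reference, the two paths I am comparing (a condensed sketch; the names
> are the ones from my tests):
>
>     # path 1: save straight to the metastore -- listed in Hive, but not
>     # selectable there
>     passion_df.write.format("orc").saveAsTable("sampledb.passion")
>
>     # path 2: temp view plus CTAS through Spark SQL -- selectable from both
>     passion_df.createOrReplaceTempView("sampleview")
>     spark.sql("create table sampletable as select * from sampleview")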
>
>
>
> Thanks
>
>
>
>
>
>
>
> On Fri, Feb 9, 2018 at 9:49 AM, ☼ R Nair (रविशंकर नायर) <
> ravishankar.n...@gmail.com> wrote:
>
> All,
>
>
>
> I have been on this issue continuously for three days now, and I am not
> getting any clue.
>
>
>
> Environment: Spark 2.2.x, all configurations are correct. hive-site.xml is
> in Spark's conf directory.
>
>
>
> 1) I created a data frame DF1 by reading a CSV file.
>
>
>
> 2) Did manipulations on DF1; the resulting frame is passion_df.
>
>
>
> 3) passion_df.write.format("orc").saveAsTable("sampledb.passion")
>
>
>
> 4) The metastore shows the Hive table: when I do "show tables" in Hive, I
> can see the table name.
>
>
>
> 5) I can't select from it in Hive, though I can select from Spark with
> spark.sql("select * from sampledb.passion")
>
>
>
> What's going on here? Please help. Why am I not seeing the data from the
> Hive prompt?
>
> The "describe formatted " command on the table in HIVE shows he data is is
> in default warehouse location ( /user/hive/warehouse) since I set it.
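>
> In case it helps anyone reproduce this, the same metadata can also be
> pulled from the Spark side (a sketch; the output columns may differ by
> version):
>
>     # dump the serde/provider/location Spark recorded for this table
>     spark.sql("DESCRIBE FORMATTED sampledb.passion").show(100, truncate=False)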
>
>
>
> I am not getting any definite answer anywhere. There are many suggestions
> and answers on Stack Overflow et al., but nothing really works.
>
>
>
> So I am asking the experts here to shed some light on this. Thanks.
>
>
>
> Best,
>
> Ravion
>
