Yes, this is feasible. You can use the Databricks spark-csv package to load CSV files from a staging directory. This is pretty standard:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("hdfs://xxxxxx:9000/data/stg/")

You can then create an ORC table from Spark:

// Specify the Hive DB name
sql("use accounts")
//
// Drop and create table ll_18740868. Prefix the table name with the Hive database name
//
sql("DROP TABLE IF EXISTS accounts.ll_18740868")
var sqltext: String = ""
sqltext = """
CREATE TABLE accounts.ll_18740868 (
   TransactionDate    DATE
  ,TransactionType    String
  ,SortCode           String
)
COMMENT 'from csv file from excel sheet'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
"""
sql(sqltext)
//
// Put data in the Hive table from a Spark temporary table, called say "tmp" here.
// Register the DataFrame loaded above under that name first.
//
df.registerTempTable("tmp")
sqltext = """
INSERT INTO TABLE accounts.ll_18740868
SELECT
   TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS TransactionDate
 , TransactionType
 , SortCode
FROM tmp
"""
sql(sqltext)

That is it. The above will create and populate an ORC table for you in the Hive database.

With regard to Parquet being the optimum file format for Spark, I am not sure; it depends on the use case. I personally prefer ORC files. Having said that, Parquet seems to be common in Spark. Personally, I would rather store the data as a Hive table for better manageability.

HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 31 March 2016 at 22:00, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:

> Hello,
>
> How feasible is it to use Spark to extract csv files and create and write
> the content to an ORC table in a Hive database?
>
> Is Parquet the best (optimum) file format to write to HDFS from a Spark app?
>
> Thanks
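P.S. On the Parquet part of the question: if a plain file on HDFS is preferred over a Hive table, the same DataFrame can be written straight out as Parquet via the standard DataFrameWriter API. A minimal sketch, assuming the Spark 1.x sqlContext setup above; the output path is illustrative only:

```scala
// Write the DataFrame loaded from the staging directory straight to
// Parquet files on HDFS (path is hypothetical).
df.write
  .format("parquet")
  .mode("overwrite")   // replace any output left by a previous run
  .save("hdfs://xxxxxx:9000/data/parquet/ll_18740868")

// Alternatively, persist it as a Hive-managed table in one step,
// picking ORC or Parquet via the format:
// df.write.format("orc").saveAsTable("accounts.ll_18740868")
```

Either way you keep columnar storage and predicate pushdown; the Hive-table route just adds metastore manageability on top.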