Importing csv files into Hive ORC target table

Mich Talebzadeh Thu, 18 Feb 2016 09:35:30 -0800

 

> thanks, 
> 
> I have an issue here. 
> 
> define rdd to read the CSV file 
> 
> scala> var csv = sc.textFile("/data/stg/table2")
> csv: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at textFile at 
> <console>:27 
> 
> I then get rid of the header 
> 
> scala> val csv2 = csv.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) 
> iter.drop(1) else iter }
> csv2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at 
> mapPartitionsWithIndex at <console>:29 
> 
> This is what I have now 
> 
> scala> csv.first 
> 
> res79: String = Invoice Number,Payment date,Net,VAT,Total 
> 
> SCALA> CSV2.FIRST 
> 
> RES80: STRING = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00" 
> 
> Then I define a class based on the columns 
> 
> scala> case class Invoice(invoice: String, date: String, net: String, vat: 
> String, total: String)
> defined class Invoice 
> 
> Next stage to map the data to their individual columns 
> 
> scala> val ttt = csv2.map(_.split(",")).map(p => 
> Invoice(p(0),p(1),p(2),p(3),p(4)))
> ttt: org.apache.spark.rdd.RDD[Invoice] = MapPartitionsRDD[74] at map at 
> <console>:33 
> 
> the problem now I have is that one column is missing 
> 
> SCALA> TTT.FIRST
> RES81: INVOICE = INVOICE(360,10/02/2014,"?2,500.00",?0.00) 
> 
> it seems that I am missing the last column here! 
> 
> I suspect the cause of the problem is the "," used in "?2,500.00" which is a 
> money column of "£250000" in excel. 
> 
> Any work around is appreciated. 
> 
> Thanks, 
> 
> Mich 
> 
> On 17/02/2016 22:58, Alex Dzhagriev wrote: 
> Hi Mich, 
> 
> You can use data frames 
> (http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes 
> [3]) to achieve that. 
> 
> val sqlContext = new HiveContext(sc) 
> 
> var rdd = sc.textFile("/data/stg/table2") 
> 
> //... 
> //perform you business logic, cleanups, etc. 
> //... 
> 
> sqlContext.createDataFrame(resultRdd).write.orc("..path..") 
> 
> Please, note that resultRdd should contain Products (e.g. case classes) 
> 
> Cheers, Alex. 
> 
> On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh 
> <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> 
> Hi, 
> 
> We put csv files that are zipped using bzip into a staging are on hdfs 
> 
> In Hive an external table is created as below: 
> 
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
> INVOICENUMBER string
> ,PAYMENTDATE string
> ,NET string
> ,VAT string
> ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/table2'
> TBLPROPERTIES ("skip.header.line.count"="1") 
> 
> We have an ORC table in Hive created as below: 
> 
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
> INVOICENUMBER INT
> ,PAYMENTDATE timestamp
> ,NET DECIMAL(20,2)
> ,VAT DECIMAL(20,2)
> ,TOTAL DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> ; 
> 
> Then we insert the data from the external table into target table do some 
> conversion and ignoring empty rows 
> 
> INSERT INTO TABLE t2
> SELECT
> INVOICENUMBER
> , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp)
> --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(net,'[^\d\.]','') AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(vat,'[^\d\.]','') AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(total,'[^\d\.]','') AS DECIMAL(20,2))
> FROM
> stg_t2 
> 
> This works OK for now. 
> 
> I was wondering whether this could be done using operations on rdd in Spark? 
> 
> var rdd = sc.textFile("/data/stg/table2") 
> 
> I can use rdd.count to see the total rows and rdd.collect.foreach(println) to 
> see the individual rows 
> 
> I would like to get some ideas on how I can do CAST conversion etc on the 
> data to clean it up and store it in the said ORC table? 
> 
> Thanks 
> 
> -- 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  [1]
> 
> http://talebzadehmich.wordpress.com [2]
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.


-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 

Links:
------
[1]
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
[2] http://talebzadehmich.wordpress.com
[3]
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes

Importing csv files into Hive ORC target table

Reply via email to