Re: Best way to load CSV file into Hive

2015-11-01 Thread Furcy Pin
Hi Vijaya,

If you need some nice ETL capabilities, you may want to try
https://github.com/databricks/spark-csv

Among other things, spark-csv lets you read the CSV as-is and then insert a
copy of the data into a Hive table in any format you like (Parquet, ORC, etc.).

If the file has a header row, it can strip it and use it to get the column
names directly, and it can also perform automatic type detection.
You can specify the delimiter and the quote character, but I did not see an
escape character mentioned in the doc.

In the end, it's as easy as:

import org.apache.spark.sql.DataFrame

val df: DataFrame = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of each file as the header
  .option("delimiter", ",")      // field delimiter
  .option("quote", "\"")         // quote character
  .option("inferSchema", "true") // automatically infer column types
  .load("path/to/data.csv")

df.write
  .format("orc")
  .saveAsTable("db_name.table_name")
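
If your version of spark-csv does support an escape character (I could not
confirm the option name in the doc, so treat this as an untested assumption on
my part), it should just be one more .option call:

val dfEscaped = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\\")        // assumed option name; check the spark-csv version you use
  .option("inferSchema", "true")
  .load("path/to/data.csv")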


I believe HDP now supports Spark.




On Sat, Oct 31, 2015 at 10:30 PM, Jörn Franke  wrote:

> You clearly need to escape those characters, as with any other tool. You may
> want to use Avro instead of CSV, XML, JSON, etc.
>
> On 30 Oct 2015, at 19:16, Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomire...@whishworks.com> wrote:
>
> Hi,
>
> I have a CSV file which contains a hundred thousand rows and about 200+
> columns. Some of the columns hold free-text information, which means they
> might contain characters like commas, colons, quotes, etc. within the
> column content.
>
> What is the best way to load such a CSV file into Hive?
>
> Another serious issue: I stored the file in a location in HDFS and then
> created an external Hive table on it. However, after running CREATE
> EXTERNAL TABLE through the HDP Hive View, the original CSV is no longer
> present in the folder where it is meant to be. I am not sure how HDP
> processes it or where it stores it. My understanding was that the data for
> an EXTERNAL table wouldn't be moved from its original HDFS location?
>
> Request someone to help out!
>
>
> Thanks & Regards
> Vijay


Re: Best way to load CSV file into Hive

2015-10-31 Thread Jörn Franke
You clearly need to escape those characters, as with any other tool. You may want
to use Avro instead of CSV, XML, JSON, etc.
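
A rough, untested sketch of the Avro route with Spark, assuming the databricks
spark-csv and spark-avro packages are on the classpath (paths and names below
are placeholders):

// assumes spark-shell/spark-submit was started with --packages for
// com.databricks:spark-csv and com.databricks:spark-avro (versions matching your Spark build)
val raw = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\"")
  .load("hdfs:///staging/data.csv")      // placeholder path

raw.write
  .format("com.databricks.spark.avro")
  .save("hdfs:///warehouse/data_avro")   // placeholder path; point a Hive Avro table at this directory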

> On 30 Oct 2015, at 19:16, Vijaya Narayana Reddy Bhoomi Reddy 
>  wrote:
> 
> Hi,
> 
> I have a CSV file which contains a hundred thousand rows and about 200+
> columns. Some of the columns hold free-text information, which means they might
> contain characters like commas, colons, quotes, etc. within the column content.
>
> What is the best way to load such a CSV file into Hive?
>
> Another serious issue: I stored the file in a location in HDFS and then
> created an external Hive table on it. However, after running CREATE EXTERNAL
> TABLE through the HDP Hive View, the original CSV is no longer present in the
> folder where it is meant to be. I am not sure how HDP processes it or where it
> stores it. My understanding was that the data for an EXTERNAL table wouldn't be
> moved from its original HDFS location?
> 
> Request someone to help out!
> 
> 
> Thanks & Regards
> Vijay


Re: Best way to load CSV file into Hive

2015-10-30 Thread Martin Menzel
Hi,
Do you have access to the data source?
If not, you first have to find out whether the data can be mapped to the
columns in a unique way for all rows. If so, maybe Bindy could be an option
to convert the data to TSV as a first step.
I hope this helps.
Regards
Martin
On 30 Oct 2015 at 19:16, "Vijaya Narayana Reddy Bhoomi Reddy" <
vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> I have a CSV file which contains a hundred thousand rows and about 200+
> columns. Some of the columns hold free-text information, which means they
> might contain characters like commas, colons, quotes, etc. within the
> column content.
>
> What is the best way to load such a CSV file into Hive?
>
> Another serious issue: I stored the file in a location in HDFS and then
> created an external Hive table on it. However, after running CREATE
> EXTERNAL TABLE through the HDP Hive View, the original CSV is no longer
> present in the folder where it is meant to be. I am not sure how HDP
> processes it or where it stores it. My understanding was that the data for
> an EXTERNAL table wouldn't be moved from its original HDFS location?
>
> Request someone to help out!
>
>
> Thanks & Regards
> Vijay


Re: Best way to load CSV file into Hive

2015-10-30 Thread Daniel Lopes
Hello,

If you have a file with different types of data, it's preferable to use another
file format such as TSV, ORC, or Parquet.
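
For the Hive side, one common pattern along those lines is to stage the raw CSV
behind an external table and copy it once into an ORC (or Parquet) table. A
minimal sketch using a Spark HiveContext, with hypothetical table names, columns
and paths:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)  // sc: an existing SparkContext

// Staging table over the raw CSV already sitting in HDFS.
// OpenCSVSerde handles quoted fields, but exposes every column as STRING.
hc.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS staging_csv (id STRING, free_text STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = '"', "escapeChar" = "\\")
  LOCATION 'hdfs:///staging/csv/'
""")

// Copy once into an ORC-backed table for actual querying.
hc.sql("CREATE TABLE IF NOT EXISTS events_orc (id STRING, free_text STRING) STORED AS ORC")
hc.sql("INSERT OVERWRITE TABLE events_orc SELECT * FROM staging_csv")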

Best,

On Fri, Oct 30, 2015 at 4:16 PM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> I have a CSV file which contains a hundred thousand rows and about 200+
> columns. Some of the columns hold free-text information, which means they
> might contain characters like commas, colons, quotes, etc. within the
> column content.
>
> What is the best way to load such a CSV file into Hive?
>
> Another serious issue: I stored the file in a location in HDFS and then
> created an external Hive table on it. However, after running CREATE
> EXTERNAL TABLE through the HDP Hive View, the original CSV is no longer
> present in the folder where it is meant to be. I am not sure how HDP
> processes it or where it stores it. My understanding was that the data for
> an EXTERNAL table wouldn't be moved from its original HDFS location?
>
> Request someone to help out!
>
>
> Thanks & Regards
> Vijay




-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560

Mob +55 (18) 99764-2733 
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br