Sqoop and rejected rows in export process

2015-02-13 Thread Matouk IFTISSEN
Hello every one,
Is there a way in Sqoop to catch (and manage) rows rejected during an export
(duplicate keys, data type mismatches, etc.)? I tested the staging-table option,
but that is not a good way to manage exported data in relational databases.

Thanks in advance ;)

-- 
---
Life and Relations are not binary

Matouk IFTISSEN | Consultant BI & Big Data
24 rue du sentier - 75002 Paris - www.ysance.com
Fax : +33 1 73 72 97 26


Re: CSV file reading in hive

2015-02-13 Thread Furcy Pin
Hi Sreeman,

Unfortunately, I don't think Hive's built-in formats can currently read
csv files with fields enclosed in double quotes.
More generally, having ingested quite a lot of messy csv files myself,
I would recommend writing a MapReduce (or Spark) job
to clean your csv before giving it to Hive. This is what I did.
The other kinds of issues I've run into include:

   - File not encoded in utf-8, making special characters unreadable for
   Hive
   - Some lines with missing or too many columns, which could shift your
   columns and ruin your stats.
   - Some lines with unreadable characters (probably data corruption)
   - I even got some lines with java stack traces in it

I hope your csv is cleaner than that. If you have control over how it is
generated, I would recommend replacing your current separator with a tab
(and escaping inline tabs as \t), or something like that.
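
As a rough illustration of the kind of cleaning job I mean, here is a minimal
map-only sketch (untested; the class name, the 4-column rule and the counter
name are made up for the example, and a real job would also need proper
handling of quoted fields, which a plain split on ',' does not give you):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only cleaning job: keep only rows with the expected number
// of columns and rewrite the separator from ',' to tab, escaping inline tabs.
public class CsvCleanerJob {

  public static class CleanMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final int EXPECTED_COLUMNS = 4;  // adjust to your schema
    private final Text out = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // limit -1 keeps trailing empty fields so the column count is accurate
      String[] fields = value.toString().split(",", -1);
      if (fields.length != EXPECTED_COLUMNS) {
        context.getCounter("csv_cleaning", "rejected_rows").increment(1);
        return;  // drop malformed rows instead of shifting columns
      }
      for (int i = 0; i < fields.length; i++) {
        fields[i] = fields[i].replace("\t", "\\t");  // escape inline tabs
      }
      out.set(String.join("\t", fields));
      context.write(NullWritable.get(), out);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "csv-cleaner");
    job.setJarByClass(CsvCleanerJob.class);
    job.setMapperClass(CleanMapper.class);
    job.setNumReduceTasks(0);  // map-only; TextOutputFormat writes only the value for NullWritable keys
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The cleaned, tab-separated output can then be loaded into a table declared
with FIELDS TERMINATED BY '\t'.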

There might be some open source tools for data cleaning already out there.
I plan to release mine one day, once I've migrated it to Spark maybe, and
if my company agrees.

If you're lazy, I heard that Dataiku Studio (which has a free version) can
do that sort of thing, though I never used it myself.

Hope this helps,

Furcy



2015-02-13 7:30 GMT+01:00 Slava Markeyev slava.marke...@upsight.com:

 You can use lazy simple serde with ROW FORMAT DELIMITED FIELDS TERMINATED
 BY ',' ESCAPED BY '\'. Check the DDL for details
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



 On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote:

  Hi All,

 How all of you are creating hive/Impala table when the CSV file has some
 values with COMMA in between. it is like

 sree,12345,payment made,but it is not successful





 I know opencsv serde is there but it is not available in lower versions
 of Hive 14.0






 --

 Slava Markeyev | Engineering | Upsight
 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev
 http://www.linkedin.com/in/slavamarkeyev



HIVE Custom InputFormat for Sequence Files

2015-02-13 Thread Varsha Raveendran
Hello!

I have small csv files which are moved to HDFS using the SequenceFile
format. The key is the file name and the contents of the file become the
value.

Now I want to create an external table on these csv files using Hive, but
when I do, I get only the first row of each csv file.

For example,

Assume the csv files contain three columns - Col1, Col2, Col3 and I have 3
CSV files - File1, File2, File3.

File 1
10,20,30
40,50,60,
70,80,90

File2
100,110,120
130,140,150
160,170,180

File3
200,210,220
230,240,250
260,270,280


A sequence file is created -
File1 Contents of  File1
File2 Contents of  File2
File3 Contents of  File3

Now when I create an external table stored as SEQUENCEFILE and run a SELECT ALL
query in Hive, I get the following result:
10    20    30
100   110   120
200   210   220

I am aware that I need to write a custom InputFormat, custom RecordReader
and custom SerDe. Also, a sequence file treats one key-value pair as one
row.
I don't understand how to split one row (corresponding to one value) of a
sequence file into multiple rows in a Hive table.
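
For illustration, here is a rough, untested sketch of what such a wrapping
InputFormat might look like (the class names are made up, and it assumes both
the key and the value in the sequence file are Text):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Hypothetical wrapper: reads each (fileName, fileContents) pair from the
// sequence file and emits one record per line, so Hive sees every CSV row
// instead of one row per file.
public class LineSplittingSequenceFileInputFormat extends SequenceFileInputFormat<Text, Text> {

  @Override
  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException {
    return new LineSplittingRecordReader(super.getRecordReader(split, job, reporter));
  }

  public static class LineSplittingRecordReader implements RecordReader<Text, Text> {
    private final RecordReader<Text, Text> wrapped;
    private String[] lines = new String[0];
    private int index = 0;

    public LineSplittingRecordReader(RecordReader<Text, Text> wrapped) {
      this.wrapped = wrapped;
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
      // When the current file's lines are exhausted, fetch the next (file, contents) pair.
      while (index >= lines.length) {
        Text k = wrapped.createKey();
        Text v = wrapped.createValue();
        if (!wrapped.next(k, v)) {
          return false;  // no more files in this split
        }
        key.set(k);
        lines = v.toString().split("\n");
        index = 0;
      }
      value.set(lines[index++]);  // emit one CSV line as one Hive row
      return true;
    }

    @Override public Text createKey()   { return new Text(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() throws IOException { return wrapped.getPos(); }
    @Override public float getProgress() throws IOException { return wrapped.getProgress(); }
    @Override public void close() throws IOException { wrapped.close(); }
  }
}

If each line comes out as one record like this, it may be enough to declare
the external table with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' and
STORED AS INPUTFORMAT pointing at the custom class (with the usual
HiveIgnoreKeyTextOutputFormat as output format), without a custom SerDe, but
I have not verified this.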

Any suggestions on how to go about this?

Regards,
VR


Re: writing to partitions with HCatWriter

2015-02-13 Thread Alan Gates
This sounds like a bug in the HCatWriter.  You should file a JIRA so we 
can track it.


Alan.


Nathan Bamford nathan.bamf...@redpoint.net
February 13, 2015 at 13:50

Hi all,

  I'm using HCatWriter in a java program to write records to a 
partitioned Hive table. It works great, but I notice it leaves behind 
the _SCRATCH directories it uses for staging (before HCatWriter.commit 
is called).


  When it's all said and done, the partitioned records are in the 
appropriate directory (e.g. state=CO), and the _SCRATCH directories 
are empty.


  I tried running a load of the same records/partition values via the 
CLI, and after the mapreduce job has finished, the _SCRATCH 
directories are cleaned up. Only the finished partition dirs remain.


  Is there something I'm  missing with HCatWriter?


Thanks,


Nathan




Re: CSV file reading in hive

2015-02-13 Thread Alexander Pivovarov
hive csv serde is available for all hive versions

https://github.com/ogrodnek/csv-serde


DEFAULT_ESCAPE_CHARACTER  \
DEFAULT_QUOTE_CHARACTER   "
DEFAULT_SEPARATOR         ,


add jar path/to/csv-serde.jar;   (or put it on the hive/hadoop/mr
classpath on all boxes in the cluster)

-- you can use custom separator/quote/escape

create table my_table(a string, b string, ...)
 row format serde 'com.bizo.hive.serde.csv.CSVSerde'
 with serdeproperties (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
  )
 stored as textfile
;



On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote:

  Hi All,

 How all of you are creating hive/Impala table when the CSV file has some
 values with COMMA in between. it is like

 sree,12345,payment made,but it is not successful





 I know opencsv serde is there but it is not available in lower versions of
 Hive 14.0





writing to partitions with HCatWriter

2015-02-13 Thread Nathan Bamford
Hi all,

  I'm using HCatWriter in a java program to write records to a partitioned Hive 
table. It works great, but I notice it leaves behind the _SCRATCH directories 
it uses for staging (before HCatWriter.commit is called).

  When it's all said and done, the partitioned records are in the appropriate 
directory (e.g. state=CO), and the _SCRATCH directories are empty.

  I tried running a load of the same records/partition values via the CLI, and 
after the mapreduce job has finished, the _SCRATCH directories are cleaned up. 
Only the finished partition dirs remain.

  Is there something I'm  missing with HCatWriter?


Thanks,


Nathan



Re: CSV file reading in hive

2015-02-13 Thread sreebalineni .
Hi Furcy,
That's a lot of information. Thanks a lot.
On Feb 13, 2015 3:40 PM, Furcy Pin furcy@flaminem.com wrote:

 Hi Sreeman,

 Unfortunately, I don't think that Hive built-in format can currently read
 csv files with fields enclosed in double quotes.
 More generally, for having ingested quite a lot of messy csv files myself,
 I would recommend you to write a MapReduce (or Spark) job
 for cleaning your csv before giving it to Hive. This is what I did.
 The (other) kind of issue I've met were among :

- File not encoded in utf-8, making special characters unreadable for
Hive
- Some lines with missing or too many columns, which could shift your
columns and ruin your stats.
- Some lines with unreadable characters (probably data corruption)
- I even got some lines with java stack traces in it

 I hope your csv is cleaner than that, and would recommend that if you have
 the control on how it is generated, replace your current separator with tab
 (and replace inline tabs with \t) or something like that.

 There might be some open source tools for data cleaning already out there.
 I plan to release mine one day, once I've migrated it to Spark maybe, and
 if my company agrees.

 If you're lazy, I heard that Dataiku Studio (which has a free version) can
 do such thing, though I never used it myself.

 Hope this helps,

 Furcy



 2015-02-13 7:30 GMT+01:00 Slava Markeyev slava.marke...@upsight.com:

 You can use lazy simple serde with ROW FORMAT DELIMITED FIELDS TERMINATED
 BY ',' ESCAPED BY '\'. Check the DDL for details
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



 On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote:

  Hi All,

 How all of you are creating hive/Impala table when the CSV file has some
 values with COMMA in between. it is like

 sree,12345,payment made,but it is not successful





 I know opencsv serde is there but it is not available in lower versions
 of Hive 14.0






 --

 Slava Markeyev | Engineering | Upsight
 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev
 http://www.linkedin.com/in/slavamarkeyev