Sqoop and rejected rows in export process
Hello everyone, I am looking for a way in Sqoop to catch (and manage) rows that are rejected during an export (duplicate keys, data type mismatches, etc.). I tried staging tables, but that is not a good way to manage exported data in a relational database. Thanks in advance. -- Matouk IFTISSEN | Consultant BI Big Data, Ysance (www.ysance.com)
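For context, the staging-table approach mentioned above looks roughly like the invocation below (the connection string, username, and table names are hypothetical). Sqoop inserts into the staging table first and only moves rows into the target table if the whole export succeeds, so it gives all-or-nothing protection rather than per-row rejection handling, which may be why it felt unsatisfying here:

    sqoop export \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user \
      --table orders \
      --staging-table orders_staging \
      --clear-staging-table \
      --export-dir /user/hive/warehouse/orders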
Re: CSV file reading in hive
Hi Sreeman, Unfortunately, I don't think Hive's built-in formats can currently read CSV files with fields enclosed in double quotes. More generally, having ingested quite a lot of messy CSV files myself, I would recommend writing a MapReduce (or Spark) job to clean your CSV before giving it to Hive. This is what I did. The other kinds of issues I've met include:
- files not encoded in UTF-8, making special characters unreadable for Hive
- lines with missing or extra columns, which can shift your columns and ruin your stats
- lines with unreadable characters (probably data corruption)
- I even got some lines with Java stack traces in them

I hope your CSV is cleaner than that. If you have control over how it is generated, I would recommend replacing your current separator with a tab (and replacing inline tabs with \t), or something like that; a sketch of such a cleaning job follows below. There might be some open source tools for data cleaning already out there. I plan to release mine one day, once I've migrated it to Spark maybe, and if my company agrees. If you're lazy, I heard that Dataiku Studio (which has a free version) can do this kind of thing, though I never used it myself. Hope this helps, Furcy

2015-02-13 7:30 GMT+01:00, Slava Markeyev slava.marke...@upsight.com: You can use LazySimpleSerDe with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\'. Check the DDL documentation for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote: Hi All, how are all of you creating Hive/Impala tables when the CSV file has values with commas embedded in them? It is like: sree,12345,payment made,but it is not successful (where the third field contains a comma). I know the OpenCSV serde exists, but it is not available in versions of Hive lower than 0.14.0.

-- Slava Markeyev | Engineering | Upsight | LinkedIn: http://www.linkedin.com/in/slavamarkeyev
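A minimal sketch of the kind of pre-Hive cleaning job described above, assuming tab as the target separator and a fixed expected column count; the class name and schema width are hypothetical, not from the original thread:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CsvCleanerMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      private static final int EXPECTED_COLUMNS = 3; // hypothetical schema width

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Drop lines whose column count would shift the schema in Hive,
        // counting them so bad rows can be audited afterwards.
        String[] fields = value.toString().split(",", -1);
        if (fields.length != EXPECTED_COLUMNS) {
          context.getCounter("clean", "bad_column_count").increment(1);
          return;
        }
        // Escape inline tabs so tab can be used safely as the new separator.
        for (int i = 0; i < fields.length; i++) {
          fields[i] = fields[i].replace("\t", "\\t");
        }
        context.write(NullWritable.get(), new Text(String.join("\t", fields)));
      }
    }

A map-only job built around this mapper (set the number of reducers to zero) emits clean tab-separated lines that a plain LazySimpleSerDe table can read directly.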
HIVE Custom InputFormat for Sequence Files
Hello! I have CSV files which are small in size and are moved to HDFS using the SequenceFile format. The key is the file name and the contents of the file become the value. Now I want to create an external table on these CSV files using Hive, but when I do, I get only the first row of each CSV file. For example, assume the CSV files contain three columns (Col1, Col2, Col3) and I have 3 CSV files: File1, File2, File3.

File1:
10,20,30
40,50,60
70,80,90

File2:
100,110,120
130,140,150
160,170,180

File3:
200,210,220
230,240,250
260,270,280

A sequence file is created with the pairs (File1, contents of File1), (File2, contents of File2), (File3, contents of File3). Now when I create an external table STORED AS SEQUENCEFILE and run a SELECT * query in Hive, I get only:

10    20    30
100   110   120
200   210   220

I am aware that I need to write a custom InputFormat, a custom RecordReader, and a custom SerDe. Also, a sequence file treats one key-value pair as one row. I don't understand how to split one row (corresponding to one value) of a sequence file into multiple rows in a Hive table. Any suggestions on how to go about this? Regards, VR
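One way to attack the splitting problem is at the RecordReader layer: wrap the stock SequenceFileRecordReader and emit one record per line of each value, so downstream Hive sees ordinary delimited rows. A minimal sketch, assuming both keys and values are Text (file name / file contents) as described above; the class name is hypothetical, and the matching InputFormat (whose getRecordReader returns this reader) is omitted:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.SequenceFileRecordReader;

    public class LinePerValueRecordReader
        implements RecordReader<LongWritable, Text> {

      private final SequenceFileRecordReader<Text, Text> reader;
      private final Text key = new Text();     // file name (assumed Text)
      private final Text value = new Text();   // file contents (assumed Text)
      private String[] pendingLines = new String[0];
      private int nextLine = 0;
      private long rowsEmitted = 0;

      public LinePerValueRecordReader(Configuration conf, FileSplit split)
          throws IOException {
        reader = new SequenceFileRecordReader<Text, Text>(conf, split);
      }

      @Override
      public boolean next(LongWritable outKey, Text outValue) throws IOException {
        // Refill the line buffer from the next key-value pair when exhausted.
        while (nextLine >= pendingLines.length) {
          if (!reader.next(key, value)) {
            return false; // no more pairs in this split
          }
          pendingLines = value.toString().split("\n");
          nextLine = 0;
        }
        outKey.set(rowsEmitted++);
        outValue.set(pendingLines[nextLine++]);
        return true;
      }

      @Override public LongWritable createKey() { return new LongWritable(); }
      @Override public Text createValue() { return new Text(); }
      @Override public long getPos() throws IOException { return reader.getPos(); }
      @Override public float getProgress() throws IOException { return reader.getProgress(); }
      @Override public void close() throws IOException { reader.close(); }
    }

With the value stream turned into one line per record this way, the table can keep a standard delimited SerDe for the columns instead of needing a custom one.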
Re: writing to partitions with HCatWriter
This sounds like a bug in HCatWriter. You should file a JIRA so we can track it. Alan. On February 13, 2015 at 13:50, Nathan Bamford nathan.bamf...@redpoint.net wrote: (original message reproduced in full below)
Re: CSV file reading in hive
A Hive CSV serde is available for all Hive versions: https://github.com/ogrodnek/csv-serde. Its defaults are:

  DEFAULT_ESCAPE_CHARACTER  \
  DEFAULT_QUOTE_CHARACTER   "
  DEFAULT_SEPARATOR         ,

add jar path/to/csv-serde.jar; (or put it on the Hive/Hadoop/MR classpath on all boxes in the cluster)

You can also use a custom separator/quote/escape:

  create table my_table(a string, b string, ...)
  row format serde 'com.bizo.hive.serde.csv.CSVSerde'
  with serdeproperties (
    "separatorChar" = "\t",
    "quoteChar"     = "'",
    "escapeChar"    = "\\"
  )
  stored as textfile;

On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote: (original question quoted above)
writing to partitions with HCatWriter
Hi all, I'm using HCatWriter in a Java program to write records to a partitioned Hive table. It works great, but I notice it leaves behind the _SCRATCH directories it uses for staging (before HCatWriter.commit is called). When it's all said and done, the partitioned records are in the appropriate directory (e.g. state=CO), and the _SCRATCH directories are empty. I tried loading the same records/partition values via the CLI, and after the MapReduce job finished, the _SCRATCH directories were cleaned up; only the finished partition dirs remained. Is there something I'm missing with HCatWriter? Thanks, Nathan
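For readers unfamiliar with the API, the write path being described is roughly the following (a minimal sketch of the HCatalog data-transfer API; the database, table, and partition values are hypothetical). The staging into _SCRATCH directories happens during the worker-side write; commit is what publishes the partition:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hive.hcatalog.data.transfer.HCatWriter;
    import org.apache.hive.hcatalog.data.transfer.WriteEntity;
    import org.apache.hive.hcatalog.data.transfer.WriterContext;

    public class PartitionedWrite {
      public static void write(Map<String, String> config,
                               Iterator<HCatRecord> records) throws Exception {
        Map<String, String> partition = new HashMap<>();
        partition.put("state", "CO"); // hypothetical partition value

        WriteEntity entity = new WriteEntity.Builder()
            .withDatabase("default")      // hypothetical database
            .withTable("my_table")        // hypothetical table
            .withPartition(partition)
            .build();

        // Master side: prepare the write and obtain a serializable context.
        HCatWriter master = DataTransferFactory.getHCatWriter(entity, config);
        WriterContext context = master.prepareWrite();

        // Worker side: write records (this is where _SCRATCH staging happens).
        HCatWriter worker = DataTransferFactory.getHCatWriter(context);
        worker.write(records);

        // Master side: commit moves staged data into the partition directory.
        master.commit(context);
      }
    }

If the _SCRATCH directories are empty after a successful commit, the leftover appears to be only the directory entries themselves not being deleted, consistent with the bug diagnosis above.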
Re: CSV file reading in hive
Hi Furcy, that's a lot of information. Thanks a lot! On Feb 13, 2015 3:40 PM, Furcy Pin furcy@flaminem.com wrote: (full message quoted above)