[ 
https://issues.apache.org/jira/browse/PHOENIX-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Violette updated PHOENIX-53:
----------------------------------

    Comment: was deleted

(was: We found that the apache commons-csv library 
(http://commons.apache.org/proper/commons-csv/) has the features we need to 
handle different csv formats. Even though it is not yet released, this library 
is being actively developed (last commit 1/2014) and it works well in SNAPSHOT 
mode. By contrast, the opencsv project 
(http://sourceforge.net/projects/opencsv/files/opencsv/) has been dormant since 
2011, long enough to force a fork (http://code.google.com/p/opencsv/).

We found that the opencsv loader accepted malformed records containing 
encapsulated meta-characters and then mis-parsed them, which resulted in a 50% 
load success rate and significant data corruption.  By comparison, the apache 
commons-csv parser threw an exception when the csv format was not followed. 
That exception allowed us to isolate the issue and also prevented subsequent 
corrupt data records from being loaded.

We used the commons-csv source from this repo:
http://svn.apache.org/repos/asf/commons/proper/csv/trunk/

We have created a CSVCommonsLoader that uses this parser. The attached patch 
replaces the current CSVLoader in PhoenixRuntime. If required, someone can 
update the command line parameters so that the parsers can be swapped. The 
associated CSVCommonsLoaderTests run the regression tests against the supplied 
test data, plus two new tests covering the encapsulated characters.

)
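
The behaviour difference described in the deleted comment above (silently 
accepting a badly encapsulated record versus throwing an exception) can be 
illustrated with a minimal sketch. This is not the commons-csv or opencsv 
API; StrictCsvLine and parse are hypothetical names, and the comma delimiter 
and double-quote encapsulator are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical strict parser for one CSV record; not a real library API. */
public class StrictCsvLine {

    /** Splits a record on commas and rejects malformed quoting instead of guessing. */
    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;   // currently inside an encapsulated field
        boolean wasQuoted = false;  // an encapsulated field has just been closed
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"');   // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                        wasQuoted = true;
                    }
                } else {
                    cur.append(c);         // delimiters are literal inside quotes
                }
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
                wasQuoted = false;
            } else if (wasQuoted) {
                throw new IllegalArgumentException(
                        "Unexpected character after closing quote at index " + i);
            } else if (c == '"') {
                if (cur.length() > 0) {
                    throw new IllegalArgumentException(
                            "Quote inside unquoted field at index " + i);
                }
                inQuotes = true;           // opening quote
            } else {
                cur.append(c);
            }
            i++;
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unterminated quoted field");
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

Failing fast in this way means a bad record surfaces as an exception that 
names the position of the problem, rather than as shifted or merged columns 
in the loaded data.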

> Replace CSV loader with Apache Commons CSV loader
> -------------------------------------------------
>
>                 Key: PHOENIX-53
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-53
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 2.2.3, 3.0.0
>            Reporter: James Violette
>              Labels: patch
>             Fix For: 2.2.3, 3.0.0
>
>         Attachments: commons-csv-1.0-SNAPSHOT-sources.jar, 
> commons-csv-1.0-SNAPSHOT.jar, incubator-phoenix-commons-csv-rev2-3.0.0.patch
>
>
> In org.apache.phoenix.util.CSVLoader, the upsert fails if it encounters an 
> empty line.  This occurs when every line ends with the newline character and 
> the reader returns an empty line at the end. Other issues, such as mishandled 
> encapsulated meta-characters, also occur.
> The fix is to replace the opencsv library with the current apache commons-csv 
> library.
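
The empty-line failure described above can be sketched as follows. This is 
not the code from CSVLoader or from the attached patch; BlankLineFilter and 
nonEmptyLines are hypothetical names, and the sketch assumes the input is 
split on newlines in a way that keeps the trailing empty line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing how a trailing empty line arises and is skipped. */
public class BlankLineFilter {

    /** Returns the lines of the text, dropping any that are empty or whitespace-only. */
    public static List<String> nonEmptyLines(String text) {
        List<String> records = new ArrayList<>();
        // A limit of -1 keeps trailing empty strings, so text ending in a
        // newline produces one final empty "line", as in the report above.
        for (String line : text.split("\r?\n", -1)) {
            if (line.trim().isEmpty()) {
                continue; // skip instead of attempting an upsert with no values
            }
            records.add(line);
        }
        return records;
    }
}
```

For example, nonEmptyLines("a,b\nc,d\n") yields the two records "a,b" and 
"c,d"; without the check, a third, empty record would reach the upsert.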



--
This message was sent by Atlassian JIRA
(v6.2#6252)
