[
https://issues.apache.org/jira/browse/PHOENIX-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Violette updated PHOENIX-53:
----------------------------------
Comment: was deleted
(was: We found that the Apache Commons CSV library
(http://commons.apache.org/proper/commons-csv/) has the features we need to
handle different CSV formats. Although it has not yet been released, the
library is under active development (last commit 1/2014) and its SNAPSHOT
build works well. By contrast, the opencsv project
(http://sourceforge.net/projects/opencsv/files/opencsv/) has been dormant since
2011, long enough to prompt a fork (http://code.google.com/p/opencsv/).
We found that the opencsv loader accepted records with badly encapsulated
metacharacters instead of rejecting them and then parsed them incorrectly,
which resulted in a 50% load success rate and significant data corruption. By
comparison, the Apache Commons CSV parser threw an exception when the CSV
format was not followed. That exception allowed us to isolate the issue and
prevented further corrupt records from being loaded.
We used the commons-csv source from this repo:
http://svn.apache.org/repos/asf/commons/proper/csv/trunk/
We have created a CSVCommonsLoader that uses this parser. The attached patch
replaces the current CSVLoader in PhoenixRuntime. If required, someone can
update the command line parameters to allow switching between the parsers. The
associated CSVCommonsLoaderTests run the regression tests against the supplied
test data and add two new tests that cover encapsulated characters.
)
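The difference between a lenient and a strict parser described in the comment
above can be illustrated with a small standalone sketch. This is not the
commons-csv or opencsv implementation; the class name StrictCsvLine and its
parse method are hypothetical, and the rules (comma separator, double-quote
encapsulation, doubled quote as escape) are assumptions chosen for the
example. The point is only that a malformed encapsulated field raises an
exception instead of silently producing a wrong record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the commons-csv or opencsv code.
class StrictCsvLine {
    // Splits one line into fields, throwing on malformed quoting.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int i = 0;
        int n = line.length();
        while (true) {
            field.setLength(0);
            if (i < n && line.charAt(i) == '"') {
                i++; // consume the opening quote
                boolean closed = false;
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // consume the closing quote
                        closed = true;
                        break;
                    } else {
                        field.append(c);
                        i++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("unterminated quoted field");
                }
                if (i < n && line.charAt(i) != ',') {
                    throw new IllegalArgumentException(
                        "unexpected character after closing quote at index " + i);
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    char c = line.charAt(i);
                    if (c == '"') {
                        throw new IllegalArgumentException(
                            "quote inside unquoted field at index " + i);
                    }
                    field.append(c);
                    i++;
                }
            }
            fields.add(field.toString());
            if (i >= n) {
                break;
            }
            i++; // consume the separator
        }
        return fields;
    }
}
```

A loader built on a parser like this can catch the exception, report the
offending line, and stop before writing a corrupt row.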
> Replace CSV loader with Apache Commons CSV loader
> -------------------------------------------------
>
> Key: PHOENIX-53
> URL: https://issues.apache.org/jira/browse/PHOENIX-53
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 2.2.3, 3.0.0
> Reporter: James Violette
> Labels: patch
> Fix For: 2.2.3, 3.0.0
>
> Attachments: commons-csv-1.0-SNAPSHOT-sources.jar,
> commons-csv-1.0-SNAPSHOT.jar, incubator-phoenix-commons-csv-rev2-3.0.0.patch
>
>
> In org.apache.phoenix.util.CSVLoader, the upsert fails if it encounters an
> empty line. This occurs if all lines end with the newline character and the
> reader returns an empty line at the end. Other issues, such as mishandled
> encapsulated metacharacters, also occur.
> The fix is to replace the opencsv library with the current Apache Commons CSV
> library.
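
The empty-line failure described above can be sketched in a few lines. This is
a minimal standalone illustration, not the Phoenix CSVLoader or the attached
patch: the class name BlankLineDemo and the method readRecords are
hypothetical, and the naive comma split (no quoting support) is an assumption
made to keep the example short. It shows a reader handing back an empty line
at the end of the input and the loader skipping it rather than attempting an
upsert.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Phoenix or commons-csv.
class BlankLineDemo {
    // Returns one String[] per non-blank line, split naively on commas.
    static List<String[]> readRecords(String csv) {
        List<String[]> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines instead of treating them as records
                }
                records.add(line.split(",", -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // The input ends with a blank line, which the reader returns as "".
        List<String[]> records = readRecords("1,alice\n2,bob\n\n");
        System.out.println(records.size());
    }
}
```

Without the blank-line check, the final empty string would be passed on as a
record with a single empty field, which is the kind of input the description
says makes the upsert fail.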