[ https://issues.apache.org/jira/browse/PHOENIX-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936038#comment-13936038 ]
Gabriel Reid commented on PHOENIX-129:
--------------------------------------
Thanks for the feedback [~ndimiduk] and [~jamestaylor].
{quote}One question: did you try it on a real cluster as well? {quote}
I haven't run it on a real cluster yet, as that'll require a bit more work with
the assembly, and I wanted to validate the general approach here first.
I'll put together the assembly and do some more testing with it, as well as
making the other changes suggested.
{quote}I think the changes you made make sense in terms of config
options.{quote}
Just to clarify as well, I would actually consider completely removing the
Python script for running the mapred import, as I feel it will get in the way
more than it helps if we're running the bulk import with the "hadoop jar"
command. Does that sound ok?
{quote}+ // TODO Creating a new parser for each line seems terribly inefficient but
+ // there's no public way to parse single lines via commons-csv. We should update
+ // it to create a LineParser class like this one.
Good idea. File JIRA with Apache Commons?{quote}
Yep, that's my plan. I was going to see how much work it is to put together the
patch myself and then apply it to the copy of the commons-csv in the phoenix
code tree as well, but I'll have to see how involved that is. If I do that,
I'll also try to get a better idea of when a release of commons-csv can be
expected so that the copy of the sources in Phoenix can be removed.
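
To illustrate the idea of a reusable line parser with a configurable separator, here is a rough sketch. It is a hypothetical, stdlib-only class: the name SimpleLineParser, its constructor arguments, and its quoting rules are assumptions made up for this example. It is not the commons-csv API and not the code in the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical reusable single-line CSV parser (illustration only; not part
 * of commons-csv or Phoenix). One instance is created once and then reused
 * for every input line, instead of building a new parser per line.
 */
final class SimpleLineParser {
    private final char separator;
    private final char quote;
    // Reused across calls so parsing a line does not allocate a new buffer.
    private final StringBuilder field = new StringBuilder();

    SimpleLineParser(char separator, char quote) {
        this.separator = separator;
        this.quote = quote;
    }

    /**
     * Splits one line into fields. Inside a quoted field the separator is
     * treated as data, and a doubled quote character is a literal quote.
     */
    List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        field.setLength(0);
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        field.append(quote); // escaped (doubled) quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    field.append(c);
                }
            } else if (c == quote) {
                inQuotes = true;             // opening quote
            } else if (c == separator) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());        // last (possibly empty) field
        return fields;
    }
}
```

In a mapper, an instance like this would be created once in setup and parse(...) called per record, for example new SimpleLineParser('|', '"').parse(value.toString()). It does not handle quoted fields that span multiple lines, which a real CSV parser needs to.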
> Improve MapReduce-based import
> ------------------------------
>
> Key: PHOENIX-129
> URL: https://issues.apache.org/jira/browse/PHOENIX-129
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
> Attachments: PHOENIX-129-3.0.patch, PHOENIX-129-master.patch
>
>
> In implementing PHOENIX-66, it was noted that the current MapReduce-based
> importer implementation has a number of issues, including the following:
> * CSV handling is largely replicated from the non-MR code, with no ability to
> specify custom separators
> * No automated tests, and code is written in a way that makes it difficult to
> test
> * Unusual custom config loading and handling instead of using
> GenericOptionsParser and ToolRunner and friends
> The initial work towards PHOENIX-66 included refactoring the MR importer
> enough to use common code, up until the development of automated testing
> exposed the fact that the MR importer could use some major refactoring.
> This ticket is a proposal to do a relatively major rework of the MR import,
> fixing the above issues. The biggest improvements that will result from this
> are a common codebase for handling CSV input, and the addition of automated
> testing for the MR import.