[ 
https://issues.apache.org/jira/browse/PHOENIX-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936059#comment-13936059
 ] 

James Taylor commented on PHOENIX-129:
--------------------------------------

bq. Just to clarify as well, I would actually consider completely removing the 
python script for running the mapred import, as I feel it will get in the way 
more than it helps if we're running the bulk import with the "hadoop jar" 
command. Does that sound ok?

I'm fine with doing whatever is the "right" and easy way of invoking 
map-reduce, but I'm probably not the best person to voice an opinion because 
I'm kind of the anti-map-reduce guy :-)  [~prkommireddi], [~ndimiduk] - what do 
you think?

bq. Yep, that's my plan. I was going to see how much work it is to put together 
the patch myself and then apply it to the copy of the commons-csv in the 
phoenix code tree as well, but I'll have to see how involved that is. 

Sounds good, but are you ok with getting as much done as possible so we can 
make our Tuesday RC date? We can always do this enhancement post 3.0.

> Improve MapReduce-based import
> ------------------------------
>
>                 Key: PHOENIX-129
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-129
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: PHOENIX-129-3.0.patch, PHOENIX-129-master.patch
>
>
> In implementing PHOENIX-66, it was noted that the current MapReduce-based 
> importer implementation has a number of issues, including the following:
> * CSV handling is largely replicated from the non-MR code, with no ability to 
> specify custom separators
> * No automated tests, and code is written in a way that makes it difficult to 
> test
> * Unusual custom config loading and handling instead of using 
> GenericOptionsParser and ToolRunner and friends
> The initial work towards PHOENIX-66 included refactoring the MR importer 
> enough to use common code, up until the development of automated testing 
> exposed the fact that the MR importer could use some major refactoring.
> This ticket is a proposal to do a relatively major rework of the MR import, 
> fixing the above issues. The biggest improvements that will result from this 
> are a common codebase for handling CSV input, and the addition of automated 
> testing for the MR import.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
