[ 
https://issues.apache.org/jira/browse/PHOENIX-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Reid updated PHOENIX-129:
---------------------------------

    Attachment: PHOENIX-129-3.0.patch
                PHOENIX-129-master.patch

Here’s my first pass at the reworked MapReduce import. It’s basically a full 
rewrite of the original, with the following important differences:
* Specifying custom field separators is now supported
* CSV handling now reuses the main codebase from the non-MR implementation, 
which means arrays are now supported as well
* Package is changed to “org.apache.phoenix.mr”, as the old package naming 
didn’t make a lot of sense
* There’s a hook to do custom updates to the KeyValues before they are written 
via a plugin — this is to facilitate coprocessor-style things when doing bulk 
imports
* Standard logging and MR counters are used
* The whole concept of MR configuration is used (i.e. the tool is designed to 
run with the “hadoop” command without specifying the job tracker, etc)
* Added automated testing, including integration tests which run actual MR tasks

That being said, it’s important to note that this version (in its current 
form) also introduces some important backwards incompatibilities. The 
jobtracker and namenode addresses are no longer supplied explicitly as 
parameters, the custom configuration file loading is no longer supported, and 
there is no longer a facility to run a “create table” statement as part of the 
tool.

My rationale for not supporting custom config loading or supplying the 
namenode/jobtracker parameters in the tool itself is that both of these 
scenarios can be covered by the script used to invoke the tool. Starting 
the tool with any custom configuration can be done by supplying -D options 
on the command line that is invoked by the startup script, if that is what 
we want to support.
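To make that concrete, a wrapper script along these lines could cover the custom-configuration case. Note that the jar name, main class, and tool options below are hypothetical placeholders for illustration; only the -D properties are standard Hadoop configuration keys:

```shell
# Hypothetical startup script: any custom configuration is passed through
# as -D generic options, which ToolRunner/GenericOptionsParser apply to
# the job Configuration before the tool runs.
hadoop jar phoenix-mapreduce.jar org.apache.phoenix.mr.CsvBulkLoadTool \
    -D mapred.job.tracker=jobtracker.example.com:8021 \
    -D fs.default.name=hdfs://namenode.example.com:8020 \
    --table MY_TABLE \
    --input /data/input.csv
```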

On the other hand, I’d like to make a case for not putting in the effort to 
support these things at all. Considering that the MapReduce bulk loader can 
only be used by someone who has access to HDFS (i.e. to supply their data to 
the tool), it’s a good bet that they have the “hadoop” command available to 
them, in which case running the job with “hadoop jar” seems perfectly 
acceptable to me. This approach would probably be best served by creating a 
MapReduce-specific assembly jar.
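As a sketch of that approach (the assembly jar name and options here are hypothetical placeholders): on any node where the “hadoop” command is configured, the cluster addresses are picked up from the Hadoop configuration directory, so the invocation needs no jobtracker or namenode parameters at all:

```shell
# Cluster addresses are resolved from $HADOOP_CONF_DIR
# (core-site.xml / mapred-site.xml), not from tool arguments.
# The jar name and tool options are hypothetical placeholders.
hadoop jar phoenix-mapreduce-assembly.jar \
    --table MY_TABLE \
    --input /data/input.csv
```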

I’m aware that there will likely be other opinions on this topic, so I’d love 
to hear them. In any case, here are the patches for master and 3.0 for the 
loader code as it is now.

> Improve MapReduce-based import
> ------------------------------
>
>                 Key: PHOENIX-129
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-129
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: PHOENIX-129-3.0.patch, PHOENIX-129-master.patch
>
>
> In implementing PHOENIX-66, it was noted that the current MapReduce-based 
> importer implementation has a number of issues, including the following:
> * CSV handling is largely replicated from the non-MR code, with no ability to 
> specify custom separators
> * No automated tests, and code is written in a way that makes it difficult to 
> test
> * Unusual custom config loading and handling instead of using 
> GenericOptionsParser, ToolRunner, and friends
>
> The initial work towards PHOENIX-66 included refactoring the MR importer 
> enough to use common code, until the development of automated testing 
> exposed the fact that the MR importer could use some major refactoring.
>
> This ticket is a proposal to do a relatively major rework of the MR import, 
> fixing the above issues. The biggest improvements that will result from this 
> are a common codebase for handling CSV input, and the addition of automated 
> testing for the MR import.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
