[
https://issues.apache.org/jira/browse/PHOENIX-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabriel Reid updated PHOENIX-129:
---------------------------------
Attachment: PHOENIX-129-3.0.patch
PHOENIX-129-master.patch
Here’s my first pass at the reworked MapReduce import. It’s basically a full
rewrite of the original, with the following important differences:
* Specifying custom field separators is now supported
* The main codebase for CSV handling from the non-MR implementation is used,
meaning arrays are also now supported
* Package is changed to “org.apache.phoenix.mr”, as the old package naming
didn’t make a lot of sense
* There’s a hook to do custom updates to the KeyValues before they are written
via a plugin — this is to facilitate coprocessor-style things when doing bulk
imports
* Standard logging and MR counters are used
* The whole concept of MR configuration is used (i.e. the tool is designed to
run with the “hadoop” command without specifying the job tracker, etc)
* Added automated testing, including integration tests which run actual MR tasks
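To illustrate the pre-write plugin hook mentioned above, here’s a minimal sketch of the idea. The interface and class names (KeyValueUpdater, TagAppendingUpdater) are hypothetical, and a plain String stands in for the HBase KeyValue type so the example is self-contained; the actual hook in the patch operates on real KeyValues.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a pre-write plugin hook: the importer calls it
// once per row, and the plugin may rewrite the KeyValues before they are
// written (coprocessor-style behaviour during a bulk import).
interface KeyValueUpdater {
    List<String> preWrite(List<String> keyValues);
}

// Example plugin that stamps every cell value as it passes through.
class TagAppendingUpdater implements KeyValueUpdater {
    @Override
    public List<String> preWrite(List<String> keyValues) {
        List<String> out = new ArrayList<>();
        for (String kv : keyValues) {
            out.add(kv + "#imported");
        }
        return out;
    }
}

public class HookDemo {
    public static void main(String[] args) {
        KeyValueUpdater updater = new TagAppendingUpdater();
        List<String> row = new ArrayList<>();
        row.add("cf:col1=a");
        row.add("cf:col2=b");
        // prints [cf:col1=a#imported, cf:col2=b#imported]
        System.out.println(updater.preWrite(row));
    }
}
```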
That being said, it’s important to note that this version (in its current
form) also introduces some important backwards incompatibilities. The
jobtracker and namenode addresses are no longer supplied explicitly as
parameters, the custom configuration file loading is no longer supported, and
there is no longer a facility to run a “create table” statement as part of the
tool.
My rationale for not supporting the custom config loading, and for not
supplying the namenode/jobtracker parameters in the tool itself, is that both
of these scenarios can be covered by the script used to invoke the tool, if we
want. Starting the tool with any custom configuration can be done by supplying
-D options on the command line that is invoked by the startup script, if that
is what we want to support.
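For example, an invocation along those lines might look something like the
following (the jar name, main class, and options shown here are illustrative
only, not the actual interface of the tool):

```shell
# Hypothetical example: custom config is passed via standard -D options,
# which GenericOptionsParser/ToolRunner pick up automatically.
hadoop jar phoenix-mapreduce.jar org.apache.phoenix.mr.CsvBulkLoader \
    -D fs.defaultFS=hdfs://namenode:8020 \
    -D mapred.job.queue.name=import \
    --table MY_TABLE \
    --input /data/input.csv
```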
On the other hand, I’d like to make a case for not putting in the effort to
support these things at all. Considering that the mapreduce bulk loader can
only be used by someone who has access to HDFS (i.e. to supply their data to
the tool), it’s a good bet that they have the “hadoop” command available to
them, in which case running the job with “hadoop jar” seems very acceptable to
me. This approach would probably be best-served by creating a map-reduce
specific assembly jar.
I’m aware that there will likely be other opinions on this topic, so I’d love
to hear them. In any case, here are the patches for master and 3.0 for the
loader code as it is now.
> Improve MapReduce-based import
> ------------------------------
>
> Key: PHOENIX-129
> URL: https://issues.apache.org/jira/browse/PHOENIX-129
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
> Attachments: PHOENIX-129-3.0.patch, PHOENIX-129-master.patch
>
>
> In implementing PHOENIX-66, it was noted that the current MapReduce-based
> importer implementation has a number of issues, including the following:
> * CSV handling is largely replicated from the non-MR code, with no ability to
> specify custom separators
> * No automated tests, and code is written in a way that makes it difficult to
> test
> * Unusual custom config loading and handling instead of using
> GenericOptionsParser and ToolRunner and friends
> The initial work towards PHOENIX-66 included refactoring the MR importer
> enough to use common code, until the development of automated testing
> exposed the fact that the MR importer could use some major refactoring.
> This ticket is a proposal to do a relatively major rework of the MR import,
> fixing the above issues. The biggest improvements that will result from this
> are a common codebase for handling CSV input, and the addition of automated
> testing for the MR import.
--
This message was sent by Atlassian JIRA
(v6.2#6252)