[
https://issues.apache.org/jira/browse/PHOENIX-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936311#comment-13936311
]
Gabriel Reid commented on PHOENIX-129:
--------------------------------------
I've integrated the items pointed out by Nick & James and created a review on
ReviewBoard (see https://reviews.apache.org/r/19257/) to avoid having to set
things up in GitHub, etc.
These new changes also include an assembly for the mapreduce jar, which has
been tested on a real cluster.
Also, thanks for taking a look [~prkommireddi]. Here are my thoughts on the
points you brought up:
{quote}Do we want remove the old MR bulk loader entirely, or keep it around for
a release, mark it deprecated and communicate to the users that it would not be
supported from the next release onwards?{quote}
As far as I know, the old MR bulk loader doesn't currently work, at least not
in the 3.0 branch, so keeping it would first require fixing it. I'm not
familiar with how widely the old version was being used, and I'm not too
enthusiastic about putting in the effort to get it working again and (probably)
writing tests to make sure it's working. On the other hand, if removing it is
going to be a big problem for users, then we should probably bite the bullet
and fix it.
{quote}I believe the Bulk Loader creates HFiles and does not directly write to
the Phoenix table via a connection? PhoenixHBaseStorage actually writes to a
table directly and there are custom OutputFormat, RecordWriter and
OutputCommitter implementations. I am guessing those wouldn't be required here
as creating HFiles might be more efficient?{quote}
Correct, it's writing to HFiles. In my experience, using HFiles for bulk import
is significantly faster than writing directly to HBase, although it does have
some drawbacks, including the fact that coprocessors don't get a chance to see
incoming changes. The current patch includes a plugin hook to allow
manipulating the KeyValues before they are actually written.
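To make the hook idea concrete, here is a minimal plain-Java sketch of what such a pre-write plugin might look like. The names (`KeyValue` stand-in, `PreWriteProcessor`, `process`) are illustrative assumptions, not the actual interface in the patch, and the `KeyValue` record is a simplified model rather than HBase's real class:

```java
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for HBase's KeyValue (illustrative only).
record KeyValue(String row, String qualifier, byte[] value) {}

// Hypothetical hook: implementations get a chance to filter or rewrite
// KeyValues before the bulk loader writes them into HFiles.
interface PreWriteProcessor {
    List<KeyValue> process(List<KeyValue> kvs);
}

public class HookDemo {
    public static void main(String[] args) {
        // Example processor that drops KeyValues with empty values.
        PreWriteProcessor dropEmpty = kvs -> kvs.stream()
                .filter(kv -> kv.value().length > 0)
                .collect(Collectors.toList());

        List<KeyValue> input = List.of(
                new KeyValue("row1", "col1", new byte[]{1}),
                new KeyValue("row1", "col2", new byte[0]));

        List<KeyValue> out = dropEmpty.process(input);
        System.out.println(out.size()); // prints 1
    }
}
```

Because the hook runs before HFile creation, it can partially compensate for the fact that coprocessors never see the incoming changes.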
> Improve MapReduce-based import
> ------------------------------
>
> Key: PHOENIX-129
> URL: https://issues.apache.org/jira/browse/PHOENIX-129
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
> Attachments: PHOENIX-129-3.0.patch, PHOENIX-129-3.0_2.patch,
> PHOENIX-129-master.patch, PHOENIX-129-master_2.patch
>
>
> In implementing PHOENIX-66, it was noted that the current MapReduce-based
> importer implementation has a number of issues, including the following:
> * CSV handling is largely replicated from the non-MR code, with no ability to
> specify custom separators
> * No automated tests, and code is written in a way that makes it difficult to
> test
> * Unusual custom config loading and handling instead of using
> GenericOptionsParser and ToolRunner and friends
> The initial work towards PHOENIX-66 included refactoring the MR importer
> enough to use common code, until the development of automated testing
> exposed how much major refactoring the MR importer really needed.
> This ticket is a proposal to do a relatively major rework of the MR import,
> fixing the above issues. The biggest improvements that will result from this
> are a common codebase for handling CSV input, and the addition of automated
> testing for the MR import.
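As an aside on the config-handling point above, the pattern Hadoop's ToolRunner and GenericOptionsParser provide can be sketched in plain Java. This is a simplified stand-in (the `Tool` interface, `run` method, and the `phoenix.csv.delimiter` key are illustrative, not the Hadoop or Phoenix APIs): `-Dkey=value` arguments are peeled off into configuration and everything else is passed through to the tool:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToolRunnerSketch {
    // Simplified stand-in for Hadoop's Tool interface.
    interface Tool {
        int run(String[] args, Map<String, String> conf);
    }

    // Mimics GenericOptionsParser: -Dkey=value pairs become config entries,
    // everything else is forwarded to the tool as a plain argument.
    static int run(Tool tool, String[] args) {
        Map<String, String> conf = new HashMap<>();
        List<String> remaining = new ArrayList<>();
        for (String a : args) {
            int eq = a.indexOf('=');
            if (a.startsWith("-D") && eq > 2) {
                conf.put(a.substring(2, eq), a.substring(eq + 1));
            } else {
                remaining.add(a);
            }
        }
        return tool.run(remaining.toArray(new String[0]), conf);
    }

    public static void main(String[] args) {
        // A toy "importer" that reads one config key and one file argument.
        Tool importer = (toolArgs, conf) -> {
            System.out.println(conf.get("phoenix.csv.delimiter") + " " + toolArgs[0]);
            return 0;
        };
        run(importer, new String[]{"-Dphoenix.csv.delimiter=|", "input.csv"});
    }
}
```

Building on this standard pattern, rather than custom option handling, is what lets users pass arbitrary Hadoop configuration on the command line without the importer parsing it itself.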
--
This message was sent by Atlassian JIRA
(v6.2#6252)