[ https://issues.apache.org/jira/browse/PHOENIX-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936311#comment-13936311 ]

Gabriel Reid commented on PHOENIX-129:
--------------------------------------

I've integrated the items pointed out by Nick & James, and created a review on 
ReviewBoard (see https://reviews.apache.org/r/19257/) to avoid having to set 
things up on GitHub, etc.

These new changes also include an assembly for the mapreduce jar, which has 
been tested on a real cluster.

Also, thanks for taking a look [~prkommireddi]. Here are my thoughts on the 
points you brought up:

{quote}Do we want remove the old MR bulk loader entirely, or keep it around for 
a release, mark it deprecated and communicate to the users that it would not be 
supported from the next release onwards?{quote}

As far as I know, the old MR bulk loader doesn't currently work, at least not 
in the 3.0 branch, so including it would involve first fixing it. I'm also not 
familiar with how widely the old version was being used, and I'm not too 
enthusiastic about putting in the effort to get it working again and (probably) 
writing tests to make sure it's working. On the other hand, if this is going to 
be a big problem for users then we should probably bite the bullet and fix it.

{quote}I believe the Bulk Loader creates HFiles and does not directly write to 
the Phoenix table via a connection? PhoenixHBaseStorage actually writes to a 
table directly and there are custom OutputFormat, RecordWriter and 
OutputCommitter implementations. I am guessing those wouldn't be required here 
as creating HFiles might be more efficient?{quote}

Correct, it's writing to HFiles. In my experience, using HFiles for bulk import 
is significantly faster than writing directly to HBase, although it does have 
some drawbacks, including the fact that coprocessors don't get a chance to see 
incoming changes. The current patch includes a plugin hook to allow 
manipulating the KeyValues before they are actually written.
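To make the shape of that hook concrete, here is a minimal sketch in plain Java. All names below (PreWriteHook, FilteringHook) are illustrative assumptions, not the actual interface in the patch; the real hook would operate on HBase KeyValue objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-write hook: invoked per row after CSV parsing, before the
// resulting KeyValues are handed to the HFile writer. The names here are
// assumptions for illustration, not the API from the PHOENIX-129 patch.
interface PreWriteHook<KV> {
    // The returned list replaces the input list for the row being written.
    List<KV> process(String tableName, List<KV> keyValues);
}

// Example implementation: drops entries flagged for exclusion. A real hook
// could just as well rewrite values, add derived entries, or leave the list
// untouched.
class FilteringHook implements PreWriteHook<String> {
    @Override
    public List<String> process(String tableName, List<String> keyValues) {
        List<String> kept = new ArrayList<>();
        for (String kv : keyValues) {
            if (!kv.startsWith("EXCLUDE:")) {
                kept.add(kv);
            }
        }
        return kept;
    }
}
```

Since the hook runs in the mapper, it sees every change that would otherwise have gone through a coprocessor, which is the gap it is meant to cover.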


> Improve MapReduce-based import
> ------------------------------
>
>                 Key: PHOENIX-129
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-129
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: PHOENIX-129-3.0.patch, PHOENIX-129-3.0_2.patch, 
> PHOENIX-129-master.patch, PHOENIX-129-master_2.patch
>
>
> In implementing PHOENIX-66, it was noted that the current MapReduce-based 
> importer implementation has a number of issues, including the following:
> * CSV handling is largely replicated from the non-MR code, with no ability to 
> specify custom separators
> * No automated tests, and code is written in a way that makes it difficult to 
> test
> * Unusual custom config loading and handling instead of using 
> GenericOptionsParser, ToolRunner, and friends
> The initial work towards PHOENIX-66 included refactoring the MR importer 
> enough to use common code, until the development of automated testing 
> exposed the fact that the MR importer could use some major refactoring.
> This ticket is a proposal to do a relatively major rework of the MR import, 
> fixing the above issues. The biggest improvements that will result from this 
> are a common codebase for handling CSV input, and the addition of automated 
> testing for the MR import.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
