wa-ooo edited a comment on pull request #92:
URL: https://github.com/apache/sqoop/pull/92#issuecomment-800087918
> Hi hong,
>
> I've reviewed your changes (both GitHub and issues.apache.org), but TBH in
> the current state I'm concerned both about the intention of the change, and the
> correctness as well.
>
> First of all:
> Could you please provide a bit more detail around what performance gain
> you expect from this change and how you measured it? Could you please
> also provide some automated testcase which would show the effect of this gain,
> and would ensure we don't lose it in the future?
>
> On the front of correctness:
> SQOOP-3149 introduced the line you'd like to remove, and if I do remember
> correctly absolutely intentionally. Because of this reason:
> Could you please provide automated test cases which ensure that
> SQOOP-3149 changes won't be undone by your change (so we keep the current
> correctness around NULL column updates)?
>
> Many thanks in advance,
> Attila Szabo
----------
Hi @maugly24,
Thanks for reviewing this PR.
Our production environment was upgraded from CDH-5.13.0 to CDH-6.3.2,
and we found that the job importing data from RDM into HBase took 3~4 hours
longer on the 6.3.2 cluster (~50 million records). The MR log also showed far
more output records than on 5.13.0. The problem is hard to notice when
importing small tables, but the delay grows with the data volume.
So I compared the HBase import job code in Sqoop between the two versions
and found the problem here.
I thought this would be an obvious fix to HBase developers, so I did not
write much of a description in the issue.
The change is also easy to understand: the Put is already added to
mutationList when it is initialized, so it does not need to be added again
afterwards. Otherwise, the same Put is recorded repeatedly in the generated
HFile.
I looked at SQOOP-3149, and there is no explanation for this line.
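
To make the point about mutationList concrete, here is a minimal sketch. It is
not the Sqoop or HBase code: `FakePut`, `buildMutations` and the
`reAddPerColumn` flag are hypothetical stand-ins, and NULL handling (the
deletes that SQOOP-3149 is concerned with) is left out. It only shows that
re-adding the same Put object for every column makes the list grow with the
number of columns, while the Put itself carries the same cells either way.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicatePutSketch {

    // Hypothetical stand-in for an HBase Put: a row key plus the cells added to it.
    static final class FakePut {
        final String rowKey;
        final List<String> cells = new ArrayList<>();

        FakePut(String rowKey) {
            this.rowKey = rowKey;
        }

        void addColumn(String qualifier, String value) {
            cells.add(qualifier + "=" + value);
        }
    }

    // Builds the mutation list for one record.
    // reAddPerColumn = true mimics adding the Put again for every non-null column.
    static List<FakePut> buildMutations(String rowKey, String[] qualifiers,
                                        String[] values, boolean reAddPerColumn) {
        List<FakePut> mutationList = new ArrayList<>();
        FakePut put = new FakePut(rowKey);
        mutationList.add(put); // added once, when the Put is initialized

        for (int i = 0; i < qualifiers.length; i++) {
            if (values[i] == null) {
                continue; // NULL columns are skipped in this sketch (assumption)
            }
            put.addColumn(qualifiers[i], values[i]);
            if (reAddPerColumn) {
                mutationList.add(put); // same object again: a duplicate entry
            }
        }
        return mutationList;
    }

    public static void main(String[] args) {
        String[] qualifiers = {"a", "b", "c"};
        String[] values = {"1", "2", "3"};

        List<FakePut> withReAdd = buildMutations("row1", qualifiers, values, true);
        List<FakePut> addedOnce = buildMutations("row1", qualifiers, values, false);

        System.out.println("re-added per column: " + withReAdd.size() + " entries");
        System.out.println("added once:          " + addedOnce.size() + " entries");
    }
}
```

In this sketch every entry in the list is the same object holding all cells,
so each extra entry would be written out again in full. Whether the real
writer behaves the same way should be confirmed with a test against the
actual import job.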