[
https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335941#comment-16335941
]
Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------
Hi [~dvoros],
Option B seems to be a good direction to me, I agree that ideally the target
HBase table should reflect that a column is set to null in the source RDBMS and
I would not make this dependent on incremental mode since in theory only
"lastmodified" mode should change already existing rows in the target table.
However, after thinking about this more thoroughly, a performance-related
question has arisen. Let's say the users want to import new rows (so it would
be a regular import, not an incremental one) from a wide table where most of the
columns are null and only a couple of values are defined. In this case the current
implementation would use only a few Put commands but the suggested
implementation would need significantly more Put commands just to add the null
strings to the HBase table. I think this is something the users would not
want in this case. On the other hand, you are right that it would be great if
we could represent nulls consistently in the HBase table, with no difference
between regular import and incremental import...
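To make the concern concrete, here is a rough sketch (not Sqoop code) that
counts how many cells would be written for one wide, mostly-null row under each
approach. The class and method names are made up for illustration, and the
100-column row is an assumed example, not a measurement.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Sqoop.
public class NullCostSketch {

    /** Cells written when null columns are skipped (the current behavior). */
    static int cellsWhenSkippingNulls(List<String> row) {
        int count = 0;
        for (String value : row) {
            if (value != null) {
                count++;
            }
        }
        return count;
    }

    /** Cells written when every null is replaced by a null string. */
    static int cellsWhenWritingNullStrings(List<String> row) {
        return row.size();
    }

    public static void main(String[] args) {
        // Assumed example: 100 columns, only 3 of them defined.
        String[] values = new String[100];
        values[0] = "a";
        values[41] = "b";
        values[99] = "c";
        List<String> row = Arrays.asList(values);

        System.out.println("skip nulls:   " + cellsWhenSkippingNulls(row));      // 3
        System.out.println("null strings: " + cellsWhenWritingNullStrings(row)); // 100
    }
}
```

With one Put per column, as described in the issue, the number of cells is also
the number of Put commands, so the gap grows with the width of the table.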
Considering the above I suggest the following solution:
* Introduce an --hbase-null-incremental-mode (or similar name) option which
would enable the users to specify what Sqoop should do with the null values in
the source RDBMS table. The options could be:
** ignore (default) - This would be basically the behavior before SQOOP-3149
** delete - This would be similar to the behavior introduced in SQOOP-3149 but
we would delete the whole history of the column, not only the last version
** null-string - Sqoop would put the null string specified in the new
--hbase-null-string option instead of null
* Introduce a new option called --hbase-null-string which could be used to
specify which null string Sqoop should put into the HBase table instead of
null. This could be used for the regular imports too, but if it is not
specified, Sqoop should not write null strings, to avoid the above-mentioned
potential performance problem.
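A minimal sketch of how the three proposed modes could translate into per-row
operations is below. It is only an illustration under assumptions: the option
names are the proposals above (nothing here exists in Sqoop), operations are
modeled as plain strings instead of HBase mutations, and failing when
null-string mode is used without a null string is my own guess at a sensible
behavior. For the delete mode, as far as I know the HBase client's
Delete.addColumns removes all versions of a column while Delete.addColumn
removes only the latest one, which is the distinction this issue is about.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Sqoop.
public class HBaseNullModeSketch {

    /** Values of the proposed --hbase-null-incremental-mode option. */
    enum NullMode {
        IGNORE, DELETE, NULL_STRING;

        static NullMode fromOption(String value) {
            if (value == null) {
                return IGNORE; // proposed default
            }
            switch (value) {
                case "ignore": return IGNORE;
                case "delete": return DELETE;
                case "null-string": return NULL_STRING;
                default:
                    throw new IllegalArgumentException("Unknown null mode: " + value);
            }
        }
    }

    /** Plans the operations for one row; a null map value is a NULL column. */
    static List<String> planRow(Map<String, String> row, NullMode mode, String nullString) {
        if (mode == NullMode.NULL_STRING && nullString == null) {
            // Assumption: null-string mode requires --hbase-null-string.
            throw new IllegalArgumentException("null-string mode needs a null string");
        }
        List<String> ops = new ArrayList<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            String column = cell.getKey();
            String value = cell.getValue();
            if (value != null) {
                ops.add("PUT " + column + "=" + value);
            } else if (mode == NullMode.DELETE) {
                ops.add("DELETE_ALL_VERSIONS " + column);
            } else if (mode == NullMode.NULL_STRING) {
                ops.add("PUT " + column + "=" + nullString);
            }
            // IGNORE: a null column produces no operation at all.
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "alice");
        row.put("email", null);
        for (NullMode mode : NullMode.values()) {
            System.out.println(mode + " -> " + planRow(row, mode, "\\N"));
        }
    }
}
```

Only the null-string mode adds a write for every null column, which is why it
would stay opt-in.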
The benefit of this solution would be that the users would have more
possibilities to control how the null values are handled and it would not
change the behavior unexpectedly (I might be paranoid but I feel introducing
the new --hbase-null-string is safer than overloading the already existing
--null-string).
Implementing this might be overkill for addressing this bug, so we could move
the null-string handling part to another Jira as well.
> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
> Key: SQOOP-3267
> URL: https://issues.apache.org/jira/browse/SQOOP-3267
> Project: Sqoop
> Issue Type: Bug
> Components: hbase-integration
> Affects Versions: 1.4.7
> Reporter: Daniel Voros
> Assignee: Daniel Voros
> Priority: Major
> Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last
> version of a column when the corresponding cell was set to NULL in the source
> table.
> This can lead to unexpected and misleading results if the row has been
> transferred multiple times, which can easily happen if it's being modified on
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a
> single Put per row as before. This could probably lead to a performance drop
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be
> the expected behavior here?