[
https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333491#comment-16333491
]
Daniel Voros commented on SQOOP-3267:
-------------------------------------
[~vasas], [~maugli] thank you both for your replies!
I agree with you on keeping the history. I see two ways to do that.
Option A) is a sort-of workaround. Document that Sqoop will delete all previous
versions of a column when it is updated to NULL in the source table, and ask
users to set KEEP_DELETED_CELLS on the underlying HBase table if they still
want to preserve history.
Option B) is inserting an empty string for NULL values. In detail:
- Whenever we're importing a NULL value (whether it comes from a new row or an
updated one) we insert a special string (let's call it NULL_STRING). Note that
it would be better to insert NULL, but HBase lacks the notion of NULL.
- The value of NULL_STRING is "" (empty string) by default but is
configurable. (Probably via the already existing {{--null-string}} argument.)
- This behavior should NOT depend on the incremental mode ("append" or
"lastmodified").
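To make Option B) more concrete, here is a rough sketch of the substitution in plain Java. It is not the patch and does not use the Sqoop or HBase APIs: the class NullStringSketch, the method toCellValue and the toy VersionedCell (a timestamp-to-value map standing in for one HBase cell) are all hypothetical names made up for illustration. The point is only that a NULL update becomes one more stored version instead of a delete, so the earlier versions stay readable.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names come from Sqoop or HBase. */
class NullStringSketch {
    // Default substitute for NULL as proposed above; assumed to be configurable.
    static final String DEFAULT_NULL_STRING = "";

    /** Returns the bytes to store for one column value; NULL becomes nullString. */
    static byte[] toCellValue(Object sourceValue, String nullString) {
        String text = (sourceValue == null) ? nullString : sourceValue.toString();
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Toy stand-in for one versioned cell: timestamp -> stored value. */
    static final class VersionedCell {
        private final TreeMap<Long, String> versions = new TreeMap<>();

        // Every import writes a new version, also when the source value is NULL.
        void put(long timestamp, Object sourceValue, String nullString) {
            byte[] bytes = toCellValue(sourceValue, nullString);
            versions.put(timestamp, new String(bytes, StandardCharsets.UTF_8));
        }

        String latest() {
            return versions.lastEntry().getValue();
        }

        int versionCount() {
            return versions.size();
        }
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell();
        cell.put(1L, "alice", DEFAULT_NULL_STRING);
        cell.put(2L, "bob", DEFAULT_NULL_STRING);
        cell.put(3L, null, DEFAULT_NULL_STRING); // source column updated to NULL

        System.out.println("latest is empty: " + cell.latest().isEmpty()); // true
        System.out.println("versions kept: " + cell.versionCount());       // 3
    }
}
```

The same substitution is applied regardless of whether the row is new or updated, which matches the point above that the behavior should not depend on the incremental mode.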
Two notes:
- Option B) is similar to what was introduced in Phoenix in PHOENIX-1578.
(When using the STORE_NULLS=true table option, there's no way to tell a NULL
value from an empty string in Phoenix.)
- I've also tested how Hive treats NULL values when storing a table in HBase.
Empty (or deleted) columns are displayed as NULL when reading the table, but
updating a column to NULL in Hive fails (see HIVE-3336).
> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
> Key: SQOOP-3267
> URL: https://issues.apache.org/jira/browse/SQOOP-3267
> Project: Sqoop
> Issue Type: Bug
> Components: hbase-integration
> Affects Versions: 1.4.7
> Reporter: Daniel Voros
> Assignee: Daniel Voros
> Priority: Major
> Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last
> version of a column when the corresponding cell was set to NULL in the source
> table.
> This can lead to unexpected and misleading results if the row has been
> transferred multiple times, which can easily happen if it's being modified on
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a
> single Put per row as before. This could probably lead to a performance drop
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be
> the expected behavior here?