[ 
https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292487#comment-16292487
 ] 

Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------

Hi [~dvoros],

Thank you for reporting this bug, this is a really nice catch!
I have created a couple of test cases to clarify the issue.

1) Updating a column without history in HBase

Let's create the test tables and import the initial values to HBase:

{code:java}
export MYCONN=...
export MYUSER=...

sqoop eval --connect $MYCONN --username $MYUSER --query "drop table hbase_test"
sqoop eval --connect $MYCONN --username $MYUSER --query "create table 
hbase_test(id int primary key ,name varchar(30), date_modified DATETIME)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test 
(id, name, date_modified) VALUES(1, 'name1', CURRENT_TIMESTAMP)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test 
(id, name, date_modified) VALUES(2, 'name2', CURRENT_TIMESTAMP)"

echo "disable 'hbase_test'" | hbase shell
echo "drop 'hbase_test'" | hbase shell
echo "create 'hbase_test', 'data'" | hbase shell

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" 
--hbase-table hbase_test --column-family data -m 1

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    
column=data:date_modified, timestamp=1513340516549, value=2017-12-15 12:17:26.0
 1                                                    column=data:name, 
timestamp=1513340516549, value=name1
 2                                                    
column=data:date_modified, timestamp=1513340516549, value=2017-12-15 12:17:29.0
 2                                                    column=data:name, 
timestamp=1513340516549, value=name2
2 row(s)

{code}

The output of the HBase scan is the expected, let's update a column to null in 
the database and execute an incremental import:

{code:java}
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set 
name=null, date_modified=CURRENT_TIMESTAMP where id = 1"

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" 
--hbase-table hbase_test --column-family data -m 1 --incremental lastmodified 
--check-column date_modified --last-value "2017-12-15 12:17:26.0" --merge-key id

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    
column=data:date_modified, timestamp=1513340614595, value=2017-12-15 12:23:00.0
 2                                                    
column=data:date_modified, timestamp=1513340614595, value=2017-12-15 12:17:29.0
 2                                                    column=data:name, 
timestamp=1513340614595, value=name2
2 row(s)

{code}

The output of the HBase scan is pretty much the behavior expected after 
SQOOP-3149, the name column is updated to null, so it "disappeared" from the 
output of the scan, if we check that column in HBase using the Java API we will 
get null value. So far so good.

2) Updating a column with history in HBase

This test case is really similar to the previous one but first we update the 
name column to a non-null value:


{code:java}
export MYCONN=jdbc:mysql://`hostname`/test
export MYUSER=sqoop

sqoop eval --connect $MYCONN --username $MYUSER --query "drop table hbase_test"
sqoop eval --connect $MYCONN --username $MYUSER --query "create table 
hbase_test(id int primary key ,name varchar(30), date_modified DATETIME)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test 
(id, name, date_modified) VALUES(1, 'name1', CURRENT_TIMESTAMP)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test 
(id, name, date_modified) VALUES(2, 'name2', CURRENT_TIMESTAMP)"

echo "disable 'hbase_test'" | hbase shell
echo "drop 'hbase_test'" | hbase shell
echo "create 'hbase_test', 'data'" | hbase shell

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" 
--hbase-table hbase_test --column-family data -m 1

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    
column=data:date_modified, timestamp=1513335407294, value=2017-12-15 10:49:05.0
 1                                                    column=data:name, 
timestamp=1513335407294, value=name1
 2                                                    
column=data:date_modified, timestamp=1513335407294, value=2017-12-15 10:49:09.0
 2                                                    column=data:name, 
timestamp=1513335407294, value=name2
2 row(s)


sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set 
name='name_modified', date_modified=CURRENT_TIMESTAMP where id = 1"

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" 
--hbase-table hbase_test --column-family data -m 1 --incremental lastmodified 
--check-column date_modified --last-value "2017-12-15 10:49:09.0" --merge-key id

echo "scan 'hbase_test'" | hbase shell


scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    
column=data:date_modified, timestamp=1513335602892, value=2017-12-15 10:58:44.0
 1                                                    column=data:name, 
timestamp=1513335602892, value=name_modified
 2                                                    
column=data:date_modified, timestamp=1513335602892, value=2017-12-15 10:49:09.0
 2                                                    column=data:name, 
timestamp=1513335602892, value=name2
2 row(s)

{code}

Everything is expected so far, the output of the second HBase scan command 
shows us the updated name value. Now let's set that column to null and see what 
happens in HBase:

{code:java}
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set 
name=null, date_modified=CURRENT_TIMESTAMP where id = 1"

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" 
--hbase-table hbase_test --column-family data -m 1 --incremental lastmodified 
--check-column date_modified --last-value "2017-12-15 10:58:44.0" --merge-key id

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    
column=data:date_modified, timestamp=1513335804486, value=2017-12-15 11:01:35.0
 1                                                    column=data:name, 
timestamp=1513335407294, value=name1
 2                                                    
column=data:date_modified, timestamp=1513335602892, value=2017-12-15 10:49:09.0
 2                                                    column=data:name, 
timestamp=1513335602892, value=name2
2 row(s)
Took 0.4017 seconds
{code}

The output of the last scan shows the bug reported in this JIRA. We can see 
that the data:date_modified column is updated in row 1, but the value data:name 
column is not null but it is "reset" to its previous value.

I think we all agree that the current behavior is not what we want the question 
is how we should address this bug.
I am not a big fan of deleting the full history, since some users might rely on 
it, if it is possible we should put a null value to the HBase table as well.

Regards,
Szabolcs

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>         Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last 
> version of a column when the corresponding cell was set to NULL in the source 
> table.
> This can lead to unexpected and misleading results if the row has been 
> transferred multiple times, which can easily happen if it's being modified on 
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a 
> single Put per row as before. This could probably lead to a performance drop 
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be 
> the expected behavior here?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to