[ 
https://issues.apache.org/jira/browse/PHOENIX-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605762#comment-14605762
 ] 

James Taylor commented on PHOENIX-2067:
---------------------------------------

This bug potentially affects any PK column of variable length that is declared 
as DESC sort order. The particular case is when one value is a subpart of 
another value. Simplest case would be this:
{code}
CREATE TABLE t (k VARCHAR DESC PRIMARY KEY);
UPSERT INTO t VALUES ('a');
UPSERT INTO t VALUES('ab');
SELECT * FROM t ORDER BY k;
{code}
The 'ab' row should appear before the 'a' row. This is due to there being no 
terminator at the end of the row key as well as because we're not inverting the 
null/zero terminator/separator byte between parts of the row key.

The fix is to:
- use a 255 byte separator byte instead of a 0 byte separator between row key 
parts
- include a 255 byte terminator at the end of the row key
- include a 255 byte for any null value at the end of the row key (without 
this, row keys with null values in the middle of the row key might sort before 
a row key with null values at the end of the row key)
- disallow a DESC pk column to be added to the PK in an ALTER TABLE <table> ADD 
<column> as it would require updating existing data.

We need a script that users can run to fix their existing data (or at least 
identify that there's an issue).

> Sort order incorrect for variable length DESC columns
> -----------------------------------------------------
>
>                 Key: PHOENIX-2067
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2067
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.4.0
>         Environment: HBase 0.98.6-cdh5.3.0
> jdk1.7.0_67 x64
> CentOS release 6.4 (2.6.32-358.el6.x86_64)
>            Reporter: Mykola Komarnytskyy
>            Assignee: James Taylor
>
> Steps to reproduce:
> 1. Create a table: 
> CREATE TABLE mytable (id BIGINT not null PRIMARY KEY, timestamp BIGINT, 
> log_message varchar) IMMUTABLE_ROWS=true, SALT_BUCKETS=16;
> 2. Create two indexes:
> CREATE INDEX mytable_index_search ON mytable(timestamp,id) INCLUDE 
> (log_message) SALT_BUCKETS=16;
> CREATE INDEX mytable_index_search_desc ON mytable(timestamp DESC,id DESC) 
> INCLUDE (log_message) SALT_BUCKETS=16;
> 3. Upsert values:
> UPSERT INTO mytable VALUES(1, 1434983826018, 'message1');
> UPSERT INTO mytable VALUES(2, 1434983826100, 'message2');
> UPSERT INTO mytable VALUES(3, 1434983826101, 'message3');
> UPSERT INTO mytable VALUES(4, 1434983826202, 'message4');
> 4. Sort DESC by timestamp:
> select timestamp,id,log_message from mytable ORDER BY timestamp DESC;
> Failure: data is sorted incorrectly. In case when we have two longs which  
> are different only by last two digits (e.g. 1434983826155, 1434983826100)  
> and one of the long ends with '00' we receive incorrect order. 
> Sorting result:
> 1434983826202
> 1434983826100
> 1434983826101
> 1434983826018



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to