hbase will split on row when the start and end row are the same, causing data loss
----------------------------------------------------------------------------------

                 Key: HADOOP-2493
                 URL: https://issues.apache.org/jira/browse/HADOOP-2493
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/hbase
            Reporter: Billy Pearson
            Priority: Critical


While testing hbase splits with my code, I was loading a table to build an 
inverted index on some links.

I was using the anchor text as the row key, with columns of the form 
parent:child as url:(siteurl). The cell value is the count of links pointing 
to siteurl with that anchor text.

A lot of sites have image links, and my test code uses "image" as the anchor 
text for them, so there are a lot of entries under the row "image". 
I changed the max file size of hbase to 16MB for testing and have been able to 
reproduce the same error.

When the table gets big, it splits on the row "image": "image" becomes the end 
key of one region and the start key of the next. Later it splits again, so that 
one of the regions has "image" as both its start key and its end key. After 
that it keeps splitting the region whose start key and end key are both 
"image", so I end up with multiple regions that all have "image" as start key 
and end key. Unless the master keeps track of the row key and the parent:child 
column data for the splits, I do not think all of the data will be returned 
when querying it.
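
To illustrate the concern, here is a minimal Python sketch. It assumes a 
region is located purely by comparing the requested row key against the sorted 
region start keys. The locate helper and the key layout are hypothetical and 
are not HBase code. Under that assumption, regions that share the start key 
"image" cannot be told apart by row key alone:

```python
from bisect import bisect_right


def locate(region_start_keys, row):
    """Return the index of the last region whose start key is <= row.

    Assumes region_start_keys is sorted and that the first region starts
    at "" (the empty key, meaning "beginning of table").
    """
    return bisect_right(region_start_keys, row) - 1


# Hypothetical layout after repeated splits on the row "image".
region_start_keys = ["", "image", "image", "image", "link"]

# Every lookup for "image" resolves to the same single region.
hit = locate(region_start_keys, "image")

# The other regions that start at "image" are never selected.
unreachable = [i for i, key in enumerate(region_start_keys)
               if key == "image" and i != hit]
```

In this model any cells stored in the regions listed in unreachable would 
never be consulted for a read of the row "image", which is the data loss I am 
worried about.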

I have attached a screenshot of my regions. I think there should be some logic 
so that a region does not split if its start and end row keys are the same. 
Otherwise we need to start keeping track of the start key and column data of 
each split on the master, so we can know where each row is in the database.
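
The first suggestion could look roughly like the following Python sketch. The 
names can_split and choose_split_row are made up for illustration and do not 
correspond to existing hbase functions. It assumes an empty key means 
"unbounded", and that a split point must fall on a row boundary so that a 
single row never spans two regions:

```python
def can_split(start_key, end_key):
    """Refuse to split a region whose start and end row keys are equal.

    Assumption: "" marks an unbounded start or end, so a region with
    both keys empty covers the whole table and may still split.
    """
    return not (start_key != "" and start_key == end_key)


def choose_split_row(sorted_rows):
    """Pick a split row from the region's cells, or None if impossible.

    sorted_rows holds one row key per stored cell, in sorted order, so a
    row with many columns appears many times. The chosen row must differ
    from the first row, otherwise the lower half would be empty and the
    row would be divided between two regions.
    """
    if not sorted_rows:
        return None
    first = sorted_rows[0]
    # Start at the midpoint and move forward to the next distinct row.
    for row in sorted_rows[len(sorted_rows) // 2:]:
        if row != first:
            return row
    return None
```

With these rules a region that holds only the row "image" yields no split row 
and is left alone, however large it grows.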

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
