[ https://issues.apache.org/jira/browse/HBASE-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Hofhansl resolved HBASE-12363.
-----------------------------------
       Resolution: Fixed
      Hadoop Flags: Reviewed

Committed to 0.98, branch-1, and master. Thanks for the reviews.

> Improve how KEEP_DELETED_CELLS works with MIN_VERSIONS
> ------------------------------------------------------
>
>                 Key: HBASE-12363
>                 URL: https://issues.apache.org/jira/browse/HBASE-12363
>             Project: HBase
>          Issue Type: Sub-task
>          Components: regionserver
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>              Labels: Phoenix
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>
>         Attachments: 12363-0.98.txt, 12363-1.0.txt, 12363-master.txt,
>                      12363-test.txt, 12363-v2.txt, 12363-v3.txt
>
>
> Brainstorming...
> This morning on the train (of all places) I realized a fundamental issue in
> how KEEP_DELETED_CELLS is implemented.
> The problem is knowing when it is safe to remove a delete marker: we cannot
> remove it unless all cells affected by it have been removed as well,
> otherwise the deleted cells would reappear.
> This is particularly hard for family markers, since they sort before all
> cells of a row; scanning forward through an HFile, you cannot know whether
> a family marker is still needed until at least the entire row has been
> scanned.
> My solution was to keep the TS of the oldest put in any given HFile, and
> only remove delete markers older than that TS.
> That sounds good on the face of it... But now imagine you write a version
> of ROW 1 and then never update it again. Later you write a billion other
> rows and delete them all. Since the TS of the cells in ROW 1 is older than
> all the delete markers for the other billion rows, those markers will never
> be collected, at least not in the region that hosts ROW 1, even after a
> major compaction.
> Note that, in a sense, this is what HBase is supposed to do when keeping
> deleted cells: keep them until they would be removed by some other means
> (for example TTL, or MAX_VERSIONS when new versions are inserted).
> The specific problem here is that even after all KVs affected by a delete
> marker have expired this way, the marker would still not be removed as long
> as there is just one older KV in the HStore.
> I don't see a good way out of this. In the parent issue I outlined these
> four options:
> # Only allow the new flag to be set on CFs with TTL set. MIN_VERSIONS would
> not apply to deleted rows or delete marker rows (we wouldn't know how long
> to keep family deletes in that case). MAX_VERSIONS would still be enforced
> on all row types except for family delete markers.
> # Translate family delete markers to column delete markers at (major)
> compaction time.
> # Change HFileWriterV* to keep track of the earliest put TS in a store and
> write it to the file metadata. Use that to expire delete markers that are
> older and hence cannot affect any puts in the file.
> # Have Store.java keep track of the earliest put in internalFlushCache and
> compactStore and then append it to the file metadata. That way HFileWriterV*
> would not need to know about KVs.
> And I implemented #4 (see the sketch after this message).
> I'd love to get input and ideas.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
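A minimal sketch of the idea behind option #4: track the smallest put
timestamp while a store file is written, persist it in the file metadata,
and at (major) compaction time drop any delete marker older than that
timestamp, since such a marker cannot affect any put in the file. This is
an illustration only, not the committed patch; the names EarliestPutTracker,
observe, canDropDeleteMarker, and EARLIEST_PUT_TS_KEY are hypothetical.

{code:java}
/**
 * Illustrative sketch of option #4 from HBASE-12363 (not the committed
 * patch): track the earliest put TS while writing a store file and use it
 * to decide whether a delete marker may be dropped at major compaction.
 */
public final class EarliestPutTracker {

  /** Hypothetical file-metadata key for the tracked timestamp. */
  public static final String EARLIEST_PUT_TS_KEY = "EARLIEST_PUT_TS";

  private long earliestPutTs = Long.MAX_VALUE;

  /** Called for every cell appended during flush or compaction. */
  public void observe(long cellTimestamp, boolean isPut) {
    if (isPut) {
      earliestPutTs = Math.min(earliestPutTs, cellTimestamp);
    }
  }

  /** Value to append to the store file's metadata when the writer closes. */
  public long earliestPutTs() {
    return earliestPutTs;
  }

  /**
   * A delete marker strictly older than every put in the file cannot mask
   * any put in that file, so it is safe to remove during major compaction.
   */
  public static boolean canDropDeleteMarker(long markerTs,
      long earliestPutTsInFile) {
    return markerTs < earliestPutTsInFile;
  }
}
{code}

Note how this sketch also reproduces the pathological case described above:
a single very old put in ROW 1 keeps earliestPutTs low for its file, so
canDropDeleteMarker returns false for every newer delete marker, and the
billion markers are retained indefinitely.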