ctubbsii opened a new issue, #2729:
URL: https://github.com/apache/accumulo/issues/2729

   **Is your feature request related to a problem? Please describe.**
   Compactions, splits, merges, and table clones are tricky and complicated because we keep multiple references to the same files, making it difficult to know when it is safe to delete a file. We track files in use, and when we're done with them, we only mark them as candidates for deletion. We rely on a separate garbage collection service to verify that a file is no longer in use before it can safely be deleted. Even then, the garbage collection process can be slow and risky, and if it crashes, it may leave behind unreferenced files.
   
   **Describe the solution you'd like**
   HDFS has a kind of [HardLink](https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/fs/HardLink.html) feature that we may be able to leverage to avoid garbage collection entirely. I have not tested how it works in practice, but in theory, whenever we split a tablet, clone a table, or even bulk import files, we could create uniquely named files as hard links, rather than simply copying the same reference. This would probably increase the memory footprint of the Hadoop NameNode, but it would enable a dramatic simplification of Accumulo, so it would probably be worth it. When we are done with a file, we could delete it immediately, because we wouldn't have to worry about any other references. The actual blocks would still be referenced by the other hard links and would not be deleted; we can let Hadoop reclaim the blocks when the last hard link is removed.
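   To make the idea concrete, here is a minimal sketch of the proposed clone flow. It uses `java.nio` hard links on a local filesystem as a stand-in for whatever the Hadoop `HardLink` API would provide on HDFS; `cloneFile` and the naming scheme are hypothetical, not existing Accumulo code.

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.UUID;

   public class HardLinkCloneSketch {

       // Instead of copying the same file reference into the clone's metadata,
       // create a uniquely named hard link that shares the original's blocks.
       // (Hypothetical helper for illustration only.)
       static Path cloneFile(Path original, Path cloneDir) throws IOException {
           Path link = cloneDir.resolve("C" + UUID.randomUUID() + ".rf");
           Files.createLink(link, original);
           return link;
       }

       public static void main(String[] args) throws IOException {
           Path dir = Files.createTempDirectory("clone-demo");
           Path original = Files.writeString(dir.resolve("F0000001.rf"), "rfile-data");
           Path link = cloneFile(original, dir);

           // Both names see the same content; neither "owns" the data.
           System.out.println(Files.readString(link)); // prints rfile-data
       }
   }
   ```

   Because every clone gets its own globally unique name, deleting one reference never invalidates another.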
   
   **Describe alternatives you've considered**
   Keep doing file-based garbage collection and hoping for the best.
   
   **Additional context**
   Doing this could simplify the implementation of "no-chop merges" described 
in #1327 because each file would reference only a single range in its metadata.
   
   To implement this, we may need some kind of global locking per file, to 
ensure a file can't be deleted while hard links are being created.
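   A sketch of what that per-file locking could look like, assuming a read/write-lock protocol: many link creations may proceed concurrently (read lock), while a delete (write lock) excludes them, so a file can never vanish while a hard link to it is mid-creation. In practice this would likely need to be a distributed lock (e.g. in ZooKeeper) rather than the in-process one shown here; all names are hypothetical.

   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   public class FileLinkLocks {
       // One lock per file path; created lazily on first use.
       private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
           new ConcurrentHashMap<>();

       private ReentrantReadWriteLock lockFor(String file) {
           return locks.computeIfAbsent(file, f -> new ReentrantReadWriteLock());
       }

       public void createLink(String file, Runnable doLink) {
           var lock = lockFor(file).readLock();
           lock.lock();
           try {
               doLink.run(); // safe: no delete can run concurrently
           } finally {
               lock.unlock();
           }
       }

       public void delete(String file, Runnable doDelete) {
           var lock = lockFor(file).writeLock();
           lock.lock();
           try {
               doDelete.run(); // safe: no link creation is in flight
           } finally {
               lock.unlock();
           }
       }
   }
   ```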
   
   We'd need to test to make sure that the original file can still be deleted (i.e., that it's treated like any other hard link), and that we can make hard links of hard links, etc.
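   Those are the semantics we'd need to verify on HDFS; here is what the check looks like on a local filesystem with `java.nio`, where the original is just another name for the inode, links of links are fine, and the data persists until the last name is removed:

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class LinkSemanticsDemo {
       public static void main(String[] args) throws IOException {
           Path dir = Files.createTempDirectory("link-demo");
           Path a = Files.writeString(dir.resolve("a.rf"), "data");
           Path b = dir.resolve("b.rf");
           Path c = dir.resolve("c.rf");

           Files.createLink(b, a); // link to the original
           Files.createLink(c, b); // link to a link: same data again
           Files.delete(a);        // the original is deletable like any link
           Files.delete(b);
           System.out.println(Files.readString(c)); // data survives via c
       }
   }
   ```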
   
   We might still want a garbage collection service to lazily clean up files, but if we could rely on file names being globally unique, we'd no longer have to do complicated reference checking for candidates.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
