keith-turner opened a new issue, #5387:
URL: https://github.com/apache/accumulo/issues/5387

   **Is your feature request related to a problem? Please describe.**
   
   Many files are referenced by only a single tablet, and compaction could 
delete these files directly if it knew that.  Instead, a delete marker is always 
added for each file, and the GC has to process that marker.
   
   **Describe the solution you'd like**
   
   Each file in a tablet's metadata could have a shared marker that tracks 
whether more than one tablet references the file.  
   
    * When a compaction creates a new file, it sets shared=false.
    * When a tablet splits, it sets shared=true on any files that go to 
multiple tablets.
    * When a table is cloned, it sets shared=true in the source table on any 
files it references in the new table.
    * Bulk import could mark files as shared or not, depending on whether the 
files go to multiple tablets.
    * The fate operation that commits a compaction could either delete the 
input files or write delete markers, depending on whether the files were 
shared.
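
The rules above can be sketched as a small in-memory model. This is only an illustration, not Accumulo code: `Tablet`, `compact`, `split`, `clone`, and `bulk_import` are hypothetical names, a tablet is reduced to a dict from file path to shared flag, the toy split sends every file to both children, and giving the cloned tablet shared=true as well is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Tablet:
    """Toy tablet: maps a file path to its hypothetical shared flag."""
    files: dict = field(default_factory=dict)


def compact(tablet, inputs, output, fs, delete_markers):
    """Commit a compaction: replace the inputs with one new, unshared file."""
    for path in inputs:
        shared = tablet.files.pop(path)
        if shared:
            # Another tablet may still reference the file, so leave it to GC.
            delete_markers.add(path)
        else:
            # Only this tablet referenced the file, so delete it directly.
            fs.discard(path)
    fs.add(output)
    tablet.files[output] = False  # compaction output starts unshared


def split(tablet):
    """Split a tablet; in this toy model every file goes to both children."""
    for path in tablet.files:
        tablet.files[path] = True
    return Tablet(dict(tablet.files)), Tablet(dict(tablet.files))


def clone(source):
    """Clone a tablet: mark the source's files shared and copy the references."""
    for path in source.files:
        source.files[path] = True
    return Tablet(dict(source.files))


def bulk_import(tablets, path, fs):
    """Import one file; it is shared only if it lands in several tablets."""
    fs.add(path)
    shared = len(tablets) > 1
    for tablet in tablets:
        tablet.files[path] = shared


# Walk through: split, then compact one child twice.
fs = {"A.rf", "B.rf"}
markers = set()
parent = Tablet({"A.rf": False, "B.rf": False})
left, right = split(parent)

# Inputs are shared with `right`, so only delete markers are written.
compact(left, ["A.rf", "B.rf"], "C.rf", fs, markers)

# C.rf was created by compaction and never shared, so it is deleted directly.
compact(left, ["C.rf"], "D.rf", fs, markers)
```

In this walk-through only the first compaction leaves work for the GC; the second removes its input immediately because the file was never shared.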
   
   For this feature to be possible all of the above operations must be able to 
be done safely using conditional mutations. 
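
To illustrate why conditional mutations matter here, the following sketch replays one possible race: a compaction commit reads shared=false, and a clone marks the file shared before the commit lands. `MetadataRow` and `conditional_update` are hypothetical stand-ins for a conditional write on one tablet's metadata row, not the Accumulo API.

```python
class MetadataRow:
    """Toy stand-in for one tablet's metadata row (not the Accumulo API)."""

    def __init__(self, files):
        self.files = dict(files)  # file path -> shared flag

    def conditional_update(self, expected, update):
        """Apply `update` only if every expected (path, shared) pair still holds."""
        for path, shared in expected.items():
            if path not in self.files or self.files[path] != shared:
                return False  # the row changed since it was read; reject
        update(self.files)
        return True


def replace_input(files):
    """The compaction commit: drop the input file and add the new output."""
    del files["A.rf"]
    files["C.rf"] = False


row = MetadataRow({"A.rf": False})

# The compaction commit reads the flag first...
snapshot = {"A.rf": row.files["A.rf"]}

# ...then a clone marks the file shared before the commit lands.
clone_ok = row.conditional_update({"A.rf": False},
                                  lambda files: files.update({"A.rf": True}))

# The stale commit is rejected, so the file is not deleted by mistake.
first_try = row.conditional_update(snapshot, replace_input)

# Re-reading shows shared=True; the retry succeeds, and the commit would
# write a delete marker instead of deleting the file.
snapshot = {"A.rf": row.files["A.rf"]}
second_try = row.conditional_update(snapshot, replace_input)
```

Without the condition on the shared flag, the first attempt would have deleted a file that the clone still references.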
   
   The shared marker could be added to the per file metadata that is already 
stored in the tablet. 
   
   **Describe alternatives you've considered**
   
   #2729 may be an alternative if HDFS supports hard links.
   
   **Additional context**
   
   This feature would reduce the work done by the Accumulo GC process and avoid 
storing delete markers.  The tradeoff is that the new shared marker would have 
to be maintained, and compaction commit would now make calls to the namenode to 
delete files in some cases.
   
   

