[GitHub] [druid] himanshug opened a new issue #10296: Remove eager loading and on-heap caching of "internalFiles" map in SmooshedFileMapper

GitBox Mon, 17 Aug 2020 15:59:23 -0700


himanshug opened a new issue #10296:
URL: https://github.com/apache/druid/issues/10296



   This is a not-so-straightforward follow-up to 
https://github.com/apache/druid/pull/10295
   
   ### Motivation
   
   For each loaded segment, there is one instance of `SmooshedFileMapper`, which 
contains a `Map<String, Metadata>` where each entry typically corresponds to one 
column (in certain cases there can be more than one entry per column, but that is 
not important for the discussion here). With many segments loaded and many 
columns per segment, these maps waste heap, and there is no real reason to keep 
them on the heap for the full lifetime of the Druid process.
   
   ### Proposed changes
   
   Remove `private final Map<String, Metadata> internalFiles` from 
`SmooshedFileMapper` and use the following instead:
   
   ```
     private final File segmentBaseDir;

     // Per-thread cache holding only the most recently loaded map.
     private static final ThreadLocal<InternalFilesObj> internalFileObjRef = new ThreadLocal<>();
   
     private static class InternalFilesObj
     {
       private final File segmentBaseDir;
       private final Map<String, Metadata> internalFiles;
   
       public InternalFilesObj(
           File segmentBaseDir,
           Map<String, Metadata> internalFiles
       )
       {
         this.segmentBaseDir = segmentBaseDir;
         this.internalFiles = internalFiles;
       }
     }
   
     private Map<String, Metadata> getInternalFiles()
     {
       InternalFilesObj obj = internalFileObjRef.get();
       if (obj != null && obj.segmentBaseDir.equals(segmentBaseDir)) {
         // Cache hit: this thread last loaded the map for this same segment.
         return obj.internalFiles;
       } else {
         // Cache miss: reload the map (loadInternalFilesMap() is not shown here)
         // and replace this thread's cached entry.
         Map<String, Metadata> internalFilesMap = loadInternalFilesMap();
         obj = new InternalFilesObj(segmentBaseDir, internalFilesMap);
         internalFileObjRef.set(obj);
         return obj.internalFiles;
       }
     }
   ```
   
   This results in at most one map cached per thread, rather than one per 
loaded segment on the Druid node.
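
   To make the caching behaviour concrete, here is a minimal, self-contained 
sketch of the same single-entry, per-thread cache. It is not Druid code: the 
class `PerThreadMapCacheSketch`, the `LOADS` counter, the `String` segment key, 
and the `Map<String, Long>` payload are all hypothetical stand-ins chosen so the 
example runs on its own.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for SmooshedFileMapper; all names are illustrative.
class PerThreadMapCacheSketch
{
  // Counts simulated metadata-file loads so hits and misses are observable.
  static final AtomicInteger LOADS = new AtomicInteger();

  // Each thread remembers only the map it loaded most recently.
  private static final ThreadLocal<Entry> CACHE = new ThreadLocal<>();

  private static final class Entry
  {
    final String segmentBaseDir;
    final Map<String, Long> internalFiles;

    Entry(String segmentBaseDir, Map<String, Long> internalFiles)
    {
      this.segmentBaseDir = segmentBaseDir;
      this.internalFiles = internalFiles;
    }
  }

  private final String segmentBaseDir;

  PerThreadMapCacheSketch(String segmentBaseDir)
  {
    this.segmentBaseDir = segmentBaseDir;
  }

  Map<String, Long> getInternalFiles()
  {
    Entry entry = CACHE.get();
    if (entry != null && entry.segmentBaseDir.equals(segmentBaseDir)) {
      return entry.internalFiles; // hit: same segment as this thread's last load
    }
    Map<String, Long> loaded = loadInternalFilesMap();
    CACHE.set(new Entry(segmentBaseDir, loaded)); // evicts the previous entry
    return loaded;
  }

  // Simulates parsing the metadata file; a real implementation would read disk.
  private Map<String, Long> loadInternalFilesMap()
  {
    LOADS.incrementAndGet();
    Map<String, Long> map = new HashMap<>();
    map.put("__time", 0L);
    map.put("dim1", 128L);
    return Collections.unmodifiableMap(map);
  }

  public static void main(String[] args)
  {
    PerThreadMapCacheSketch a = new PerThreadMapCacheSketch("/segments/a");
    PerThreadMapCacheSketch b = new PerThreadMapCacheSketch("/segments/b");
    a.getInternalFiles(); // miss
    a.getInternalFiles(); // hit
    b.getInternalFiles(); // miss, replaces a's entry for this thread
    a.getInternalFiles(); // miss again
    System.out.println("loads = " + LOADS.get());
  }
}
```

In this sketch, the four calls in `main` trigger three loads. A thread that 
alternates between segments reloads the map on every switch, so the approach 
trades repeated parsing for a bounded heap footprint.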
   
   ### Rationale
   
   Another approach considered was to store the metadata as a sorted set of 
column names and an array of `Metadata` objects in the metadata file (both 
written using `GenericIndexedWriter`). With that, we could read the information 
directly from the file without ever building an on-heap map object. Lookup 
would be O(log N), but everything would be lazy and entirely off-heap.
   However, the current textual format of the metadata.drd file is helpful 
while debugging (it is useful to be able to run `cat metadata.drd`), and this 
alternative would make the format binary. It is also a little more complex 
than the changes proposed above.
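
   For illustration, the O(log N) lookup in that alternative amounts to a 
binary search over the sorted names followed by an index into a parallel array. 
The sketch below uses plain on-heap arrays as a stand-in for the off-heap 
`GenericIndexed` structures, and the class name, the `long` offsets, and the `-1` 
sentinel are assumptions made for this example only.

```java
import java.util.Arrays;

// Hypothetical sketch: sorted names plus a parallel array of values.
class SortedColumnLookupSketch
{
  private final String[] sortedNames; // must be sorted in natural String order
  private final long[] startOffsets;  // startOffsets[i] belongs to sortedNames[i]

  SortedColumnLookupSketch(String[] sortedNames, long[] startOffsets)
  {
    if (sortedNames.length != startOffsets.length) {
      throw new IllegalArgumentException("arrays must have the same length");
    }
    this.sortedNames = sortedNames;
    this.startOffsets = startOffsets;
  }

  // Returns the offset for the given name, or -1 if the name is absent.
  long findOffset(String name)
  {
    int idx = Arrays.binarySearch(sortedNames, name);
    return idx >= 0 ? startOffsets[idx] : -1L;
  }

  public static void main(String[] args)
  {
    SortedColumnLookupSketch lookup = new SortedColumnLookupSketch(
        new String[]{"__time", "dim1", "metric1"},
        new long[]{0L, 128L, 512L}
    );
    System.out.println(lookup.findOffset("dim1"));
    System.out.println(lookup.findOffset("missing"));
  }
}
```

No map is built at any point; each lookup touches only the entries the binary 
search visits.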
   
   ### Operational impact
   
   None
   
   ### Test plan (optional)
   
   Existing tests should cover the changes introduced.
   
   


