clintropolis opened a new pull request #12132:
URL: https://github.com/apache/druid/pull/12132


   ### Description
   
   This PR fixes a bug in `FileSmoosher` when using `addWithSmooshedWriter` in "delegated" mode, which writes to temporary files until the "parent" `SmooshedWriter` is closed. The bug occurs when the file names look like valid paths, e.g. 'foo/bar'.
   
   Prior to the changes in this PR, the added test case fails with an error like:
   ```
   java.nio.file.NoSuchFileException: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/junit296384917261983992/base/foo/bar/0
   ```
   The fix generates temp file names from a counter, and keeps a map from counter value to actual file name that is consulted when merging the delegated files back into the parent file.
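   The counter-and-map scheme can be sketched as follows. This is a minimal, hypothetical illustration; `DelegateTracker` and its methods are invented for this example and are not the actual `FileSmoosher` code:

   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical sketch: delegated writers get a flat numeric temp-file
   // name, and a map remembers the real (possibly path-like) name so it
   // can be restored when merging back into the parent file.
   class DelegateTracker {
     private final AtomicInteger counter = new AtomicInteger(0);
     private final Map<Integer, String> realNames = new HashMap<>();

     // Returns a safe temp name like "0" for a requested name like "foo/bar"
     String registerDelegate(String requestedName) {
       int id = counter.getAndIncrement();
       realNames.put(id, requestedName);
       return String.valueOf(id);
     }

     // Looks up the original name when merging delegated files back in
     String realNameFor(String tempName) {
       return realNames.get(Integer.parseInt(tempName));
     }
   }
   ```

   The point is that the temp name never contains path separators, so a requested name like 'foo/bar' can no longer produce a nonexistent directory path under the temp directory.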
   
   There are also a couple of other improvements in preparation for some future changes I would like to make. The first: when _not_ delegating, the writer returned by `FileSmoosher.addWithSmooshedWriter` now checks that it is still open inside of `close`, making `close` a no-op if the writer is already closed (this allows column serializers that have finished writing out their own content to add additional files without entering delegated mode).
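   The idempotent-close behavior can be sketched like this. It is a hypothetical, self-contained illustration of the pattern, not the actual `SmooshedWriter` implementation:

   ```java
   import java.io.Closeable;
   import java.io.IOException;

   // Hypothetical sketch: close() performs cleanup exactly once and is a
   // no-op on any subsequent call, as the PR describes for the
   // non-delegating writer.
   class IdempotentWriter implements Closeable {
     private boolean open = true;
     private int closeCount = 0;

     @Override
     public void close() throws IOException {
       if (!open) {
         return; // already closed: do nothing
       }
       open = false;
       closeCount++; // stand-in for the real cleanup work
     }

     int timesClosed() {
       return closeCount;
     }
   }
   ```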
   
   Additionally, `SegmentWriteOutMedium` has a new method that allows callers who are fully in control of the lifecycle of their `WriteOutBytes` to free them up while using a shared medium.
   
   ```
     SegmentWriteOutMedium makeChildWriteOutMedium() throws IOException;
   ```
   
   The contract of this method is that the child medium is registered to the closer of the parent, so it need not be explicitly closed, but it _may be_ explicitly closed to free the backing resources of any `WriteOutBytes` created by it. When using a shared medium, as is done when serializing columns during segment creation, these resources typically cannot be released until the medium itself is closed.
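   That contract could be sketched roughly as follows, using hypothetical `Medium`/`makeChildMedium` names rather than the real Druid API:

   ```java
   import java.io.Closeable;
   import java.io.IOException;
   import java.util.ArrayDeque;
   import java.util.Deque;

   // Hypothetical sketch of the child-medium contract: a child registers
   // with the parent's closer, so it is closed automatically when the
   // parent closes, but callers may close it earlier to release resources.
   // Closing twice is safe.
   class Medium implements Closeable {
     private final Deque<Closeable> closer = new ArrayDeque<>();
     private boolean closed = false;

     Medium makeChildMedium() {
       Medium child = new Medium();
       closer.push(child); // registered to the parent's closer
       return child;
     }

     boolean isClosed() {
       return closed;
     }

     @Override
     public void close() throws IOException {
       if (closed) {
         return; // second close is a no-op
       }
       closed = true;
       while (!closer.isEmpty()) {
         closer.pop().close();
       }
     }
   }
   ```

   A caller that owns the full lifecycle of its `WriteOutBytes` can create a child, use it, and close it early, freeing resources long before the shared parent medium is closed.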
   
   The kind of use case this method targets is present in `IntermediateColumnarLongsSerializer`, which does not actually use the `SegmentWriteOutMedium` until `getSerializedSize`/`writeTo` is called, at which point the contents are immediately written to the channel. However, there is little benefit to switching it to this method, because only a very small number of additional files are created when writing out each column, so there is not much to gain by releasing them early. Something like #10628, however, which creates a lot of files and immediately dumps them into the channel, could greatly reduce the number of open files on the system if there were, for example, a very large number of columns.
   
   Finally, `OnHeapMemorySegmentWriteOutMedium` previously never marked its `WriteOutBytes` as 'closed', meaning that callers could keep writing to one of these buffers indefinitely, even after the medium was closed. This seemed to break the contract of `SegmentWriteOutMedium` that `WriteOutBytes` must not be used after close: the underlying `ByteBufferWriteOutBytes` has methods to check that the medium is open, but the heap implementation had no way to mark it as not open. I have changed this so that all implementations of `SegmentWriteOutMedium` behave identically, by pushing the `free` method down to `ByteBufferWriteOutBytes` (and overriding it in the direct implementation).
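   The shape of that refactor can be sketched as follows, with hypothetical class names standing in for `ByteBufferWriteOutBytes` and the direct-memory implementation:

   ```java
   // Hypothetical sketch: the base class tracks open state and frees heap
   // buffers in free(); the direct-memory subclass overrides free() to also
   // release its native buffers. Both now end up 'closed' the same way.
   class HeapBytes {
     private boolean open = true;

     boolean isOpen() {
       return open;
     }

     void free() {
       open = false; // mark closed; stand-in for dropping heap buffers
     }
   }

   class DirectBytes extends HeapBytes {
     boolean nativeFreed = false;

     @Override
     void free() {
       super.free();
       nativeFreed = true; // stand-in for releasing direct ByteBuffers
     }
   }
   ```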
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


