[
https://issues.apache.org/jira/browse/LUCENE-8438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570159#comment-16570159
]
Dawid Weiss commented on LUCENE-8438:
-------------------------------------
I've done some work on this and have some thoughts to share, so looking for
feedback. The branch and PR diff is here, it's not merge'able as-is, but it
provides grounds for discussion.
[https://github.com/apache/lucene-solr/pull/432]
Here are some notes introducing you to this patch and things and decisions I
consider somewhat important.
*RAMDirectory and ByteBuffersDirectory (and associated classes).*
RAMDirectory and its bits and pieces (RAMFile, RAMOutputStream) are used in
many, many places in the code. In many of those places the actual use case is a
writeable (and reusable) abstraction of a buffer where vints and other data
structures can be accumulated, then flushed to another buffer (or sink). The
fact that RAMOutputStream implements full IndexOutput and not just DataOutput
is contradicting those use cases (a flag on optional CRC computation, reset
method).
While it is possible to change those classes in-place, I initially opted for
writing a "second set" of ram-based classes. This allows more progressive
transition from RAM* classes and, perhaps more importantly, provides some room
for a different structuring of relationships among those new classes (not to
mention it makes baby steps of re-running full test suite easier).
I opted to use ByteBuffers (heap-based) as the underlying storage in memory, so
all the classes are named ByteBuffers*. And so we have:
* ByteBuffersDirectory - a replacement/equivalent of RAMDirectory,
essentially.
* ByteBuffersDataOutput - a growing, paged buffer that implements DataOutput
and adds additional utility methods like current size, returning the content as
a contiguous array or a set of underlying ByteBuffers. The implementation is
paged, but does not use a fixed page size. Once the content grows past 100
"blocks" the blocks are doubled and existing content is copied to those new
blocks. This solution is nicer for heap memory than the fixed 1024-byte blocks
in RAMOutputStream. Sure, some memory copying overhead does occur, but I
haven't noticed any practical impact from this.
* ByteBuffersDataInput - a DataInput implementation on top of a collection of
ByteBuffers. There requirements for the input byte buffers are that for N
buffers, the capacity of buffers 1..N-1 is constant and power-of-two (which is
what ByteBuffersDataOutput emits) to allow efficient page/buffer seeks. The
initial buffer can also have the initial read position shifted from 0 (for easy
slicing).
* ByteBuffersIndexInput - An IndexInput implementation that reads from the
delegate ByteBuffersDataInput
* ByteBuffersIndexOutput - An IndexOutput implementation that writes to a
delegate ByteBuffersDataOutput and computes the CRC of the written data. This
is done lazily (so no special switches are required for temporary index outputs
where previously we explicitly passed crc=false).
*Why not byte[] instead of ByteBuffers?*
Mostly in hope that byte buffers provide JVM intrinsics that would be faster
than plain Java code (range checking, int or long access over the underlying
byte arrays). They also make certain things more problematic or complex
(especially to those not familiar with Buffers), but overall I think it came
out quite nicely and integrates well with filesystem buffers we use anyway.
*Why not reuse exiting ByteBufferIndexInput abstractions?*
My design decision was to avoid the exception-handler logic control currently
used in Lucene's memory mapped buffer IndexInput implementation. This is
subjective taste, really, but also driven by the fact that byte buffers here
can and will be much smaller than mapped buffer slices (so in-between buffer
region reads will happen more frequently). So I left the ByteBuffersDataInput
class in the code, although ByteBuffersDirectory provides a mapper that maps
each written file's ByteBuffer[] into an IndexInput and one of the provided
defaults can use Lucene's "native" byte buffer wrappers. Finally redundancies
can be stripped from the patch entirely (or left in, if we decide they're
useful). See notes about performance below too.
*What to do with RAMDirectory and other Ram*Stream classes?*
There is an obvious question what to do with RAMDirectory classes now. Should
this patch substitute the internals (a rename of ByteBuffers*, effectively),
should we deprecate RAMDirectory and then drop it entirely? My gut feeling is
that many people use RAMDirectory, but I'd rather deprecate it and remove it
eventually to give a clear signal that (a) it's not efficient, (b) it's going
away. I am really open to discussion here though.
*What are the performance implications?*
I ran a few quick experiments creating and reading from indexes and
ByteBuffersDirectory performs quite nicely. I'll attach a chart from AWS 32-cpu
instance for a taste. In short, the performance is comparable with FSDirectory,
especially when the output buffer is consolidated.
This said, I also replaced a number of RAMInput/OutputStream used as buffers in
internal codec classes, etc., and the implications there may have escaped my
attention. Please report if you see anything odd.
*What's next?*
This depends mostly on what we decide to do with RAMDirectory. It's still used
heavily in tests (and mocks), so there's still some work to be done there, but
I wanted to circulate this for feedback so that I don't spend my time on
something that won't be accepted.
> RAMDirectory speed improvements and cleanup
> -------------------------------------------
>
> Key: LUCENE-8438
> URL: https://issues.apache.org/jira/browse/LUCENE-8438
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Attachments: capture-4.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> RAMDirectory screams for a cleanup. It is used and abused in many places and
> even if we discourage its use in favor of native (mmapped) buffers, there
> seem to be benefits of keeping RAMDirectory available (quick throw-away
> indexes without the need to setup external tmpfs, for example).
> Currently RAMDirectory performs very poorly under concurrent loads. The
> implementation is also open for all sorts of abuses – the streams can be
> reset and are used all around the place as temporary buffers, even without
> the presence of RAMDirectory itself. This complicates the implementation and
> is pretty confusing.
> An example of how dramatically slow RAMDirectory is under concurrent load,
> consider this PoC pseudo-benchmark. It creates a single monolithic segment
> with 500K very short documents (single field, with norms). The index is ~60MB
> once created. We then run semi-complex Boolean queries on top of that index
> from N concurrent threads. The attached capture-4 shows the result (queries
> per second over 5-second spans) for a varying number of concurrent threads on
> an AWS machine with 32 CPUs available (of which it seems 16 seem to be real,
> 16 hyper-threaded). That red line at the bottom (which drops compared to a
> single-threaded performance) is the current RAMDirectory. RAMDirectory2 is an
> alternative implementation I wrote that uses ByteBuffers. Yes, it's slower
> than the native mmapped implementation, but a *lot* faster then the current
> RAMDirectory (and more GC-friendly because it uses dynamic progressive block
> scaling internally).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]