Re: My GSOC proposal

Michael McCandless Wed, 06 Apr 2011 11:12:07 -0700

That test code looks good -- you really would have seen awful
performance had you used O_DIRECT, since you read byte by byte.


A more realistic test is to read a whole buffer (eg 4 KB is what
Lucene now uses during merging, but we'd probably up this to like 1 MB
when using O_DIRECT).
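
To make the comparison concrete, here is a minimal C sketch of reading
a file with a caller-chosen buffer size. The function name read_all is
hypothetical (it is not taken from the pastebin test or from Lucene),
and it uses plain buffered read(2), not O_DIRECT.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read the whole file at `path` with read(2) calls of `buf_size` bytes.
 * Returns the total number of bytes read, or -1 on error.
 * read_all is a hypothetical helper name used only for illustration. */
long long read_all(const char *path, size_t buf_size)
{
    assert(buf_size > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(buf_size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long total = 0;
    ssize_t n;
    /* Each iteration is one system call; a larger buffer means fewer calls. */
    while ((n = read(fd, buf, buf_size)) > 0)
        total += n;

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling read_all(path, 1) reproduces the byte-by-byte pattern, while
read_all(path, 4096) and read_all(path, 1 << 20) correspond to the 4 KB
and 1 MB buffers mentioned above; timing the calls is left to the caller.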

Linus does hate O_DIRECT (see http://kerneltrap.org/node/7563), and
for good reason: its existence means projects like ours can use it to
"work around" limitations in the Linux IO APIs that control the buffer
cache when, otherwise, we might conceivably make patches to fix Linux
correctly.  It's an escape hatch, and we all use the escape hatch
instead of trying to fix Linux for real...

For example the NOREUSE flag is a no-op now in Linux, which is a
shame, because that's precisely the flag we'd want to use for merging
(along with SEQUENTIAL).  Had that flag been implemented well, it'd
give better results than our workaround using O_DIRECT.
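
For reference, a minimal sketch of how the two hints might be applied
from C via posix_fadvise. The helper name open_for_merge_read is
hypothetical. On a kernel where NOREUSE is a no-op, as described above,
the second call is accepted but changes nothing.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* request the posix_fadvise declaration */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `path` read-only and hint that it will be read once, front to back.
 * Returns the file descriptor, or -1 with errno set.
 * open_for_merge_read is a hypothetical name used only for illustration. */
int open_for_merge_read(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* posix_fadvise reports failure through its return value, not errno.
     * Offset 0 with length 0 covers the whole file. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err == 0)
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    if (err != 0) {
        close(fd);
        errno = err;
        return -1;
    }
    return fd;
}
```

The returned descriptor is then read as usual; the hints only affect how
the kernel treats the pages it caches for that file.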

Anyway, given how things are, until we can get more control (waaaay
up in Javaland) over the buffer cache, O_DIRECT (via native directory
impl through JNI) is our only real option, today.

More details here:
http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html

Note that other OSs likely do a better job and actually implement
NOREUSE, and similar APIs, so the generic Unix/WindowsNativeDirectory
would simply use NOREUSE on these platforms for I/O during segment
merging.
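
A rough sketch of what that per-platform choice could look like on the
C side of such a directory. Everything here (the io_context and
io_strategy enums, choose_strategy, open_flags_for) is a hypothetical
illustration, not the LUCENE-2795 design. It assumes O_DIRECT is
preferred on Linux and NOREUSE is used wherever else the flag is defined.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* glibc hides O_DIRECT unless this is defined */
#endif
#include <assert.h>
#include <fcntl.h>

/* Hypothetical C-side mirror of the details an IOContext might carry. */
enum io_context { IO_CTX_READ, IO_CTX_FLUSH, IO_CTX_MERGE };

/* Hypothetical set of ways the native code could treat the buffer cache. */
enum io_strategy { IO_BUFFERED, IO_FADVISE_NOREUSE, IO_DIRECT };

/* Pick a strategy at compile time per platform, at run time per context. */
enum io_strategy choose_strategy(enum io_context ctx)
{
    switch (ctx) {
    case IO_CTX_MERGE:
#if defined(__linux__) && defined(O_DIRECT)
        return IO_DIRECT;            /* assumed Linux workaround */
#elif defined(POSIX_FADV_NOREUSE)
        return IO_FADVISE_NOREUSE;   /* platforms assumed to honor the hint */
#else
        return IO_BUFFERED;          /* nothing better available */
#endif
    case IO_CTX_READ:
    case IO_CTX_FLUSH:
        return IO_BUFFERED;
    }
    assert(!"unknown io_context");
    return IO_BUFFERED;
}

/* Translate a strategy into open(2) flags for a read-only file. */
int open_flags_for(enum io_strategy s)
{
    int flags = O_RDONLY;
#ifdef O_DIRECT
    if (s == IO_DIRECT)
        flags |= O_DIRECT;
#else
    (void)s;
#endif
    return flags;
}
```

Only merges take the special path in this sketch; searches and flushes
keep going through the buffer cache as they do today.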

Mike

http://blog.mikemccandless.com

On Wed, Apr 6, 2011 at 11:56 AM, Varun Thacker
<varunthacker1...@gmail.com> wrote:
> Hi. I wrote some sample code to test the speed difference between SEQUENTIAL
> and O_DIRECT (I used the madvise flag MADV_DONTNEED) reads.
>
> This is the link to the code: http://pastebin.com/8QywKGyS
>
> There was a speed difference when I switched between the two flags. I
> have not used the O_DIRECT flag because Linus has criticized it.
>
> Is this what the flags are intended to be used for? This is just sample
> code with a test file.
>
> On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer
> <simon.willna...@googlemail.com> wrote:
>> Hey Varun,
>> On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>> Hi Varun,
>>>
>>> Those two issues would make a great GSoC!  Comments below...
>> +1
>>>
>>> On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
>>> <varunthacker1...@gmail.com> wrote:
>>>
>>>> I would like to combine two tasks as part of my project,
>>>> namely "Directory createOutput and openInput should take an IOContext"
>>>> (LUCENE-2793), complemented by "Generalize DirectIOLinuxDir to
>>>> UnixDir" (LUCENE-2795).
>>>>
>>>> The first part of the project is aimed at significantly reducing the
>>>> time taken to search during indexing by adding an IOContext, which
>>>> would store the buffer size, options to bypass the OS’s buffer cache
>>>> (which is what causes the slowdown in search), and other hints. Once
>>>> that is complete, I would move on to LUCENE-2795 and generalize the
>>>> Directory implementation to make a UnixDirectory.
>>>
>>> So, the first part (LUCENE-2793) should cause no change at all to
>>> performance, functionality, etc., because it's "merely" installing the
>>> plumbing (IOContext threaded throughout the low-level store APIs in
>>> Lucene) so that higher levels can send important details down to the
>>> Directory.  We'd fix IndexWriter/IndexReader to fill out this
>>> IOContext with the details (merging, flushing, new reader, etc.).
>>>
>>> There's some fun/freedom here in figuring out just what details should
>>> be included in IOContext... (eg: is it low level "set buffer size to 4
>>> KB" or is it high level "I am opening a new near-real-time reader").
>>>
>>> This first step is a rote cutover, just changing APIs but in no way
>>> taking advantage of the new APIs.
>>>
>>> The 2nd step (LUCENE-2795) would then take advantage of this plumbing,
>>> by creating a UnixDir impl that, using JNI (C code), passes advanced
>>> flags when opening files, based on the incoming IOContext.
>>>
>>> The goal is a single UnixDir that has ifdefs so that it's usable
>>> across multiple Unices, and eg would use direct IO if the context is
>>> merging.  If we are ambitious we could rope Windows into the mix, too,
>>> and then this would be NativeDir...
>>>
>>> We can measure success by validating that a big merge while searching
>>> does not hurt search performance.  (I.e., we should be able to reproduce
>>> the results from
>>> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html).
>>
>> Thanks for the summary, Mike!
>>>
>>>> I have spoken to Michael McCandless and Simon Willnauer about
>>>> undertaking these tasks. Michael McCandless has agreed to mentor me.
>>>> I would love to be able to contribute and learn from the Apache Lucene
>>>> community this summer. Also I would love suggestions on how to make my
>>>> application proposal stronger.
>>>
>>> I think either Simon or I can be the "official" mentor, and then the
>>> other one of us (and other Lucene committers) will support/chime
>>> in...
>>
>> I will take the official responsibility here once we are there!
>> simon
>>>
>>> This is an important change for Lucene!
>>>
>>> Mike
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>
>
>
>
> --
>
>
> Regards,
> Varun Thacker
> http://varunthacker.wordpress.com
>
>
>
>