Re: how do lucene read large index files?

Michael McCandless Tue, 29 Nov 2016 14:49:54 -0800

It's OK to use NIOFSDirectory for indexing only, in the sense that nothing will break.


But MMapDirectory already uses normal IO for writing
(java.io.FileOutputStream), and indexing does sometimes need to
read (for merging segments), though that's largely sequential reading,
so perhaps NIOFSDirectory won't be much slower.
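
To make the sequential-versus-random point concrete, here is a minimal sketch in plain java.nio (not Lucene code). The class HeapBufferedInts, its refills counter and the countRefills helper are hypothetical names made up for illustration; the 16 KiB buffer size is the figure quoted further down in this thread. It counts how often an on-heap buffer has to be refilled from the file, where each refill costs at least one read call into the kernel.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for a heap-buffered reader; NOT Lucene's BufferedIndexInput.
final class HeapBufferedInts implements AutoCloseable {
    static final int BUFFER_SIZE = 16 * 1024; // buffer size quoted in this thread

    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE); // lives on the Java heap
    private long bufferStart = 0; // file offset of the first buffered byte
    private int bufferLength = 0; // number of valid bytes currently in the buffer
    long refills = 0;             // each refill costs at least one read call into the kernel

    HeapBufferedInts(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    // Reads the 4-byte int at an absolute file position, refilling the buffer on a miss.
    int readInt(long pos) throws IOException {
        if (pos < bufferStart || pos + Integer.BYTES > bufferStart + bufferLength) {
            refill(pos);
        }
        return buffer.getInt((int) (pos - bufferStart));
    }

    private void refill(long pos) throws IOException {
        buffer.clear();
        int total = 0;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, pos + total); // copies from the OS cache into the heap buffer
            if (n < 0) break; // end of file
            total += n;
        }
        if (total < Integer.BYTES) throw new EOFException("read past end of file at " + pos);
        bufferStart = pos;
        bufferLength = total;
        refills++;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    // Writes `count` ints (value i at byte offset 4*i) to a temp file, reads every
    // `stride`-th one back, and returns how many buffer refills that took.
    static long countRefills(int count, int stride) {
        try {
            ByteBuffer data = ByteBuffer.allocate(count * Integer.BYTES);
            for (int i = 0; i < count; i++) data.putInt(i);
            Path file = Files.createTempFile("ints", ".bin");
            try {
                Files.write(file, data.array());
                try (HeapBufferedInts in = new HeapBufferedInts(file)) {
                    for (int i = 0; i < count; i += stride) {
                        if (in.readInt((long) i * Integer.BYTES) != i) {
                            throw new IllegalStateException("unexpected value at index " + i);
                        }
                    }
                    return in.refills;
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int count = 1 << 18; // 262,144 ints = 1 MiB of data
        System.out.println("sequential refills: " + countRefills(count, 1));
        System.out.println("scattered refills:  " + countRefills(count, 1 << 14));
    }
}
```

Under these assumptions a sequential pass over 262,144 ints needs 64 refills, while 16 reads spaced 64 KiB apart need 16 refills, one per read. That is why merging is likely to suffer far less from the buffered approach than scattered access does.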

Why not use MMapDirectory for both indexing and searching?
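
A minimal sketch of the memory-mapped approach, written in plain java.nio rather than taken from Lucene's own code; MappedInts and sumEvery are hypothetical names used only for illustration. After the single map call, each getInt is an ordinary memory access into the mapping, and the OS pages data in as it is touched rather than reading the whole file up front.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of memory-mapped reads; NOT Lucene's MMapDirectory code.
final class MappedInts {
    // Writes `count` ints (value i at byte offset 4*i) to a temp file, maps the file
    // read-only, and returns the sum of every `stride`-th value.
    static long sumEvery(int count, int stride) {
        try {
            ByteBuffer data = ByteBuffer.allocate(count * Integer.BYTES);
            for (int i = 0; i < count; i++) data.putInt(i);
            Path file = Files.createTempFile("ints", ".bin");
            file.toFile().deleteOnExit(); // a mapped file cannot always be deleted right away
            Files.write(file, data.array());
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // One call sets up the mapping; no heap buffer and no copy into the heap.
                MappedByteBuffer mapped =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                long sum = 0;
                for (int i = 0; i < count; i += stride) {
                    sum += mapped.getInt(i * Integer.BYTES); // plain memory access, no read call
                }
                return sum;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumEvery(1 << 18, 1 << 14));
    }
}
```

Scattered reads cost the same code path here as sequential ones; what differs is only whether the touched pages are already in the file system cache. In Lucene the corresponding choice is which Directory implementation you construct, MMapDirectory or NIOFSDirectory.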
Mike McCandless

http://blog.mikemccandless.com


On Mon, Nov 28, 2016 at 7:20 AM, Kumaran Ramasubramanian
<[email protected]> wrote:
> Thanks a lot Uwe!!! Do we get any benefit on using MMapDirectory over
> NIOFSDir during indexing? During merging? Is it ok to change to
> MMapDirectory during search alone?
>
> --
> Kumaran R
>
>
> On Nov 24, 2016 11:27 PM, "Erick Erickson" <[email protected]> wrote:
>>
>> Thanks Uwe!
>>
>>
>>
>>
>> On Thu, Nov 24, 2016 at 9:41 AM, Uwe Schindler <[email protected]> wrote:
>> > Hi Kumaran, hi Erick,
>> >
>> >> Not really, as I don't know that code well, Uwe and company
>> >> are the masters of that realm ;)....
>> >>
>> >> Sorry I can't be more help there....
>> >
>> > I can help!
>> >
>> >> On Thu, Nov 24, 2016 at 7:29 AM, Kumaran Ramasubramanian
>> >> <[email protected]> wrote:
>> >> > Erick, Thanks a lot for sharing an excellent post...
>> >> >
>> >> > Btw, I am using NIOFSDirectory. Could you please elaborate on the
>> >> > lines quoted below, or give any further pointers?
>> >> >
>> >> >> NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our
>> >> >> code has to do a lot of syscalls to the O/S kernel to copy blocks of data
>> >> >> between the disk or filesystem cache and our buffers residing in Java
>> >> >> heap. This needs to be done on every search request, over and over again.
>> >
>> > The blog post says it simply: you should use MMapDirectory and avoid
>> > SimpleFSDir or NIOFSDir! The blog post explains why: the index inputs of
>> > SimpleFSDir and NIOFSDir extend BufferedIndexInput. This class uses an
>> > on-heap buffer for reading index files (which is 16 KiB). For some parts
>> > of the index (like doc values), this is not ideal. E.g. if you sort
>> > against a doc values field and it needs to access a sort value (e.g. a
>> > short, integer or byte, which is very small), it will ask the buffer for
>> > just those few bytes (4 or so). In most cases when sorting, the buffer
>> > will not contain those bytes, as sorting requires random access over a
>> > huge file (so it is unlikely that the buffer will help). Then
>> > BufferedIndexInput will seek the NIO/Simple file pointer and read 16 KiB
>> > into the buffer. This requires a syscall to the OS kernel, which is
>> > expensive. While sorting search results this can happen millions or
>> > billions of times. In addition, it will copy chunks of memory between
>> > Java heap and operating system cache over and over.
>> >
>> > With MMapDirectory no buffering is done; the Lucene code directly
>> > accesses the file system cache, which is much more efficient.
>> >
>> > So for fast index access:
>> > - avoid SimpleFSDir or NIOFSDir (those are only there for legacy 32 bit
>> >   operating systems and JVMs)
>> > - configure your operating system kernel as described in the blog post
>> >   and use MMapDirectory
>> > - tell the sysadmin to learn how to read the output of the Linux
>> >   commands free/top/... (or their Windows equivalents).
>> >
>> > Uwe
>> >
>> >> > --
>> >> > Kumaran R
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson
>> >> <[email protected]>
>> >> > wrote:
>> >> >
>> >> >> see Uwe's blog:
>> >> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> >> >>
>> >> >> Short form: files are read into the OS's memory as needed. the whole
>> >> >> file isn't read at once.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Wed, Nov 23, 2016 at 12:04 AM, Kumaran Ramasubramanian
>> >> >> <[email protected]> wrote:
>> >> >> > Hi All,
>> >> >> >
>> >> >> > how do lucene read large index files?
>> >> >> > for example, if one file (for eg: .dat file) is 4GB.
>> >> >> > lucene read only part of file to RAM? or
>> >> >> > is it different approach for different lucene file formats?
>> >> >> >
>> >> >> >
>> >> >> > Related Link:
>> >> >> > How do applications (and OS) handle very big files?
>> >> >> > http://superuser.com/a/361201
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Kumaran R
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: [email protected]
>> >> >> For additional commands, e-mail: [email protected]
>> >> >>
>> >> >>