RE: how do lucene read large index files?

Uwe Schindler Thu, 24 Nov 2016 09:42:40 -0800

Hi Kumaran, hi Erick,

> Not really, as I don't know that code well, Uwe and company
> are the masters of that realm ;)....
> 
> Sorry I can't be more help there....

I can help!

> On Thu, Nov 24, 2016 at 7:29 AM, Kumaran Ramasubramanian
> <[email protected]> wrote:
> > Erick, Thanks a lot for sharing an excellent post...
> >
> > Btw, am using NIOFSDirectory, could you please elaborate on below
> > mentioned lines? or any further pointers?
> >> NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our
> >> code has to do a lot of syscalls to the O/S kernel to copy blocks of data
> >> between the disk or filesystem cache and our buffers residing in Java
> >> heap. This needs to be done on every search request, over and over again.

The blog post puts it simply: you should use MMapDirectory and avoid
SimpleFSDir or NIOFSDir! The blog post explains why: the IndexInput
implementations of SimpleFSDir and NIOFSDir extend BufferedIndexInput. This
class uses an on-heap buffer (16 KiB) for reading index files. For some parts
of the index (like doc values), this is not ideal. E.g. if you sort against a
doc values field and Lucene needs to access a sort value (e.g. a short,
integer or byte, which is very small), it asks the buffer for just those few
bytes (4, say). In most cases when sorting, the buffer will not contain those
bytes, as sorting requires random access over a huge file (so it is unlikely
that the buffer will help). BufferedIndexInput then seeks the NIO/Simple file
pointer and reads 16 KiB into the buffer. This requires a syscall to the OS
kernel, which is expensive. While sorting search results this can happen
millions or billions of times. In addition, it copies chunks of memory between
the operating system cache and the Java heap over and over.
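
To make the cost visible, here is a minimal sketch in plain java.nio (this is
not Lucene's code). The class BufferedReadSketch and its methods are
hypothetical names made up for illustration; it only imitates the pattern
described above: a 16 KiB heap buffer that is refilled by a positional read
whenever the requested 4 bytes are not in it. Each refill stands for at least
one syscall plus a copy into the Java heap.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

public class BufferedReadSketch {
    // Buffer size taken from the explanation above (16 KiB).
    static final int BUFFER_SIZE = 16 * 1024;

    /**
     * Reads one int at each position through a heap buffer and returns how
     * often the buffer had to be refilled from the file.
     * Every position must leave room for 4 bytes before the end of the file.
     */
    static long countRefills(FileChannel channel, long[] positions) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
        long bufferStart = 0;  // file offset of the first buffered byte
        int bufferLength = 0;  // 0 means nothing is buffered yet
        long refills = 0;
        for (long pos : positions) {
            boolean hit = pos >= bufferStart
                    && pos + Integer.BYTES <= bufferStart + bufferLength;
            if (!hit) {
                buffer.clear();
                // Positional reads may return fewer bytes than requested, so loop.
                while (buffer.hasRemaining()) {
                    int n = channel.read(buffer, pos + buffer.position());
                    if (n < 0) {
                        break; // end of file
                    }
                }
                bufferStart = pos;
                bufferLength = buffer.position();
                refills++;
            }
            // Absolute get: the value is served from the heap buffer.
            buffer.getInt((int) (pos - bufferStart));
        }
        return refills;
    }

    /** Creates a zero-filled temporary file of the given size. */
    static Path createTempFile(int size) throws IOException {
        Path file = Files.createTempFile("buffered-read-sketch", ".bin");
        Files.write(file, new byte[size]);
        file.toFile().deleteOnExit();
        return file;
    }

    public static void main(String[] args) throws IOException {
        int size = 1024 * 1024;
        Path file = createTempFile(size);

        // Random access, as when sorting on a doc values field.
        Random random = new Random(42);
        long[] randomPositions = new long[10_000];
        for (int i = 0; i < randomPositions.length; i++) {
            randomPositions[i] = random.nextInt(size - Integer.BYTES);
        }
        // Sequential access over the same file for comparison.
        long[] sequentialPositions = new long[size / Integer.BYTES];
        for (int i = 0; i < sequentialPositions.length; i++) {
            sequentialPositions[i] = (long) i * Integer.BYTES;
        }

        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            System.out.println("random refills:     " + countRefills(channel, randomPositions));
            System.out.println("sequential refills: " + countRefills(channel, sequentialPositions));
        }
    }
}
```

With sequential reads the buffer pays off, because one refill serves thousands
of values. With random positions spread over a file much larger than the
buffer, most reads miss and trigger a refill of their own.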

With MMapDirectory no buffering is done on the Java heap: the Lucene code
directly accesses the file system cache, which avoids the syscall and the
extra copy for every small read.
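
For comparison, the same kind of random access through a memory-mapped file
can be sketched with plain java.nio as follows. This is only an illustration
of the principle, not how MMapDirectory is implemented; the class
MmapReadSketch and its methods are hypothetical names. A single
MappedByteBuffer is also limited to 2 GiB, so real code for large index files
has to map a file in several chunks.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapReadSketch {

    /** Writes the ints 0..count-1 (big-endian, 4 bytes each) to a temporary file. */
    static Path writeInts(int count) throws IOException {
        ByteBuffer data = ByteBuffer.allocate(count * Integer.BYTES);
        for (int i = 0; i < count; i++) {
            data.putInt(i);
        }
        Path file = Files.createTempFile("mmap-read-sketch", ".bin");
        Files.write(file, data.array());
        file.toFile().deleteOnExit();
        return file;
    }

    /**
     * Maps the whole file read-only and sums the ints found at the given byte
     * offsets. After the mapping is set up, each getInt is a plain memory
     * access into the OS file system cache: no read call, no heap buffer.
     */
    static long sumIntsAt(Path file, int[] offsets) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            for (int offset : offsets) {
                sum += map.getInt(offset); // absolute read, big-endian by default
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = writeInts(1000);
        // Byte offsets 0, 4 and 400 hold the ints 0, 1 and 100.
        System.out.println(sumIntsAt(file, new int[] {0, 4, 400}));
    }
}
```

The mapping stays valid after the channel is closed, and the OS pages the file
in on demand, so only the parts that are actually touched need to be in RAM.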

So for fast index access:
- avoid SimpleFSDir or NIOFSDir (those are only there for legacy 32-bit
operating systems and JVMs)
- configure your operating system kernel as described in the blog post and use
MMapDirectory
- ask your sysadmin to get familiar with the output of the Linux commands
free/top/... (or their Windows equivalents).

Uwe

> > --
> > Kumaran R
> >
> >
> >
> > On Wed, Nov 23, 2016 at 9:17 PM, Erick Erickson
> <[email protected]>
> > wrote:
> >
> >> see Uwe's blog:
> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >>
> >> Short form: files are read into the OS's memory as needed. the whole
> >> file isn't read at once.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Nov 23, 2016 at 12:04 AM, Kumaran Ramasubramanian
> >> <[email protected]> wrote:
> >> > Hi All,
> >> >
> >> > how do lucene read large index files?
> >> > for example, if one file (for eg: .dat file) is 4GB.
> >> > lucene read only part of file to RAM? or
> >> > is it different approach for different lucene file formats?
> >> >
> >> >
> >> > Related Link:
> >> > How do applications (and OS) handle very big files?
> >> > http://superuser.com/a/361201
> >> >
> >> >
> >> > --
> >> > Kumaran R
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> 