Hi Michael,

Can you or someone from the community please help answer my questions?

Thanks
Siddharth

On Thu, Nov 7, 2019 at 7:50 AM siddharth teotia <siddharthteo...@gmail.com> wrote:

> Hi Michael
>
> Thanks a lot for your response. A couple more questions:
>
> (1) During indexing, is there any knob to tell the writer to use off-heap
> memory for buffering? I didn't find anything in the docs, so the answer is
> probably no. Just confirming.
>
> (2) In my experiments, I have gone up to ingesting 5 million documents into
> the Lucene index, and the number of segments created was 1. The writer was
> committed and closed after ingesting all the documents, and after that there
> is no need for us to index more. So essentially it is an immutable index.
> Basically I wanted to find the threshold for creating a new segment. Is
> that pretty high? Or, if the writer is reopened, will the next set of
> documents go into the next segment, and so on? The reason for asking
> is to find the total number of files (per index) that will be opened
> during querying. So far, since it was a single segment, only that segment's
> cfs file was opened.
>
> Thanks
> Siddharth
>
> On Thu, Nov 7, 2019, 6:39 AM Michael McCandless <luc...@mikemccandless.com>
> wrote:
>
>> Hi Siddharth,
>>
>> Your understanding of MMapDirectory is correct -- only give your JVM
>> enough heap to not spend too much CPU on GC, and then let the OS use all
>> available remaining RAM to cache hot pages from your index.
>>
>> There are some structures Lucene loads into JVM heap, but even those are
>> being moved off-heap (accessed via Directory) recently, such as FSTs used
>> for the terms index, and the BKD index (for dimensional points). I'm not
>> sure exactly which structures are still in heap ... maybe the live
>> documents bitset?
>>
>> During indexing, the recently indexed documents are buffered in JVM heap,
>> up until the limit set by IndexWriterConfig.setRAMBufferSizeMB, and then
>> they will be written to the Directory as new segments.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Nov 6, 2019 at 11:27 PM siddharth teotia <
>> siddharthteo...@gmail.com> wrote:
>>
>>> Hi All
>>>
>>> I have some questions about memory usage. I would really appreciate it
>>> if someone could help answer these.
>>>
>>> I understand from the docs that during reading/querying, Lucene uses
>>> MMapDirectory (assuming it is supported on the platform). So the Java
>>> heap overhead in this case will come purely from the objects that are
>>> allocated/instantiated on the query path to process the query, build
>>> results, etc. But the whole index itself will not be loaded into memory
>>> because we memory-mapped the file. Is my understanding correct? In this
>>> case, we are better off not increasing the Java heap and keeping as much
>>> as possible available for the file system cache so mmap can do its job
>>> efficiently.
>>>
>>> However, are there any portions of index structures that are completely
>>> loaded in memory regardless of whether it is MMapDirectory or not? If so,
>>> are they loaded in Java heap or do we use off-heap (direct buffers) in
>>> such cases?
>>>
>>> Secondly, on the write path, I think even though the writer opens a
>>> MMapDirectory, the writes are gathered/buffered in memory up to a flush
>>> threshold controlled by IndexWriterConfig. Is this buffering done in Java
>>> heap or direct memory?
>>>
>>> Thanks a lot for the help
>>> Siddharth
>>>
>>
--
*Best Regards,*
*SIDDHARTH TEOTIA*
*2008C6PS540G*
*BITS PILANI- GOA CAMPUS*
*+91 87911 75932*
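
A note for readers following the MMapDirectory part of this thread: below is a minimal Java sketch, using only the JDK (no Lucene), of the OS mechanism being discussed. It assumes that MMapDirectory in the Lucene versions current at the time of this thread mapped index files through FileChannel.map. The class name MmapSketch, the helper mapReadOnly, and the temp file are hypothetical names chosen for illustration, not anything from Lucene. It only shows that mapped bytes sit in a direct (off-heap) buffer, so file contents are paged in by the OS rather than copied into the Java heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of Lucene.
class MmapSketch {

    // Maps the whole file read-only. The returned buffer is backed by the
    // OS page cache, not by a byte[] on the Java heap.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an index file; the name and contents are made up.
        Path file = Files.createTempFile("mmap-sketch", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, "hello lucene".getBytes(StandardCharsets.UTF_8));

        MappedByteBuffer buf = mapReadOnly(file);
        System.out.println("direct (off-heap) buffer: " + buf.isDirect());
        System.out.println("mapped bytes: " + buf.capacity());
        System.out.println("first byte: " + (char) buf.get(0));
    }
}
```

Only the small buffer object itself lives on the heap. The pages behind it are loaded and evicted by the operating system, which is why the advice above is to keep the JVM heap modest and leave the remaining RAM to the file system cache.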