Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

Adrien Grand Wed, 07 Jun 2023 04:22:05 -0700

I agree it's worth discussing. I opened
https://github.com/apache/lucene/issues/12355 and
https://github.com/apache/lucene/issues/12356.


On Tue, Jun 6, 2023 at 9:17 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Thanks Adrien. I spent some time trying to understand the readByte() in
> ReverseRandomAccessReader (through FST) and compare with 7.x. Although I
> don't understand ALL of the details and reasoning for always loading the
> FST (and in turn the term index) off-heap (as discussed in
> https://github.com/apache/lucene/issues/10297 ) I understand that this is
> essentially causing disk access for every single byte during readByte().
>
> Does this warrant a JIRA for regression?
>
> As mentioned, I am noticing a 10x slowdown in SegmentTermsEnum.seekExact(),
> which affects atomic update performance. For setups like mine that can't use
> mmap due to large indexes, this would be a legitimate regression, no?
>
> - Rahul
>
> On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> > Yes, this changed in 8.x:
> >  - 8.0 moved the terms index off-heap for non-PK fields with
> > MMapDirectory. https://github.com/apache/lucene/issues/9681
> >  - Then in 8.6 the FST was moved off-heap all the time.
> > https://github.com/apache/lucene/issues/10297
> >
> > More generally, there are a few files that are no longer loaded in heap
> > in 8.x. It should be possible to load them back in heap by doing
> > something like this (beware, I did not actually test this code):
> >
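> > import java.io.IOException;
> > import java.nio.ByteBuffer;
> > import java.nio.ByteOrder;
> > import java.util.Collections;
> >
> > import org.apache.lucene.store.ByteBuffersDataInput;
> > import org.apache.lucene.store.ByteBuffersIndexInput;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.FilterDirectory;
> > import org.apache.lucene.store.IOContext;
> > import org.apache.lucene.store.IndexInput;
> >
> > // Wraps another Directory and serves files opened with context.load set
> > // from a heap copy instead of from the underlying directory.
> > class MyHeapDirectory extends FilterDirectory {
> >
> >   MyHeapDirectory(Directory in) {
> >     super(in);
> >   }
> >
> >   @Override
> >   public IndexInput openInput(String name, IOContext context)
> >       throws IOException {
> >     if (context.load == false) {
> >       return super.openInput(name, context);
> >     }
> >     // Copy the whole file into a byte[]; toIntExact throws if the file
> >     // is longer than Integer.MAX_VALUE bytes.
> >     try (IndexInput input = super.openInput(name, context)) {
> >       byte[] bytes = new byte[Math.toIntExact(input.length())];
> >       input.readBytes(bytes, 0, bytes.length);
> >       ByteBuffer bb = ByteBuffer.wrap(bytes)
> >           .order(ByteOrder.LITTLE_ENDIAN)
> >           .asReadOnlyBuffer();
> >       return new ByteBuffersIndexInput(
> >           new ByteBuffersDataInput(Collections.singletonList(bb)),
> >           "ByteBuffersIndexInput(" + name + ")");
> >     }
> >   }
> > }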
> >
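To make the idea behind the workaround above concrete without pulling in Lucene, here is a minimal sketch using only the JDK. It copies a file into the heap once and then reads it back to front, which is the access pattern the FST uses. The class and method names (HeapLoadSketch, loadIntoHeap, readBackwards, roundTrip) are hypothetical and exist only for this illustration; this is not Lucene code and says nothing about how ByteBuffersIndexInput is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
final class HeapLoadSketch {

  /** Reads the whole file into a heap array with one bulk read. */
  static ByteBuffer loadIntoHeap(Path file) {
    try {
      // readAllBytes only works for files that fit in a single byte[].
      byte[] bytes = Files.readAllBytes(file);
      return ByteBuffer.wrap(bytes).asReadOnlyBuffer();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the buffer's bytes in reverse order using absolute gets. */
  static byte[] readBackwards(ByteBuffer buffer) {
    byte[] out = new byte[buffer.limit()];
    for (int i = 0; i < out.length; i++) {
      // Absolute get on a heap buffer: no file access, whatever the order.
      out[i] = buffer.get(out.length - 1 - i);
    }
    return out;
  }

  /** Writes data to a temp file, loads it into heap, reads it backwards. */
  static byte[] roundTrip(byte[] data) {
    try {
      Path tmp = Files.createTempFile("heap-load-sketch", ".bin");
      try {
        Files.write(tmp, data);
        return readBackwards(loadIntoHeap(tmp));
      } finally {
        Files.delete(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(roundTrip(new byte[] {1, 2, 3, 4})));
  }
}
```

The trade-off is the same as in the directory wrapper: the file is read once up front and then costs heap proportional to its size, in exchange for reads that no longer depend on the access direction.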
> > On Tue, Jun 6, 2023 at 3:41 PM Rahul Goswami <rahul196...@gmail.com>
> > wrote:
> > >
> > > Thanks Adrien. Is this behavior of FST something that has changed in
> > > Lucene 8.x (from 7.x)?
> > > Also, is the terms index not loaded into memory anymore in 8.x?
> > >
> > > To your point on MMapDirectoryFactory, it is much faster as you
> > > anticipated, but the indexes commonly being >1 TB makes the Windows
> > > machine freeze to a point I sometimes can't even connect to the VM.
> > > SimpleFSDirectory works well for us from that standpoint.
> > >
> > > To add, both NIOFS and SimpleFS have similar indexing benchmarks on
> > > Windows. I understand it is because of the Java bug which synchronizes
> > > internally in the native call for NIOFS.
> > >
> > > -Rahul
> > >
> > > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand <jpou...@gmail.com> wrote:
> > >
> > > > +Alan Woodward helped me better understand what is going on here.
> > > > BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> > > > doesn't play well with the fact that the FST reads bytes backwards:
> > > > every call to readByte() triggers a refill of 1kB because it wants to
> > > > read the byte that is just before what the buffer contains.
> > > >
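The refill behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: RefillCounter is a hypothetical class, not Lucene's BufferedIndexInput, and it assumes that a read outside the current buffer refills the buffer starting at the requested position and extending forward. The 1 kB buffer size is the one mentioned above; the file length and number of reads are arbitrary.

```java
// Hypothetical toy model of a forward-filling read buffer, not Lucene code.
final class RefillCounter {
  private final long fileLength;
  private final int bufferSize;
  private long bufferStart = 0; // file offset of the first buffered byte
  private int bufferLength = 0; // number of valid bytes currently buffered
  private int refills = 0;

  RefillCounter(long fileLength, int bufferSize) {
    this.fileLength = fileLength;
    this.bufferSize = bufferSize;
  }

  /** Simulates reading one byte at pos; a miss refills forward from pos. */
  void readByteAt(long pos) {
    if (pos < 0 || pos >= fileLength) {
      throw new IllegalArgumentException("position out of range: " + pos);
    }
    boolean hit = pos >= bufferStart && pos < bufferStart + bufferLength;
    if (!hit) {
      bufferStart = pos;
      bufferLength = (int) Math.min(bufferSize, fileLength - pos);
      refills++;
    }
  }

  int refills() {
    return refills;
  }

  /** Counts refills for `reads` single-byte reads, forward or backwards. */
  static int countRefills(long fileLength, int bufferSize,
                          boolean backwards, int reads) {
    RefillCounter counter = new RefillCounter(fileLength, bufferSize);
    for (int i = 0; i < reads; i++) {
      long pos = backwards ? fileLength - 1 - i : i;
      counter.readByteAt(pos);
    }
    return counter.refills();
  }

  public static void main(String[] args) {
    // 1 MB file, 1 kB buffer, 4096 single-byte reads in each direction.
    System.out.println("forward:   " + countRefills(1 << 20, 1024, false, 4096));
    System.out.println("backwards: " + countRefills(1 << 20, 1024, true, 4096));
  }
}
```

In this model the forward scan needs 4 refills for 4096 reads, while the backwards scan needs 4096, one per byte, because each requested byte sits just before the start of the buffer.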
> > > > On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand <jpou...@gmail.com> wrote:
> > > > >
> > > > > My best guess based on your description of the issue is that
> > > > > SimpleFSDirectory doesn't like the fact that the terms index now
> > > > > reads data directly from the directory instead of loading the terms
> > > > > index in heap. Would you be able to run the same benchmark with
> > > > > MMapDirectory to check if it addresses the regression?
> > > > >
> > > > >
> > > > > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami <rahul196...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > > We started experiencing slowness with atomic updates in Solr after
> > > > > > upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> > > > > > slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch()
> > > > > > call, which eventually calls Lucene's SegmentTermsEnum.seekExact().
> > > > > >
> > > > > > In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2.
> > > > > > After discussion on the Solr mailing list I created the below JIRA:
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/SOLR-16838
> > > > > >
> > > > > > The thread dumps collected show a lot of threads stuck in the
> > > > > > FST.findTargetArc() method.
> > > > > >
> > > > > > Testing environment details:
> > > > > > - Java 11 on Windows Server
> > > > > > - JVM heap: -Xms1536m -Xmx3072m
> > > > > > - Indexing client code running 15 parallel threads, indexing in
> > > > > > batches of 1000 on a standalone core.
> > > > > > - Using SimpleFSDirectoryFactory (since mmap doesn't quite work
> > > > > > well on Windows for our index sizes, which commonly run north of
> > > > > > 1 TB)
> > > > > >
> > > > > >
> > > > > > https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
> > > > > >
> > > > > > Is there a known issue with slowness with TermsEnum.seekExact() in
> > > > > > Lucene 8.x?
> > > > > >
> > > > > > Thanks,
> > > > > > Rahul
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Adrien
> > > >
> > > >
> > > >
> > > > --
> > > > Adrien
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> >
> >
> >
> > --
> > Adrien
> >
> >
> >



-- 
Adrien

