Re: Info on document number limitations

Adrien Grand Fri, 14 Feb 2020 01:33:27 -0800

Lucene has a limit of 2^31-1-128 documents per index, see
IndexWriter.MAX_DOCS. Users don't often run into this limit but I've seen
it happen multiple times.
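
For reference, the arithmetic behind that number can be sketched as below. This is a standalone illustration that does not depend on Lucene: the class and method names are made up for the example, and the constant only mirrors the value of IndexWriter.MAX_DOCS (Integer.MAX_VALUE - 128) rather than importing it. The shard-count helper is an assumption about how one might size the sharding workaround mentioned further down the thread.

```java
final class DocLimitSketch {
    // Mirrors the value of Lucene's IndexWriter.MAX_DOCS: 2^31 - 1 - 128.
    static final int MAX_DOCS = Integer.MAX_VALUE - 128;

    // Hypothetical helper: smallest number of indexes (shards) that can hold
    // totalDocs documents if each index is filled up to MAX_DOCS.
    static long minShards(long totalDocs) {
        if (totalDocs < 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0");
        }
        // Ceiling division, done in long arithmetic to avoid int overflow.
        return (totalDocs + MAX_DOCS - 1) / MAX_DOCS;
    }

    public static void main(String[] args) {
        System.out.println(MAX_DOCS);                   // the per-index limit
        System.out.println(minShards(5_000_000_000L));  // shards needed for 5 billion docs
    }
}
```

In practice people shard long before any single index gets close to the limit, so this is a lower bound rather than a sizing recommendation.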


I think it's unlikely that Lucene will ever remove this limit on a
per-segment basis. However, there have been some discussions about
allowing an index to go over this limit across multiple segments:
https://issues.apache.org/jira/browse/LUCENE-8321.
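
To make the distinction concrete, here is a rough sketch of the general idea, not code from Lucene or from that issue: every segment keeps int document numbers, and only the bookkeeping that spans segments (the total count and each segment's starting offset) is widened to long. All names below are hypothetical.

```java
final class CompositeDocAddressSketch {
    // Hypothetical: total document count across segments, summed as a long
    // so that it can exceed Integer.MAX_VALUE.
    static long totalDocs(int[] segmentMaxDocs) {
        long total = 0L;
        for (int maxDoc : segmentMaxDocs) {
            total += maxDoc;
        }
        return total;
    }

    // Hypothetical: position of a document in the whole index, given the long
    // offset of its segment and its int segment-local document number.
    static long globalDoc(long docBase, int localDoc) {
        if (docBase < 0 || localDoc < 0) {
            throw new IllegalArgumentException("docBase and localDoc must be >= 0");
        }
        return docBase + localDoc;
    }

    public static void main(String[] args) {
        int[] segments = {1_500_000_000, 1_500_000_000};
        // Each segment fits in an int, but the sum only fits in a long.
        System.out.println(totalDocs(segments));
        System.out.println(globalDoc(1_500_000_000L, 1_000_000_000));
    }
}
```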

On Sun, Feb 9, 2020 at 2:29 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Also, given how people use search, they hit performance issues long before
> running out of document IDs. Usually. Although that said I do know of one
> user who’s running in the 1.0-1.5B range per replica so 2B is just around
> the corner. Of course they have to be _very_ careful how they use Solr.
>
> And that said, there’s just not a lot of pressure to go to longs, and as
> Tim says it’d be a very significant effort. And there would be memory
> implications for everyone to balance.
>
> Best,
> Erick
>
> > On Feb 8, 2020, at 9:59 PM, Tim Casey <tca...@gmail.com> wrote:
> >
> >
> > Hi Doug,
> >
> > I don't know the specific limits.  But the document limits are going to
> be around an int, probably signed.  This comes out to mean about 2 billion
> documents per lucene index.  This is fairly embedded into the lucene code.
> The way the collective we have solved this is through forms of sharding.
> >
> > tim
> >
> > On Fri, Feb 7, 2020 at 11:27 AM Doug Tarr <doug.t...@mongodb.com.invalid>
> wrote:
> > Hi!
> >
> > I'm working on a team that is building a lucene based search platform.
>  I've been lurking on this list for a while as we are spooling up on
> learning the various components of Lucene.  Thank you all for your amazing
> work!
> >
> > I'm interested in learning more about what work has been done around
> document count limitations in the Lucene 8 codec (as described here)
> related to using int32 vs VInt or Int64:
> >
> > "Lucene uses a Java int to refer to document numbers, and the index file
> format uses an Int32 on-disk to store document numbers. This is a
> limitation of both the index file format and the current implementation.
> Eventually these should be replaced with either UInt64 values, or better
> yet, VInt values which have no limit."
> >
> > I've looked through JIRA and couldn't find any discussions about it,
> trade-offs, difficulties, etc.  If there's any information about this, I'd
> appreciate any links or info that you might have.
> >
> > Thanks!
> > - Doug
> > --
> >
> > { name     : "Doug Tarr",
> >   title    : "Director of Engineering, Search",
> >   location : "San Francisco, CA",
> >   company  : "MongoDB",
> >   email    : "doug.t...@mongodb.com",
> >   linkedin : "douglastarr",
> >   twitter  : "@doug_tarr" }

-- 
Adrien
