I dug into this a bit and suspect that the issue is mostly with one field
that is not part of the data but is auto-generated: the ID field. It is a
slight variant of Flake IDs, so it is not random: it includes a
timestamp and a sequence number. I suspect that its patterns,
combined with an alphabet larger than ASCII, make this size increase
more likely than with the data set you tested against.

For instance I ran the following code with direct array addressing on
and off to simulate a worst-case scenario.

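  import java.io.IOException;
  import java.nio.file.Paths;
  import java.util.Random;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field.Store;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.IndexWriterConfig.OpenMode;
  import org.apache.lucene.index.LeafReader;
  import org.apache.lucene.index.SegmentReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.BytesRef;

  // The class name is arbitrary
  public class RamUsageTest {
    public static void main(String[] args) throws IOException {
      Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
      IndexWriter w = new IndexWriter(dir, new IndexWriterConfig().setOpenMode(OpenMode.CREATE));
      byte[] b = new byte[5];
      Random r = new Random(0);
      // Index 1M random 5-byte terms whose bytes are all multiples of 4
      for (int i = 0; i < 1000000; ++i) {
        r.nextBytes(b);
        for (int j = 0; j < b.length; ++j) {
          b[j] &= 0xfc; // make this byte a multiple of 4
        }
        Document doc = new Document();
        StringField field = new StringField("f", new BytesRef(b), Store.NO);
        doc.add(field);
        w.addDocument(doc);
      }
      w.forceMerge(1);
      IndexReader reader = DirectoryReader.open(w);
      w.close();
      if (reader.leaves().size() != 1) {
        throw new Error();
      }
      LeafReader leaf = reader.leaves().get(0).reader();
      // Heap usage of the single merged segment, in bytes
      System.out.println(((SegmentReader) leaf).ramBytesUsed());
      reader.close();
      dir.close();
    }
  }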

When direct addressing is enabled (the default), I get 586079 bytes. If I
disable direct addressing by applying the patch below, I get
156228 bytes, about 3.75x less.

diff --git a/lucene/core/src/java/org/apache/lucene/util/fst/FST.java b/lucene/core/src/java/org/apache/lucene/util/fst/FST.java
index f308f1a..ff99cc2 100644
--- a/lucene/core/src/java/org/apache/lucene/util/fst/FST.java
+++ b/lucene/core/src/java/org/apache/lucene/util/fst/FST.java
@@ -647,7 +647,7 @@ public final class FST<T> implements Accountable {
       // array that may have holes in it so that we can address the arcs directly by label without
       // binary search
       int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
-      boolean writeDirectly = labelRange > 0 && labelRange < Builder.DIRECT_ARC_LOAD_FACTOR * nodeIn.numArcs;
+      boolean writeDirectly = false; // labelRange > 0 && labelRange < Builder.DIRECT_ARC_LOAD_FACTOR * nodeIn.numArcs;

       //System.out.println("write int @pos=" + (fixedArrayStart-4) + " numArcs=" + nodeIn.numArcs);
       // create the header
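
To illustrate why labels that are all multiples of 4 are close to a worst case for the condition in the patched line, here is a minimal standalone sketch. It does not use Lucene: the class and method names are hypothetical, and the load factor of 4 is an assumption standing in for Builder.DIRECT_ARC_LOAD_FACTOR. It only counts array slots and does not try to reproduce the byte counts reported above.

```java
public class DirectAddressingSketch {
  // Assumed value, for illustration only; the real constant is
  // Builder.DIRECT_ARC_LOAD_FACTOR in Lucene.
  static final int LOAD_FACTOR = 4;

  /** Same condition as the patched line in FST.java, applied to sorted arc labels. */
  public static boolean writeDirectly(int[] sortedLabels) {
    int numArcs = sortedLabels.length;
    int labelRange = sortedLabels[numArcs - 1] - sortedLabels[0] + 1;
    return labelRange > 0 && labelRange < LOAD_FACTOR * numArcs;
  }

  /** Number of empty slots in an array indexed directly by label. */
  public static int holes(int[] sortedLabels) {
    int numArcs = sortedLabels.length;
    int labelRange = sortedLabels[numArcs - 1] - sortedLabels[0] + 1;
    return labelRange - numArcs;
  }

  public static void main(String[] args) {
    // Every multiple of 4 in [0, 252], i.e. the byte values left by `b[j] &= 0xfc`.
    int[] labels = new int[64];
    for (int i = 0; i < labels.length; i++) {
      labels[i] = i * 4;
    }
    System.out.println("writeDirectly=" + writeDirectly(labels));
    System.out.println("holes=" + holes(labels));
  }
}
```

With 64 arcs spread over a label range of 253, the range stays just under 4 * 64 = 256, so the node is still written with direct addressing even though 189 of its 253 slots are holes.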

On Mon, Jul 15, 2019 at 2:33 PM Michael Sokolov <[email protected]> wrote:
>
> OK, both LUCENE-8781 and LUCENE-8895 were introduced in 8.2.0. I see
> most of the other data sets report an increase more in the 10-15%
> range, which is expected. I'm curious what the makeup of that http
> logs data set is -- I guess it's HTTP logs :) Is the data public?
>
>
> On Mon, Jul 15, 2019 at 7:23 AM Ignacio Vera <[email protected]> wrote:
> >
> > The change to the Lucene 8.2.0 snapshot was done on July 10th. Before that,
> > the Lucene version was 8.1.0.
> >
> > On Mon, Jul 15, 2019 at 12:53 PM Michael Sokolov <[email protected]> wrote:
> >>
> >> Hmm that's possible, although the jump is bigger than anything I
> >> observed while testing. I assume these charts are building off of
> >> apache/master, or something close to that? If so, then the timing is
> >> off a bit. LUCENE-8781 was pushed quite a while before that, and then
> >> https://issues.apache.org/jira/browse/LUCENE-8895 which extended the
> >> encoding to be the default (not just for postings) was pushed on July
> >> 2 or so, but the chart shows a jump on July 10?
> >>
> >> On Mon, Jul 15, 2019 at 4:03 AM Ignacio Vera <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Using a snapshot of Lucene 8.2, we observed an increase of around 30%
> >> > in the memory usage of IndexReaders for some of the test datasets,
> >> > for example:
> >> >
> >> > https://elasticsearch-benchmarks.elastic.co/#tracks/http-logs/nightly/default/30d
> >> >
> >> > We suspect this is due to this change: 
> >> > https://issues.apache.org/jira/browse/LUCENE-8781
> >> >
> >> > On Sun, Jul 14, 2019 at 7:10 AM David Smiley <[email protected]> 
> >> > wrote:
> >> >>
> >> >> Since there won't be any 8.1.2, yet some issues got fixed for 8.1.2 and
> >> >> there is an 8.1.2 section in CHANGES.txt, those issues might not be very
> >> >> noticeable to users who only look at the published HTML version (e.g.
> >> >> https://lucene.apache.org/solr/8_1_1/changes/Changes.html ).  Maybe
> >> >> 8.1.2 should be integrated into 8.2.0 in CHANGES.txt?  Despite this, I
> >> >> see at least one of those issues got into the curated release notes /
> >> >> highlights anyway -- thanks Ignacio.
> >> >>
> >> >> ~ David Smiley
> >> >> Apache Lucene/Solr Search Developer
> >> >> http://www.linkedin.com/in/davidwsmiley
> >> >>
> >> >>
> >> >> On Fri, Jul 12, 2019 at 9:40 AM Jan Høydahl <[email protected]> 
> >> >> wrote:
> >> >>>
> >> >>> Please use HTTPS in the links to download pages.
> >> >>>
> >> >>> Jan Høydahl
> >> >>>
> >> >>> 12. jul. 2019 kl. 09:04 skrev Ignacio Vera <[email protected]>:
> >> >>>
> >> >>> Ishan: I had a look into the issues and I have no objections as long as
> >> >>> they get properly reviewed, if possible. It would be good to commit them
> >> >>> shortly so they go through a few CI iterations in case something gets
> >> >>> broken. I am planning to build the first RC early next week as there
> >> >>> are no blockers for the release.
> >> >>>
> >> >>> Steve: Thank you so much, I need to work on getting the right
> >> >>> permissions.
> >> >>>
> >> >>> Finally, I wrote a draft of the release notes for Lucene and Solr. It
> >> >>> would be good if someone with more experience in Solr could
> >> >>> review/modify my attempt, as it is difficult for me to know which are
> >> >>> the most important bits. Here are the links to the drafts (note they
> >> >>> are in the wiki, let me know if you have problems accessing them):
> >> >>>
> >> >>> Lucene:
> >> >>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=120732808&draftShareId=cb366dc4-c136-4505-9c37-60bde5db2550&src=shareui&src.shareui.timestamp=1562914476369
> >> >>>
> >> >>> Solr:
> >> >>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=120732972&draftShareId=5cace703-b80b-49c4-a07f-55b891683f90&src=shareui&src.shareui.timestamp=1562914529931
> >> >>>
> >> >>> On Thu, Jul 11, 2019 at 6:36 PM Ishan Chattopadhyaya 
> >> >>> <[email protected]> wrote:
> >> >>>>
> >> >>>> Hi Ignacio,
> >> >>>> I wish to include two security bug fixes (not vulnerabilities, but 
> >> >>>> feature regressions due to Authorization plugin), SOLR-13472 and 
> >> >>>> SOLR-13619. I can commit both shortly, attempting to write a unit 
> >> >>>> test for it (which is proving harder to do than reproducing, fixing 
> >> >>>> and testing manually). Please let me know if you have any concerns.
> >> >>>> Regards,
> >> >>>> Ishan
> >> >>>>
> >> >>>> On Thu, 11 Jul, 2019, 9:12 PM Tomoko Uchida, 
> >> >>>> <[email protected]> wrote:
> >> >>>>>
> >> >>>>> Hi Ignacio,
> >> >>>>>
> >> >>>>> LUCENE-8907 was fixed. (I have reverted a series of commits which
> >> >>>>> cause backwards incompatibility on Lucene 8.x.)
> >> >>>>> Thank you for waiting for that!
> >> >>>>>
> >> >>>>> Tomoko
> >> >>>>>
> >> >>>>> 2019年7月11日(木) 22:44 Uwe Schindler <[email protected]>:
> >> >>>>> >
> >> >>>>> > Hi,
> >> >>>>> >
> >> >>>>> > I enabled the policeman Jenkins Jobs for 8.2 branch.
> >> >>>>> >
> >> >>>>> > Uwe
> >> >>>>> >
> >> >>>>> > -----
> >> >>>>> > Uwe Schindler
> >> >>>>> > Achterdiek 19, D-28357 Bremen
> >> >>>>> > https://www.thetaphi.de
> >> >>>>> > eMail: [email protected]
> >> >>>>> >
> >> >>>>> > From: Ignacio Vera <[email protected]>
> >> >>>>> > Sent: Thursday, July 11, 2019 1:05 PM
> >> >>>>> > To: [email protected]
> >> >>>>> > Subject: Re: Lucene/Solr 8.2.0
> >> >>>>> >
> >> >>>>> > Hi,
> >> >>>>> >
> >> >>>>> > The branch has been created. As a reminder, this branch is on
> >> >>>>> > feature freeze and only documentation or build patches should be
> >> >>>>> > committed. I will be waiting for LUCENE-8907 to start building the
> >> >>>>> > first release candidate.
> >> >>>>> >
> >> >>>>> > Let me know if there is any other blocker before we can start the
> >> >>>>> > release process.
> >> >>>>> >
> >> >>>>> > It seems I do not have the permissions to create the Jenkins jobs
> >> >>>>> > for this branch, maybe Steve can help here?
> >> >>>>> >
> >> >>>>> > Thanks,
> >> >>>>> >
> >> >>>>> > Ignacio
> >> >>>>> >
> >> >>>>> > On Thu, Jul 11, 2019 at 4:51 AM David Smiley
> >> >>>>> > <[email protected]> wrote:
> >> >>>>> >
> >> >>>>> > BTW for 8.2.0 I updated Solr's CHANGES.txt to split out issues
> >> >>>>> > that seemed to be Improvements that were not really New Features.
> >> >>>>> >
> >> >>>>> > ~ David Smiley
> >> >>>>> > Apache Lucene/Solr Search Developer
> >> >>>>> > http://www.linkedin.com/in/davidwsmiley
> >> >>>>> >
> >> >>>>> > On Wed, Jul 10, 2019 at 10:38 AM Ignacio Vera <[email protected]>
> >> >>>>> > wrote:
> >> >>>>> >
> >> >>>>> > Thanks Tomoko for taking care of that.
> >> >>>>> >
> >> >>>>> > On Wed, Jul 10, 2019 at 4:03 PM Đạt Cao Mạnh
> >> >>>>> > <[email protected]> wrote:
> >> >>>>> >
> >> >>>>> > Hi Ignacio,
> >> >>>>> >
> >> >>>>> > 8.1.2 bugfix release will be cancelled. You can go ahead with 8.2
> >> >>>>> > release.
> >> >>>>> >
> >> >>>>> > Thanks!
> >> >>>>> >
> >> >>>>> > On Wed, 10 Jul 2019 at 20:38, Tomoko Uchida 
> >> >>>>> > <[email protected]> wrote:
> >> >>>>> >
> >> >>>>> > Hi,
> >> >>>>> > I opened a blocker issue a while ago for release 8.2:
> >> >>>>> > https://issues.apache.org/jira/browse/LUCENE-8907
> >> >>>>> >
> >> >>>>> > Sorry about that, I noticed the backwards incompatibility we have
> >> >>>>> > to deal with today. If there are no objections, I will revert all
> >> >>>>> > the related commits from branch_8x and 8_2 in a few days.
> >> >>>>> >
> >> >>>>> > Thanks,
> >> >>>>> > Tomoko
> >> >>>>> >
> >> >>>>> > 2019年7月10日(水) 22:02 Ignacio Vera <[email protected]>:
> >> >>>>> > >
> >> >>>>> > > Hi,
> >> >>>>> > >
> >> >>>>> > > All the issues listed above have already been committed and I see
> >> >>>>> > > no blockers for release 8.2. I will cut the branch tomorrow
> >> >>>>> > > around 10am CEST and I will wait for the decision on the bugfix
> >> >>>>> > > release 8.1.2 to schedule the build of the first release
> >> >>>>> > > candidate. Please let us know if this is troublesome for you.
> >> >>>>> > >
> >> >>>>> > > Thanks,
> >> >>>>> > >
> >> >>>>> > > Ignacio
> >> >>>>> > >
> >> >>>>> > >
> >> >>>>> > > On Tue, Jul 2, 2019 at 2:59 AM Joel Bernstein 
> >> >>>>> > > <[email protected]> wrote:
> >> >>>>> > >>
> >> >>>>> > >> I've got one issue that I'd like to get in 
> >> >>>>> > >> (https://issues.apache.org/jira/browse/SOLR-13589), which I 
> >> >>>>> > >> should have wrapped up in a day or two. +1 for around July 10th.
> >> >>>>> > >>
> >> >>>>> > >> On Mon, Jul 1, 2019 at 5:14 PM Nicholas Knize 
> >> >>>>> > >> <[email protected]> wrote:
> >> >>>>> > >>>
> >> >>>>> > >>> +1 for starting the 8.2 release process. I think it would be 
> >> >>>>> > >>> good to get the LUCENE-8632 feature into 8.2 along with the 
> >> >>>>> > >>> BKD improvements and changes in LUCENE-8888 and LUCENE-8896
> >> >>>>> > >>>
> >> >>>>> > >>> Nicholas Knize, Ph.D., GISP
> >> >>>>> > >>> Geospatial Software Guy  |  Elasticsearch
> >> >>>>> > >>> Apache Lucene PMC Member and Committer
> >> >>>>> > >>> [email protected]
> >> >>>>> > >>>
> >> >>>>> > >>>
> >> >>>>> > >>> On Wed, Jun 26, 2019 at 9:34 AM Ignacio Vera 
> >> >>>>> > >>> <[email protected]> wrote:
> >> >>>>> > >>>>
> >> >>>>> > >>>> Hi all,
> >> >>>>> > >>>>
> >> >>>>> > >>>> 8.1 was released on May 16th and we have new features,
> >> >>>>> > >>>> enhancements and fixes that are not released yet, so I'd like
> >> >>>>> > >>>> to start thinking about releasing Lucene/Solr 8.2.0.
> >> >>>>> > >>>>
> >> >>>>> > >>>> I can create the 8.2 branch in two weeks' time (around July
> >> >>>>> > >>>> 10th) and build the first RC by the end of that week if that
> >> >>>>> > >>>> works for everyone. Please let me know if there are bug fixes
> >> >>>>> > >>>> that need to go into 8.2 and might not be ready by then.
> >> >>>>> > >>>>
> >> >>>>> > >>>> Cheers,
> >> >>>>> > >>>>
> >> >>>>> > >>>> Ignacio
> >> >>>>> >
> >> >>>>> > ---------------------------------------------------------------------
> >> >>>>> > To unsubscribe, e-mail: [email protected]
> >> >>>>> > For additional commands, e-mail: [email protected]
> >> >>>>> >
> >> >>>>> > --
> >> >>>>> >
> >> >>>>> > Best regards,
> >> >>>>> >
> >> >>>>> > Cao Mạnh Đạt
> >> >>>>> >
> >> >>>>> > E-mail: [email protected]


-- 
Adrien
