My guess is that the code is designed this way to avoid boilerplate more than for performance reasons. Mike McCandless might have more information?
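As a minimal sketch of the encapsulation being discussed, assuming a simplified arc that carries only a label and a target: `ArcSketch` and its members are made up for illustration and are not the real `FST.Arc`. Trivial accessors like these are commonly inlined by the HotSpot JIT, but that has not been measured here.

```java
// Hypothetical stand-in for an FST arc; NOT Lucene's actual FST.Arc.
final class ArcSketch {
  private int label;   // input label of this arc
  private long target; // address of the node this arc points to

  // Small accessors on a final class: typical JIT inlining candidates.
  int label() {
    return label;
  }

  long target() {
    return target;
  }

  // All mutation goes through one method, so the representation can
  // change later without touching callers.
  ArcSketch set(int label, long target) {
    if (label < 0) {
      throw new IllegalArgumentException("label must be >= 0, got " + label);
    }
    this.label = label;
    this.target = target;
    return this;
  }
}
```

Callers would then read `arc.label()` instead of `arc.label`, which is the extra boilerplate the current design avoids.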
+1 to disable array-with-gaps but keep the logic for now

On Mon, Jul 15, 2019 at 5:52 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
> I guess whether we roll back depends on timing. I think we are close
> to a release though, and these changes are complex and will require
> further testing, so rollback seems reasonable to me. I think from a code
> management perspective it will be simplest to disable direct
> addressing for now, rather than actually reverting the various commits
> that are in place. I can post a patch doing that today.
>
> I like the ideas you have for compressing FSTs further. It was
> bothering me that we store the labels needlessly. I do think that
> before making more radical changes to Arc though, I would like to add
> some encapsulation so that we can be a bit freer without being
> concerned about the abstraction leaking (several classes depend on the
> Arc internals today). E.g. I'd like to make its members private and add
> getters. I know this is a performance-sensitive area, and maybe we had
> a reason for not using them? Do we have some experience that suggests
> that would be a performance issue? My assumption is that JIT
> compilation would make that free, but I haven't tested.
>
> On Mon, Jul 15, 2019 at 11:36 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > That would be great. I wonder whether we could also make the encoding a
> > bit more efficient. For instance I noticed that arc metadata is pretty
> > large in some cases (in the 10-20 bytes range), which makes gaps very costly.
> > Associating each label with a dense id and having an intermediate
> > lookup, i.e. lookup label -> id and then id -> arc offset instead of
> > doing label -> arc directly, could save a lot of space in some cases?
> > Also it seems that we are repeating the label in the arc metadata when
> > array-with-gaps is used, even though it shouldn't be necessary since
> > the label is implicit from the address?
> >
> > Do you think we can have a mitigation for worst-case scenarios in 8.2
> > or should we revert from branch_8_2 to keep the release process going
> > and work on this for 8.3?
> >
> > On Mon, Jul 15, 2019 at 5:12 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > Thanks for the nice test, Adrien. Yes, the tradeoff of direct
> > > addressing is heavily data-dependent. I think we can improve the
> > > situation here by tracking, per-FST instance, the size increase we're
> > > seeing while building (or perhaps do a preliminary pass before
> > > building) in order to decide whether to apply the encoding.
> > >
> > > On Mon, Jul 15, 2019 at 9:02 AM Adrien Grand <jpou...@gmail.com> wrote:
> > > >
> > > > I dug into this a bit and suspect that the issue is mostly with one field
> > > > that is not part of the data but auto-generated: the ID field. It is a
> > > > slight variant of Flake IDs, so it's not random, it includes a
> > > > timestamp and a sequence number, and I suspect that its patterns
> > > > combined with an alphabet larger than ASCII make this size increase
> > > > more likely than with the data set you tested against.
> > > >
> > > > For instance I ran the following code with direct array addressing on
> > > > and off to simulate a worst-case scenario.
> > > >
> > > > public static void main(String[] args) throws IOException {
> > > >   Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
> > > >   IndexWriter w = new IndexWriter(dir, new IndexWriterConfig().setOpenMode(OpenMode.CREATE));
> > > >   byte[] b = new byte[5];
> > > >   Random r = new Random(0);
> > > >   for (int i = 0; i < 1000000; ++i) {
> > > >     r.nextBytes(b);
> > > >     for (int j = 0; j < b.length; ++j) {
> > > >       b[j] &= 0xfc; // make this byte a multiple of 4
> > > >     }
> > > >     Document doc = new Document();
> > > >     StringField field = new StringField("f", new BytesRef(b), Store.NO);
> > > >     doc.add(field);
> > > >     w.addDocument(doc);
> > > >   }
> > > >   w.forceMerge(1);
> > > >   IndexReader reader = DirectoryReader.open(w);
> > > >   w.close();
> > > >   if (reader.leaves().size() != 1) {
> > > >     throw new Error();
> > > >   }
> > > >   LeafReader leaf = reader.leaves().get(0).reader();
> > > >   System.out.println(((SegmentReader) leaf).ramBytesUsed());
> > > >   reader.close();
> > > >   dir.close();
> > > > }
> > > >
> > > > When direct addressing is enabled (default), I get 586079. If I
> > > > disable direct addressing by applying the below patch, then I get
> > > > 156228 - about 3.75x less.
> > > >
> > > > diff --git a/lucene/core/src/java/org/apache/lucene/util/fst/FST.java b/lucene/core/src/java/org/apache/lucene/util/fst/FST.java
> > > > index f308f1a..ff99cc2 100644
> > > > --- a/lucene/core/src/java/org/apache/lucene/util/fst/FST.java
> > > > +++ b/lucene/core/src/java/org/apache/lucene/util/fst/FST.java
> > > > @@ -647,7 +647,7 @@ public final class FST<T> implements Accountable {
> > > >        // array that may have holes in it so that we can address the arcs directly by label without
> > > >        // binary search
> > > >        int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
> > > > -      boolean writeDirectly = labelRange > 0 && labelRange < Builder.DIRECT_ARC_LOAD_FACTOR * nodeIn.numArcs;
> > > > +      boolean writeDirectly = false; // labelRange > 0 && labelRange < Builder.DIRECT_ARC_LOAD_FACTOR * nodeIn.numArcs;
> > > >
> > > >        //System.out.println("write int @pos=" + (fixedArrayStart-4) + " numArcs=" + nodeIn.numArcs);
> > > >        // create the header
> > > >
> > > > On Mon, Jul 15, 2019 at 2:33 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > > > >
> > > > > OK, both LUCENE-8781 and LUCENE-8895 were introduced in 8.2.0. I see
> > > > > most of the other data sets report an increase more in the 10-15%
> > > > > range, which is expected. I'm curious what the makeup of that http
> > > > > logs data set is -- I guess it's HTTP logs :) Is the data public?
> > > > >
> > > > > On Mon, Jul 15, 2019 at 7:23 AM Ignacio Vera <iver...@gmail.com> wrote:
> > > > > >
> > > > > > The change to Lucene 8.2.0 snapshot was done on July 10th. Previous
> > > > > > to that the Lucene version was 8.1.0.
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 12:53 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > > > > >>
> > > > > >> Hmm that's possible, although the jump is bigger than anything I
> > > > > >> observed while testing. I assume these charts are building off of
> > > > > >> apache/master, or something close to that? If so, then the timing is
> > > > > >> off a bit. LUCENE-8781 was pushed quite a while before that, and then
> > > > > >> https://issues.apache.org/jira/browse/LUCENE-8895 which extended the
> > > > > >> encoding to be the default (not just for postings) was pushed on July
> > > > > >> 2 or so, but the chart shows a jump on July 10?
> > > > > >>
> > > > > >> On Mon, Jul 15, 2019 at 4:03 AM Ignacio Vera <iver...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > Hi,
> > > > > >> >
> > > > > >> > We observed using a snapshot of Lucene 8.2 that there is an
> > > > > >> > increase of around 30% on the memory usage of IndexReaders for
> > > > > >> > some of the test datasets, for example:
> > > > > >> >
> > > > > >> > https://elasticsearch-benchmarks.elastic.co/#tracks/http-logs/nightly/default/30d
> > > > > >> >
> > > > > >> > We suspect this is due to this change:
> > > > > >> > https://issues.apache.org/jira/browse/LUCENE-8781
> > > > > >> >
> > > > > >> > On Sun, Jul 14, 2019 at 7:10 AM David Smiley <david.w.smi...@gmail.com> wrote:
> > > > > >> >>
> > > > > >> >> Since there won't be any 8.1.2 yet some issues got fixed for
> > > > > >> >> 8.1.2 and there is an 8.1.2 section in CHANGES.txt those issues
> > > > > >> >> might not be very noticeable to users that only look at the
> > > > > >> >> published HTML version (e.g.
> > > > > >> >> https://lucene.apache.org/solr/8_1_1/changes/Changes.html ).
> > > > > >> >> Maybe 8.1.2 should be integrated into 8.2.0 in CHANGES.txt?
> > > > > >> >> Despite this, I see at least one of those issues got into the
> > > > > >> >> curated release notes / highlights any way -- thanks Ignacio.
> > > > > >> >>
> > > > > >> >> ~ David Smiley
> > > > > >> >> Apache Lucene/Solr Search Developer
> > > > > >> >> http://www.linkedin.com/in/davidwsmiley
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Fri, Jul 12, 2019 at 9:40 AM Jan Høydahl <jan....@cominvent.com> wrote:
> > > > > >> >>>
> > > > > >> >>> Please use HTTPS in the links to download pages.
> > > > > >> >>>
> > > > > >> >>> Jan Høydahl
> > > > > >> >>>
> > > > > >> >>> On 12 Jul 2019, at 09:04, Ignacio Vera <iver...@gmail.com> wrote:
> > > > > >> >>>
> > > > > >> >>> Ishan: I had a look into the issues and I have no objections
> > > > > >> >>> as long as they get properly reviewed if possible. It will be
> > > > > >> >>> good to commit them shortly so they go through a few CI
> > > > > >> >>> iterations in case something gets broken. I am planning to
> > > > > >> >>> build the first RC early next week as there are no blockers
> > > > > >> >>> for the release.
> > > > > >> >>>
> > > > > >> >>> Steve: Thank you so much, I need to work on getting the right
> > > > > >> >>> permissions.
> > > > > >> >>>
> > > > > >> >>> Finally I wrote a draft for the release notes for Lucene and
> > > > > >> >>> Solr. It would be good if someone with more experience in Solr
> > > > > >> >>> can review/modify my attempt as it is difficult for me to know
> > > > > >> >>> which are the most important bits. Here are the links to the
> > > > > >> >>> drafts (note they are in the wiki, let me know if you have problems
> > > > > >> >>> accessing them):
> > > > > >> >>>
> > > > > >> >>> Lucene:
> > > > > >> >>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=120732808&draftShareId=cb366dc4-c136-4505-9c37-60bde5db2550&src=shareui&src.shareui.timestamp=1562914476369
> > > > > >> >>>
> > > > > >> >>> Solr:
> > > > > >> >>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=120732972&draftShareId=5cace703-b80b-49c4-a07f-55b891683f90&src=shareui&src.shareui.timestamp=1562914529931
> > > > > >> >>>
> > > > > >> >>> On Thu, Jul 11, 2019 at 6:36 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> > > > > >> >>>>
> > > > > >> >>>> Hi Ignacio,
> > > > > >> >>>> I wish to include two security bug fixes (not
> > > > > >> >>>> vulnerabilities, but feature regressions due to Authorization
> > > > > >> >>>> plugin), SOLR-13472 and SOLR-13619. I can commit both
> > > > > >> >>>> shortly, attempting to write a unit test for it (which is
> > > > > >> >>>> proving harder to do than reproducing, fixing and testing
> > > > > >> >>>> manually). Please let me know if you have any concerns.
> > > > > >> >>>> Regards,
> > > > > >> >>>> Ishan
> > > > > >> >>>>
> > > > > >> >>>> On Thu, 11 Jul, 2019, 9:12 PM Tomoko Uchida, <tomoko.uchida.1...@gmail.com> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>> Hi Ignacio,
> > > > > >> >>>>>
> > > > > >> >>>>> LUCENE-8907 was fixed. (I have reverted a series of commits which
> > > > > >> >>>>> cause backwards incompatibility on Lucene 8.x.)
> > > > > >> >>>>> Thank you for waiting for that!
> > > > > >> >>>>>
> > > > > >> >>>>> Tomoko
> > > > > >> >>>>>
> > > > > >> >>>>> On Thu, Jul 11, 2019 at 22:44, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > > >> >>>>> >
> > > > > >> >>>>> > Hi,
> > > > > >> >>>>> >
> > > > > >> >>>>> > I enabled the policeman Jenkins Jobs for 8.2 branch.
> > > > > >> >>>>> >
> > > > > >> >>>>> > Uwe
> > > > > >> >>>>> >
> > > > > >> >>>>> > -----
> > > > > >> >>>>> > Uwe Schindler
> > > > > >> >>>>> > Achterdiek 19, D-28357 Bremen
> > > > > >> >>>>> > https://www.thetaphi.de
> > > > > >> >>>>> > eMail: u...@thetaphi.de
> > > > > >> >>>>> >
> > > > > >> >>>>> > From: Ignacio Vera <iver...@gmail.com>
> > > > > >> >>>>> > Sent: Thursday, July 11, 2019 1:05 PM
> > > > > >> >>>>> > To: dev@lucene.apache.org
> > > > > >> >>>>> > Subject: Re: Lucene/Solr 8.2.0
> > > > > >> >>>>> >
> > > > > >> >>>>> > Hi,
> > > > > >> >>>>> >
> > > > > >> >>>>> > The branch has been created. As a reminder, this branch is
> > > > > >> >>>>> > on feature freeze and only documentation or build patches
> > > > > >> >>>>> > should be committed. I will be waiting for LUCENE-8907 to
> > > > > >> >>>>> > start building the first release candidate.
> > > > > >> >>>>> >
> > > > > >> >>>>> > Let me know if there is any other blocker before we can
> > > > > >> >>>>> > start the release process.
> > > > > >> >>>>> >
> > > > > >> >>>>> > It seems I do not have the permissions to create the
> > > > > >> >>>>> > Jenkins jobs for this branch, maybe Steve can help here?
> > > > > >> >>>>> >
> > > > > >> >>>>> > Thanks,
> > > > > >> >>>>> >
> > > > > >> >>>>> > Ignacio
> > > > > >> >>>>> >
> > > > > >> >>>>> > On Thu, Jul 11, 2019 at 4:51 AM David Smiley <david.w.smi...@gmail.com> wrote:
> > > > > >> >>>>> >
> > > > > >> >>>>> > BTW for 8.2.0 I updated Solr's CHANGES.txt to split out
> > > > > >> >>>>> > issues that seemed to be Improvements that were not really
> > > > > >> >>>>> > New Features.
> > > > > >> >>>>> >
> > > > > >> >>>>> > ~ David Smiley
> > > > > >> >>>>> > Apache Lucene/Solr Search Developer
> > > > > >> >>>>> > http://www.linkedin.com/in/davidwsmiley
> > > > > >> >>>>> >
> > > > > >> >>>>> > On Wed, Jul 10, 2019 at 10:38 AM Ignacio Vera <iver...@gmail.com> wrote:
> > > > > >> >>>>> >
> > > > > >> >>>>> > Thanks Tomoko for taking care of that.
> > > > > >> >>>>> >
> > > > > >> >>>>> > On Wed, Jul 10, 2019 at 4:03 PM Đạt Cao Mạnh <caomanhdat...@gmail.com> wrote:
> > > > > >> >>>>> >
> > > > > >> >>>>> > Hi Ignacio,
> > > > > >> >>>>> >
> > > > > >> >>>>> > 8.1.2 bugfix release will be cancelled. You can go ahead with
> > > > > >> >>>>> > 8.2 release.
> > > > > >> >>>>> >
> > > > > >> >>>>> > Thanks!
> > > > > >> >>>>> >
> > > > > >> >>>>> > On Wed, 10 Jul 2019 at 20:38, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> > > > > >> >>>>> >
> > > > > >> >>>>> > Hi,
> > > > > >> >>>>> > I opened a blocker issue a while ago for release 8.2:
> > > > > >> >>>>> > https://issues.apache.org/jira/browse/LUCENE-8907
> > > > > >> >>>>> >
> > > > > >> >>>>> > Sorry about that, I noticed the backwards incompatibility we have to
> > > > > >> >>>>> > deal with today. If there are no objections, I will revert all the
> > > > > >> >>>>> > related commits from the branch_8x and 8_2 in a few days.
> > > > > >> >>>>> >
> > > > > >> >>>>> > Thanks,
> > > > > >> >>>>> > Tomoko
> > > > > >> >>>>> >
> > > > > >> >>>>> > On Wed, Jul 10, 2019 at 22:02, Ignacio Vera <iver...@gmail.com> wrote:
> > > > > >> >>>>> > >
> > > > > >> >>>>> > > Hi,
> > > > > >> >>>>> > >
> > > > > >> >>>>> > > All the issues listed above have already been committed
> > > > > >> >>>>> > > and I see no blockers for release 8.2. I will cut the
> > > > > >> >>>>> > > branch tomorrow around 10am CEST and I will wait for the
> > > > > >> >>>>> > > decision on the bug release 8.1.2 to schedule the build
> > > > > >> >>>>> > > of the first release candidate. Please let us know if
> > > > > >> >>>>> > > this is troublesome for you.
> > > > > >> >>>>> > >
> > > > > >> >>>>> > > Thanks,
> > > > > >> >>>>> > >
> > > > > >> >>>>> > > Ignacio
> > > > > >> >>>>> > >
> > > > > >> >>>>> > >
> > > > > >> >>>>> > > On Tue, Jul 2, 2019 at 2:59 AM Joel Bernstein <joels...@gmail.com> wrote:
> > > > > >> >>>>> > >>
> > > > > >> >>>>> > >> I've got one issue that I'd like to get in
> > > > > >> >>>>> > >> (https://issues.apache.org/jira/browse/SOLR-13589),
> > > > > >> >>>>> > >> which I should have wrapped up in a day or two. +1 for
> > > > > >> >>>>> > >> around July 10th.
> > > > > >> >>>>> > >>
> > > > > >> >>>>> > >> On Mon, Jul 1, 2019 at 5:14 PM Nicholas Knize <nkn...@gmail.com> wrote:
> > > > > >> >>>>> > >>>
> > > > > >> >>>>> > >>> +1 for starting the 8.2 release process. I think it
> > > > > >> >>>>> > >>> would be good to get the LUCENE-8632 feature into 8.2
> > > > > >> >>>>> > >>> along with the BKD improvements and changes in
> > > > > >> >>>>> > >>> LUCENE-8888 and LUCENE-8896
> > > > > >> >>>>> > >>>
> > > > > >> >>>>> > >>> Nicholas Knize, Ph.D., GISP
> > > > > >> >>>>> > >>> Geospatial Software Guy | Elasticsearch
> > > > > >> >>>>> > >>> Apache Lucene PMC Member and Committer
> > > > > >> >>>>> > >>> nkn...@apache.org
> > > > > >> >>>>> > >>>
> > > > > >> >>>>> > >>>
> > > > > >> >>>>> > >>> On Wed, Jun 26, 2019 at 9:34 AM Ignacio Vera <iver...@gmail.com> wrote:
> > > > > >> >>>>> > >>>>
> > > > > >> >>>>> > >>>> Hi all,
> > > > > >> >>>>> > >>>>
> > > > > >> >>>>> > >>>> 8.1 was released on May 16th and we have new
> > > > > >> >>>>> > >>>> features, enhancements and fixes that are not
> > > > > >> >>>>> > >>>> released yet so I'd like to start thinking about
> > > > > >> >>>>> > >>>> releasing Lucene/Solr 8.2.0.
> > > > > >> >>>>> > >>>>
> > > > > >> >>>>> > >>>> I can create the 8.2 branch in two weeks' time (around
> > > > > >> >>>>> > >>>> July 10th) and build the first RC by the end of that
> > > > > >> >>>>> > >>>> week if that works for everyone. Please let me know
> > > > > >> >>>>> > >>>> if there are bug fixes that need to be included in 8.2
> > > > > >> >>>>> > >>>> and might not be ready by then.
> > > > > >> >>>>> > >>>>
> > > > > >> >>>>> > >>>> Cheers,
> > > > > >> >>>>> > >>>>
> > > > > >> >>>>> > >>>> Ignacio
> > > > > >> >>>>> >
> > > > > >> >>>>> > ---------------------------------------------------------------------
> > > > > >> >>>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > > > >> >>>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
> > > > > >> >>>>> >
> > > > > >> >>>>> > --
> > > > > >> >>>>> >
> > > > > >> >>>>> > Best regards,
> > > > > >> >>>>> >
> > > > > >> >>>>> > Cao Mạnh Đạt
> > > > > >> >>>>> >
> > > > > >> >>>>> > E-mail: caomanhdat...@gmail.com
> > > >
> > > > --
> > > > Adrien
> >
> > --
> > Adrien

--
Adrien
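
A rough sketch of the space tradeoff discussed in this thread, under loudly simplified assumptions: every arc occupies the same fixed number of bytes, and ids in the intermediate lookup table have a fixed width. `DirectAddressingCost`, its method names and the byte sizes are hypothetical; none of this is Lucene's actual FST encoding.

```java
// Hypothetical cost model; NOT Lucene's real FST encoding.
// All methods assume a non-empty array of labels sorted in increasing order.
final class DirectAddressingCost {

  // Number of slots needed to address every label in [min, max] directly.
  static int labelRange(int[] sortedLabels) {
    return sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
  }

  // Packed layout (binary search): one slot per arc that exists.
  static long packedBytes(int[] sortedLabels, int bytesPerArc) {
    return (long) sortedLabels.length * bytesPerArc;
  }

  // Array-with-gaps: one full-size slot per label in the range,
  // so every missing label still costs bytesPerArc.
  static long directBytes(int[] sortedLabels, int bytesPerArc) {
    return (long) labelRange(sortedLabels) * bytesPerArc;
  }

  // Intermediate lookup (label -> dense id -> arc): gaps cost only
  // bytesPerId each, and the arcs themselves stay packed.
  static long indirectBytes(int[] sortedLabels, int bytesPerArc, int bytesPerId) {
    return (long) labelRange(sortedLabels) * bytesPerId
        + (long) sortedLabels.length * bytesPerArc;
  }
}
```

For labels that are all multiples of 4 between 0 and 252 (64 arcs, a range of 253), with an assumed 16 bytes per arc and 1 byte per id, this model gives 1024 bytes packed, 4048 bytes with gaps, and 1277 bytes with the intermediate lookup.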