I think we should flip the default of hl.fragsizeIsMinimum to 'true', so that the behavior is close to what preceded 8.5. (a) It was the behavior until very recently (<= 8.4), so users may need less tuning in 8.6 and onward. (b) It's significantly faster for long text: it seems to be 2x to 5x for long documents (assuming no change in hl.fragAlignRatio). If the user additionally configures hl.fragAlignRatio to 0 (also the previous behavior; 0.5 is the new default), I saw another 6x on top of that for "doc3" in the test data Michal prepared.

Although I like that the sizing looks nicer, I think that is more from the introduction and new default of hl.fragAlignRatio=0.5 than from hl.fragsizeIsMinimum=false. We might even consider lowering hl.fragAlignRatio to, say, 0.3 and retain pretty reasonable highlights (avoids the extreme cases occurring with '0') and get an additional performance benefit from that.

What do you think Nandor, Michal? I'm hoping a change in settings (+ some better notes/docs on this) could slip into an 8.6, all done by myself ASAP.

~ David

On Fri, Jun 19, 2020 at 2:32 PM Nándor Mátravölgyi <nandor.ma...@gmail.com> wrote:

> Hi!
>
> With the provided test I've profiled the preceding() and following()
> calls on the base Java iterators in the different options.
>
> === default highlighter arguments ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1130 calls of
> baseIter.preceding() took 1.039629 seconds in total
> - from LengthGoalBreakIterator.following(): 1140 calls of
> baseIter.following() took 0.340679 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1150 calls of
> baseIter.preceding() took 0.099344 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1100 calls of
> baseIter.following() took 0.015156 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1200 calls of
> baseIter.preceding() took 0.001006 seconds in total
> - from LengthGoalBreakIterator.following(): 1700 calls of
> baseIter.following() took 0.006278 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1710 calls of
> baseIter.preceding() took 0.016320 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1090 calls of
> baseIter.following() took 0.000527 seconds in total
>
> === hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 860 calls of
> baseIter.following() took 0.012593 seconds in total
> - from LengthGoalBreakIterator.preceding(): 870 calls of
> baseIter.preceding() took 0.022208 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1360 calls of
> baseIter.following() took 0.004789 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1370 calls of
> baseIter.preceding() took 0.015983 seconds in total
>
> === hl.fragsizeIsMinimum=true ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 980 calls of
> baseIter.following() took 0.010253 seconds in total
> - from LengthGoalBreakIterator.preceding(): 980 calls of
> baseIter.preceding() took 0.341997 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1670 calls of
> baseIter.following() took 0.005150 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1680 calls of
> baseIter.preceding() took 0.013657 seconds in total
>
> === hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1070 calls of
> baseIter.preceding() took 1.312056 seconds in total
> - from LengthGoalBreakIterator.following(): 1080 calls of
> baseIter.following() took 0.678575 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.preceding() took 0.020507 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.following() took 0.006977 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 880 calls of
> baseIter.preceding() took 0.000706 seconds in total
> - from LengthGoalBreakIterator.following(): 1370 calls of
> baseIter.following() took 0.004110 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.preceding() took 0.014752 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.following() took 0.000106 seconds in total
>
> There is definitely a big difference between SENTENCE and WORD. I'm
> not sure how we can improve the logic on our side while keeping the
> features as is. Since the number of calls is roughly the same for when
> the performance is good and bad, it seems to depend on what the text
> is that the iterator is traversing.
>
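
For anyone who wants to try a similar measurement outside of Solr, here is a minimal sketch that times following() and preceding() directly on the JDK's java.text.BreakIterator for the sentence and word instances. This is not the profiling code used for the numbers quoted above, and it does not involve LengthGoalBreakIterator at all. The class name BaseIteratorProfile, the helper methods, the synthetic text and the step size are all assumptions made for illustration, so the absolute timings will not match the thread.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper for illustration; not the profiling code from the thread.
public class BaseIteratorProfile {

    /** Call count and total elapsed time for one kind of call. */
    static final class Stats {
        long calls;
        long nanos;
        long checksum; // sum of returned boundaries, so the calls are not optimized away
    }

    /** Times following(offset) at offsets 0, step, 2*step, ... below text.length(). Assumes step > 0. */
    static Stats timeFollowing(BreakIterator it, String text, int step) {
        it.setText(text);
        Stats s = new Stats();
        for (int offset = 0; offset < text.length(); offset += step) {
            long start = System.nanoTime();
            int boundary = it.following(offset);
            s.nanos += System.nanoTime() - start;
            s.calls++;
            s.checksum += boundary;
        }
        return s;
    }

    /** Times preceding(offset) at offsets text.length(), text.length() - step, ... above 0. Assumes step > 0. */
    static Stats timePreceding(BreakIterator it, String text, int step) {
        it.setText(text);
        Stats s = new Stats();
        for (int offset = text.length(); offset > 0; offset -= step) {
            long start = System.nanoTime();
            int boundary = it.preceding(offset);
            s.nanos += System.nanoTime() - start;
            s.calls++;
            s.checksum += boundary;
        }
        return s;
    }

    /** Prints one summary line per base iterator, loosely in the style of the report above. */
    static void report(String name, BreakIterator it, String text, int step) {
        Stats f = timeFollowing(it, text, step);
        Stats p = timePreceding(it, text, step);
        System.out.printf("%s: %d calls of following() took %.6f s, %d calls of preceding() took %.6f s%n",
                name, f.calls, f.nanos / 1e9, p.calls, p.nanos / 1e9);
    }

    public static void main(String[] args) {
        // Synthetic long text (an assumption; the thread used Michal's test data).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5_000; i++) {
            sb.append("This is sentence number ").append(i).append(". ");
        }
        String text = sb.toString();
        int step = 1_000; // arbitrary spacing between probed offsets

        report("SENTENCE", BreakIterator.getSentenceInstance(Locale.ROOT), text, step);
        report("WORD", BreakIterator.getWordInstance(Locale.ROOT), text, step);
    }
}
```

Since the per-call cost seems to depend on the text being traversed, swapping the synthetic text for a real long document is probably the more interesting experiment.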