[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662505#comment-15662505 ]
David Smiley commented on LUCENE-7526: -------------------------------------- RE Benchmarks: Yeah it seems hard to get consistency. I even upped the iterations counts to 2000 (from 500) and saw more variability than I'd like. Nonetheless your analysis looks good to me; it's how I would sum up my observations today. I ran my benchmarks with "-Xms4G -Xmx4G -XX:NewRatio=1" (to reduce the effects of GC). My machine is a MacBook Pro Retina "Late 2013", 2 GHz i7, 8GB RAM. This index is on an external spinning disk but there is no disk activity after warm-up because it's all in the O.S. Cache. As I looked over things carefully, I made a few more changes; all pretty minor. A few were in reaction to "ant precommit". https://github.com/dsmiley/lucene-solr/commits/uh_Tim -- the test & some javadoc issues. I also did a little bit of refactoring I hope you'll be good with. In a couple cases I merely moved a method up or down so that the code flows top to bottom better. Regarding Passage: I'll create a separate issue expressly for making Passage usable by anyone customizing the highlighter. It's not as simple as addMatch being public; there are other methods. _What's there now is in good shape for committing._ A couple ideas occurred to me; _feel free to punt to another issue or never_: * MultiValueTokenStream isn't needed for MemoryIndexOffsetStrategy, albeit with a change to loop over content.split(separatorChar). MemoryIndex.addField is overloaded to take the position increment gap. Then, MultiValueTokenStream could move to an inner class of TokenStreamOffsetStrategy, and it wouldn't generally be used (as it's no longer by default). That'd be nice -- keeping the complexity over there, and it's a bit of a hack too. * We had made OffsetsEnum & TokenStreamPostingsEnum implement Closeable to ameliorate the ramifications of the text analysis code throwing an exception, i.e. due to a bug. The only beneficiary of this now is TokenStreamOffsetStrategy, which isn't the default anymore. It could be removed to simplify things. But then again, perhaps it could be useful for those implementing custom OffsetStrategies. I guess it should stay; there's very little to this after all. Proposed CHANGES.txt in "Improvements": * Enhanced UnifiedHighlighter's passage relevancy for queries with wildcards and sometimes just terms. Added shouldPreferPassageRelevancyOverSpeed() which can be overridden to return false to eek out more speed in some cases. (Tim Rodriguez, David Smiley) > Improvements to UnifiedHighlighter OffsetStrategies > --------------------------------------------------- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: Timothy M. Rodriguez > Assignee: David Smiley > Priority: Minor > Fix For: 6.4 > > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata. > * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. > ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. This should improve relevancy by providing unified > metrics for a wildcard across all it's term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org