[
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-7620:
---------------------------------
Attachment: LUCENE_7620_UH_LengthGoalBreakIterator.patch
Here's an updated patch. I added assertions rather than exceptions because if,
by chance, this circumstance arises in production, it's really okay to return
a possibly wrong break and produce a passage that isn't quite the ideal size
rather than throw an exception.
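To illustrate the assert-instead-of-throw trade-off, here is a minimal sketch.
The class name, method name, and the clamping fallback are assumptions made
for illustration; they are not taken from the patch.

```java
// Hypothetical sketch; names are illustrative and not from the patch.
final class AssertNotThrowSketch {

  /** Returns a usable passage end even if the candidate break is unexpected. */
  static int safeBreak(int candidate, int start, int textLength) {
    // Checked only when the JVM runs with -ea (e.g. tests); skipped otherwise.
    assert candidate > start && candidate <= textLength
        : "unexpected break " + candidate + " for start " + start;
    // Assumed fallback: clamp toward the valid range so the caller still gets
    // a passage, possibly not the ideal size, instead of an exception.
    return Math.min(Math.max(candidate, start + 1), textLength);
  }
}
```

With assertions enabled, a bad break fails loudly; with them disabled (the JVM
default), the caller just gets a slightly off passage boundary.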
It now has 2 modes of operation, with 2 corresponding factory methods to make
the choice explicit: {{createMinLength(...)}} and {{createTargetLength(...)}}.
The minLength mode might be useful because it's faster than the targetLength
mode. I think it's more useful than a MaxLength mode (which could still be
added in the future) because a too-long passage can be trimmed by the client,
but the reverse is not true -- you can't lengthen a passage that is too short
once it has reached the client talking to a search server.
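The difference between the two modes can be sketched with a plain JDK
sentence BreakIterator. This is not the code in the patch: the class and
method names below are hypothetical, and the tie-breaking rule in the target
variant is an assumption.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; not the code in the attached patch.
final class LengthGoalSketch {

  /** Sentence iterator over the text (stand-in for any underlying BreakIterator). */
  static BreakIterator sentences(String text) {
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
    bi.setText(text);
    return bi;
  }

  /** minLength mode: first underlying break at or after start + minLength. */
  static int minLengthBreak(BreakIterator bi, int textLength, int start, int minLength) {
    int goal = start + Math.max(1, minLength);
    if (goal >= textLength) {
      return textLength; // not enough text left; the passage runs to the end
    }
    // following(n) returns the first boundary strictly after n, so ask from goal - 1.
    int after = bi.following(goal - 1);
    return after == BreakIterator.DONE ? textLength : after;
  }

  /** targetLength mode: underlying break closest to start + targetLength. */
  static int targetLengthBreak(BreakIterator bi, int textLength, int start, int targetLength) {
    int goal = start + Math.max(1, targetLength);
    if (goal >= textLength) {
      return textLength;
    }
    int after = bi.following(goal - 1);
    if (after == BreakIterator.DONE) {
      after = textLength;
    }
    if (after == goal) {
      return after; // a break sits exactly on the target
    }
    // Extra lookup: last boundary strictly before the target.
    int before = bi.preceding(goal);
    if (before == BreakIterator.DONE || before <= start) {
      return after; // nothing usable before the target
    }
    // Assumed tie-break: prefer the earlier break when equally close.
    return (goal - before) <= (after - goal) ? before : after;
  }

  public static void main(String[] args) {
    String text = "One. Two two. Three three three. Four.";
    BreakIterator bi = sentences(text);
    System.out.println(text.substring(0, minLengthBreak(bi, text.length(), 0, 8)));
    System.out.println(text.substring(0, targetLengthBreak(bi, text.length(), 0, 8)));
  }
}
```

In this sketch the target variant does one more boundary lookup than the
minLength variant, which is at least consistent with minLength being the
cheaper mode, though the patch may arrive at its breaks differently.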
I did some benchmarking too, which, in addition to measuring the overhead,
helped ensure it didn't throw exceptions (at least for the test queries & test
data). That never happened, though I did squash bugs in the test and chose
sizes to tease out the edge conditions. In so doing I found a minor bug in
CustomSeparatorBreakIterator, but I'll leave that for another time.
Benchmarking showed that minLength is noticeably faster than targetLength,
maybe 10% overall. I also observed (something I already knew) that a "cheap"
underlying BreakIterator like CustomSeparatorBreakIterator is ~20% faster than
a JDK sentence one.
I'll commit it this weekend, or possibly tonight if you review it positively
in time.
> UnifiedHighlighter: add target character width BreakIterator wrapper
> --------------------------------------------------------------------
>
> Key: LUCENE-7620
> URL: https://issues.apache.org/jira/browse/LUCENE-7620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch,
> LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates
> fragments (aka Passages) by a character width. The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.
> It's useful in its own right and of course it helps users transition to the
> UH. I'd like to do it as a wrapper to another BreakIterator -- perhaps a
> sentence one. In this way you get back Passages that are a number of
> sentences so they will look nice instead of breaking mid-way through a
> sentence. And you get some control by specifying a target number of
> characters. This BreakIterator wouldn't be a general purpose
> java.text.BreakIterator since it would assume it's called in a manner exactly
> as the UnifiedHighlighter uses it. It would probably be compatible with the
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your
> BreakIterator config.