[
https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753789#action_12753789
]
Chris Harris commented on LUCENE-1370:
--------------------------------------
{quote}
Here I think you could save a clone() by not calling captureState() twice?
Even though it doesn't have to recompute the state, captureState() does have to
clone it.
{quote}
So is the idea to replace
{code}
if (getNextToken()) {
  if (outputUnigramIfNoNgrams && firstToken == null) {
    firstToken = captureState();
  }
  shingleBuf.add(captureState());
{code}
with
{code}
if (getNextToken()) {
  State curState = captureState();
  if (outputUnigramIfNoNgrams && firstToken == null) {
    firstToken = curState;
  }
  shingleBuf.add(curState);
{code}
That seems fine, unless there's some hidden reason why State objects can't be
shared.

I'd guess you could optimize further than that, but I think you run into
diminishing returns: you'd be making the code harder to read more than you'd be
making it faster. For example:
{code}
if (getNextToken()) {
  if (outputUnigramIfNoNgrams && firstToken == null) {
    firstToken = captureState();
    shingleBuf.add(firstToken);
  } else {
    shingleBuf.add(captureState());
  }
{code}
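To make the saving concrete, here is a minimal, self-contained sketch that counts captures under both patterns. Everything in it is hypothetical: CaptureSketch and its nested State are plain-Java stand-ins, not the Lucene classes, and the sketch assumes a captured snapshot is only ever read after capture, which is the condition that would make sharing one State object safe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the filter's state handling; NOT Lucene code.
class CaptureSketch {
    /** Immutable snapshot standing in for a captured State. */
    static final class State {
        final String term;
        State(String term) { this.term = term; }
    }

    String currentTerm = "";
    int captureCount = 0;                       // one increment per snapshot (clone)
    State firstToken = null;
    final List<State> shingleBuf = new ArrayList<>();

    /** Stand-in for captureState(): every call allocates a fresh snapshot. */
    State captureState() {
        captureCount++;
        return new State(currentTerm);
    }

    /** Original pattern: the first token is captured twice. */
    void addTwice(String term, boolean outputUnigramIfNoNgrams) {
        currentTerm = term;
        if (outputUnigramIfNoNgrams && firstToken == null) {
            firstToken = captureState();
        }
        shingleBuf.add(captureState());
    }

    /** Proposed pattern: capture once and share the snapshot. */
    void addOnce(String term, boolean outputUnigramIfNoNgrams) {
        currentTerm = term;
        State curState = captureState();
        if (outputUnigramIfNoNgrams && firstToken == null) {
            firstToken = curState;
        }
        shingleBuf.add(curState);
    }

    public static void main(String[] args) {
        CaptureSketch twice = new CaptureSketch();
        CaptureSketch once = new CaptureSketch();
        for (String term : new String[] {"please", "divide"}) {
            twice.addTwice(term, true);
            once.addOnce(term, true);
        }
        System.out.println("captures, original pattern: " + twice.captureCount);
        System.out.println("captures, shared snapshot:  " + once.captureCount);
    }
}
```

The difference is a single capture per stream (only the first token takes the extra branch), which is why anything beyond this simple change is unlikely to be worth the readability cost.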
> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
> Key: LUCENE-1370
> URL: https://issues.apache.org/jira/browse/LUCENE-1370
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Chris Harris
> Assignee: Karl Wettin
> Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch,
> ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token
> stream is only one token long, then ShingleFilter.next() won't return any
> tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this
> option is set and the underlying stream is only one token long, then
> ShingleFilter will return that token, regardless of the setting of
> outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (with
> outputUnigrams==true), which expands the text as follows:
> "please divide this sentence into shingles" ->
> "please", "please divide"
> "divide", "divide this"
> "this", "this sentence"
> "sentence", "sentence into"
> "into", "into shingles"
> "shingles"
> Second, do query-time analysis using ShingleFilter (using
> outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters
> a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
> "please divide"
> "divide this"
> "this sentence"
> "sentence into"
> "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very
> considerable speedup. Without the outputUnigramIfNoNgrams option, a
> single-word query would tokenize like this:
> "please" ->
> [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like
> this:
> "please" ->
> "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I
> thought I should throw it up here and try to find out.