Hi,-

Regarding the mechanisms I mentioned below:

does this class offer any shingling capability built into it?

I could not find any API in ComplexPhraseQueryParser for that purpose.


For instance, does this class offer an API for the most commonly used words?

I could then take one of those words and use its second and third characters in the search, like:

term1 term2FirstCharTerm2SecondChar* (where I would look up term2's first character in my dictionary HashMap to get the most common word for that letter and bring its second character into the search query)


Having the second character in the search query makes the search about 20 times faster.
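What I have in mind is roughly this sketch (the map and method names here are just placeholders of mine, not an existing API):

import java.util.HashMap;
import java.util.Map;

public class PrefixExpander {

    // initial letter -> most common word in my data starting with that letter
    private final Map<Character, String> mostCommonByInitial = new HashMap<>();

    // Build "term1 xy*" from term1 plus a one-char term2 prefix by borrowing the
    // second character of the most common word that starts with that letter.
    public String expandQuery(String term1, String term2Prefix) {
        char initial = Character.toLowerCase(term2Prefix.charAt(0));
        String common = mostCommonByInitial.get(initial);
        if (term2Prefix.length() == 1 && common != null && common.length() > 1) {
            term2Prefix = term2Prefix + common.charAt(1);
        }
        return term1 + " " + term2Prefix + "*";
    }
}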


Otherwise, do I have to use the following at index time? I already have a TextField index built with my custom analyzer.

How should I embed the shingle filter into my current custom analyzer? I don't want to disturb my current indexing.
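If it has to happen at index time, would something like this be the right way? This is only a rough sketch against the newer Analyzer API (exact packages and constructors depend on the Lucene version), and I would write the shingles to a separate field so my current indexing is not disturbed:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShinglingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // emit word bi-grams ("shingles") in addition to the original single tokens
        ShingleFilter shingles = new ShingleFilter(stream, 2, 2);
        shingles.setOutputUnigrams(true);
        return new TokenStreamComponents(source, shingles);
    }
}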

All I want to do is find the most common word in my data for each letter of the alphabet.

Should I do this at search time? That would be costly, right?
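If I have to compute that myself, I assume I can do it once offline by iterating the terms of the existing index rather than at search time, roughly like this sketch (assumes Lucene 8.x, where MultiTerms.getTerms exists; "index-dir" and "content" are placeholders for my actual index path and field):

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class MostCommonWordPerLetter {
    public static void main(String[] args) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index-dir")))) {
            Terms terms = MultiTerms.getTerms(reader, "content");
            TermsEnum te = terms.iterator();
            Map<Character, String> best = new HashMap<>();
            Map<Character, Long> bestFreq = new HashMap<>();
            BytesRef term;
            while ((term = te.next()) != null) {
                String word = term.utf8ToString();
                char initial = word.charAt(0);
                long freq = te.totalTermFreq(); // total occurrences of this term
                if (freq > bestFreq.getOrDefault(initial, 0L)) {
                    bestFreq.put(initial, freq);
                    best.put(initial, word);
                }
            }
            best.forEach((c, w) -> System.out.println(c + " -> " + w));
        }
    }
}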


http://www.philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/ (quoted below)


If you need to parse the token n-grams of a string, you may use the facilities offered by Lucene analyzers. What you simply have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code to build a token bi-gram analyzer:
public class NGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new LowerCaseFilter(new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
            StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}
The parameters of the ShingleMatrixFilter simply state the minimum and maximum shingle size. "Shingle" is just another name for a token n-gram; shingles are popular as the basic units for solving problems in spell checking, near-duplicate detection and others. Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other "disturbers".
To use the analyzer, you can for instance do:

    public static void main(String[] args) {
        try {
            String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
            Analyzer analyzer = new NGramAnalyzer();

            TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
            Token token = new Token();
            while ((token = stream.next(token)) != null){
                System.out.println(token.term());
            }

        } catch (IOException ie) {
            System.out.println("IO Error " + ie.getMessage());
        }
    }

The output will print:

an easy
easy way
way to
to write
write an
an analyzer
analyzer for
for tokens
tokens bi
bi gram
gram or
or even
even tokens
tokens n
n grams
grams with
with lucene

Note that the text "bi-gram" was treated as two different tokens, a desired consequence of using a StandardTokenizer in the ShingleMatrixFilter initialization.


Best regards

On 2/4/20 11:14 AM, baris.ka...@oracle.com wrote:

Thanks, but I thought this class would have a mechanism to fix this issue.
Thanks

On Feb 4, 2020, at 4:14 AM, Mikhail Khludnev <m...@apache.org> wrote:

It's slow per se, since it loads term positions. The usual advice is shingling or edge n-grams. Note that if this is not text but a string or enum, that probably lets you apply other tricks. Another idea: IntervalQueries can perhaps be smarter and faster in certain cases, although they are backed by the same slow positions.

On Tue, Feb 4, 2020 at 7:25 AM <baris.ka...@oracle.com> wrote:

How can this slowdown be resolved?
Is this another limitation of this class?
Thanks

On Feb 3, 2020, at 4:14 PM, baris.ka...@oracle.com wrote:
Please ignore the first comparison there. I was comparing {term1 with 2 chars} vs. {term1 with >= 5 chars + term2 with 1 char}.

The slowdown is:

The query "term1 term2*" is about 400 times slower (~1500 millisecs) than "term1*" when term1 has > 5 chars and term2 is still 1 char.
Best regards


On 2/3/20 4:13 PM, baris.ka...@oracle.com wrote:
Hi,-

I hope everyone is doing great.

I saw this issue with this class: if you search for "term1*" it is fine (i.e., 4 millisecs when term1 has >= 5 chars, and ~250 millisecs when it has 2 chars),
but when you search for "term1 term2*" where term2 is a single char, the performance degrades badly.
The query "term1 term2*" is about 50 times slower (~200 millisecs) than the "term1*" case when term1 has > 5 chars and term2 is still 1 char.
The query "term1 term2*" is about 400 times slower (~1500 millisecs) than "term1*" when term1 has > 5 chars and term2 is still 1 char.
Is there any suggestion to speed it up?

Best regards





--
Sincerely yours
Mikhail Khludnev

