org.apache.lucene.search.PhraseWildcardQuery looks very good; I hope it makes it into a Lucene release soon. Thanks
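Since no query parser produces this query yet, a minimal sketch of building it programmatically might look like the following. This is based on the sandbox API around the time of this thread (Lucene 8.5); the Builder signature, the max-expansions argument, and the field/term values here are assumptions for illustration — check the current javadoc before relying on them:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseWildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.util.BytesRef;

public class PhraseWildcardDemo {

    // Builds a query roughly equivalent to the phrase "term1 te*" in the
    // given field: an exact first term followed by a prefix-expanded term,
    // with multi-term expansion capped at maxExpansions terms.
    public static PhraseWildcardQuery buildQuery(String field, int maxExpansions) {
        return new PhraseWildcardQuery.Builder(field, maxExpansions)
                .addTerm(new BytesRef("term1"))                       // exact phrase term
                .addMultiTerm(new PrefixQuery(new Term(field, "te"))) // wildcard-like term
                .build();
    }
}
```

The expansion cap is the interesting knob for the slowdown discussed below: it bounds how many terms a short prefix like "t*" is allowed to expand into.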
> On Feb 12, 2020, at 10:01 PM, baris.ka...@oracle.com wrote:
>
> Thanks, David. Can I look at the source code?
> I think ComplexPhraseQueryParser uses something similar.
> I will check the differences, but do you know them offhand, for quick reference?
> Thanks
>
>> On Feb 12, 2020, at 6:41 PM, David Smiley <david.w.smi...@gmail.com> wrote:
>>
>> Hi,
>>
>> See org.apache.lucene.search.PhraseWildcardQuery in Lucene's sandbox module.
>> It was recently added by my amazing colleague Bruno. At this time there is
>> no query parser in Lucene that uses it, unfortunately, but you can rectify
>> this for your own purposes. I hope this query "graduates" to Lucene core
>> some day. Its placement in the sandbox is why it can't be added to any of
>> Lucene's query parsers, such as the complex phrase parser.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>> On Wed, Feb 12, 2020 at 11:07 AM <baris.ka...@oracle.com> wrote:
>>> Hi,-
>>>
>>> Regarding the mechanisms I mentioned below:
>>>
>>> does this class offer any embedded shingling capability?
>>>
>>> I could not find any API within ComplexPhraseQueryParser for that purpose.
>>>
>>> For instance, does this class offer a most-commonly-used-words API?
>>>
>>> I could then use one of those words to get a second character for the search, like
>>>
>>> term1 term2FirstCharTerm2SecondChar* (where I would look up
>>> term2FirstChar in my dictionary hashmap to find the most common word
>>> starting with it, and put that word's second character into the search query).
>>>
>>> Having the second character in the search query reduces search time about 20-fold.
>>>
>>> Otherwise, do I have to use the following at index time? I already have
>>> a TextField index with my custom analyzer.
>>>
>>> How should I embed the shingle filter into my current custom analyzer?
>>> I don't want to disturb my current indexing.
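The dictionary-hashmap idea above (look up the most common word by its first character, then borrow that word's second character to lengthen the prefix) can be precomputed once over the vocabulary. A plain-Java sketch — the word counts here are hypothetical stand-ins for real corpus statistics:

```java
import java.util.HashMap;
import java.util.Map;

public class CommonWordDictionary {

    // Maps each initial letter to the most frequent word starting with it.
    public static Map<Character, String> build(Map<String, Integer> wordCounts) {
        Map<Character, String> best = new HashMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            String word = e.getKey();
            if (word.isEmpty()) continue;
            char first = Character.toLowerCase(word.charAt(0));
            String current = best.get(first);
            if (current == null || wordCounts.get(current) < e.getValue()) {
                best.put(first, word);
            }
        }
        return best;
    }

    // Expands a one-char prefix into a two-char prefix using the dictionary,
    // falling back to the single character when no expansion is known.
    public static String expandPrefix(char firstChar, Map<Character, String> best) {
        String word = best.get(Character.toLowerCase(firstChar));
        return (word != null && word.length() >= 2)
                ? word.substring(0, 2)
                : String.valueOf(firstChar);
    }
}
```

For example, if "the" is the most common t-word in the data, the query prefix "t" would become "th", which is what produces the roughly 20x speedup mentioned above.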
>>>
>>> All I want to do is find the most common word in my data for each letter
>>> of the alphabet.
>>>
>>> Should I do this at search time? That would be costly, right?
>>>
>>> view-source:http://www.philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/
>>>
>>> If you need to parse the token n-grams of a string, you may use the
>>> facilities offered by Lucene analyzers.
>>>
>>> What you simply have to do is build your own analyzer using a
>>> ShingleMatrixFilter with the parameters that suit your needs. For
>>> instance, here are the few lines of code to build a token bi-gram analyzer:
>>>
>>> public class NGramAnalyzer extends Analyzer {
>>>     @Override
>>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>>         return new StopFilter(
>>>             new LowerCaseFilter(
>>>                 new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
>>>             StopAnalyzer.ENGLISH_STOP_WORDS);
>>>     }
>>> }
>>>
>>> The parameters of the ShingleMatrixFilter simply state the minimum and
>>> maximum shingle size. "Shingle" is just another name for token n-grams,
>>> and shingles are popular as the basic units for solving problems in
>>> spell checking, near-duplicate detection, and others.
>>> Note also the use of a StandardTokenizer to deal with basic special
>>> characters like hyphens and other "disturbers".
>>>
>>> To use the analyzer, you can for instance do:
>>>
>>> public static void main(String[] args) {
>>>     try {
>>>         String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
>>>         Analyzer analyzer = new NGramAnalyzer();
>>>
>>>         TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
>>>         Token token = new Token();
>>>         while ((token = stream.next(token)) != null) {
>>>             System.out.println(token.term());
>>>         }
>>>     } catch (IOException ie) {
>>>         System.out.println("IO Error " + ie.getMessage());
>>>     }
>>> }
>>>
>>> The output will print:
>>>
>>> an easy
>>> easy way
>>> way to
>>> to write
>>> write an
>>> an analyzer
>>> analyzer for
>>> for tokens
>>> tokens bi
>>> bi gram
>>> gram or
>>> or even
>>> even tokens
>>> tokens n
>>> n grams
>>> grams with
>>> with lucene
>>>
>>> Note that the text "bi-gram" was treated as two different tokens, a
>>> desired consequence of using a StandardTokenizer in the
>>> ShingleMatrixFilter initialization.
>>>
>>> Best regards
>>>
>>> On 2/4/20 11:14 AM, baris.ka...@oracle.com wrote:
>>> >
>>> > Thanks, but I thought this class would have a mechanism to fix this issue.
>>> > Thanks
>>> >
>>> >> On Feb 4, 2020, at 4:14 AM, Mikhail Khludnev <m...@apache.org> wrote:
>>> >>
>>> >> It's slow per se, since it loads term positions. The usual advice is
>>> >> shingling or edge n-grams. Note, if this is not text but a string or
>>> >> an enum, that probably lets you apply other tricks. Another idea is that
>>> >> IntervalQueries can perhaps be smarter and faster in certain cases,
>>> >> although they are backed by the same slow positions.
>>> >>
>>> >>> On Tue, Feb 4, 2020 at 7:25 AM <baris.ka...@oracle.com> wrote:
>>> >>>
>>> >>> How can this slowdown be resolved?
>>> >>> Is this another limitation of this class?
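A caveat on the quoted blog post: it targets a much older Lucene, and both its `tokenStream(String, Reader)` override and `ShingleMatrixFilter` were later deprecated and removed (the analyzers-common module's `ShingleFilter` is the surviving counterpart). The bi-gram idea itself is simple enough to illustrate without Lucene at all; a stdlib-only sketch that reproduces the same output, where the regex tokenizer is a rough stand-in for StandardTokenizer:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenBigrams {

    // Lowercases, splits on non-alphanumeric runs (so "bi-gram" becomes two
    // tokens, as in the blog post), and emits adjacent-token bi-grams.
    public static List<String> bigrams(String text) {
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].isEmpty() || tokens[i + 1].isEmpty()) continue;
            result.add(tokens[i] + " " + tokens[i + 1]);
        }
        return result;
    }
}
```

Running this on the blog's sample sentence yields the same 17 bi-grams listed above, from "an easy" through "with lucene".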
>>> >>> Thanks
>>> >>>
>>> >>>> On Feb 3, 2020, at 4:14 PM, baris.ka...@oracle.com wrote:
>>> >>>> Please ignore the first comparison there. I was comparing {term1
>>> >>>> with 2 chars} vs. {term1 with >= 5 chars + term2 with 1 char}.
>>> >>>>
>>> >>>> The slowdown is:
>>> >>>>
>>> >>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>>> >>>> compared to "term1*" when term1 has > 5 chars and term2 is still 1 char.
>>> >>>>
>>> >>>> Best regards
>>> >>>>
>>> >>>>> On 2/3/20 4:13 PM, baris.ka...@oracle.com wrote:
>>> >>>>> Hi,-
>>> >>>>>
>>> >>>>> I hope everyone is doing great.
>>> >>>>>
>>> >>>>> I saw this issue with this class: if you search for "term1*",
>>> >>>>> performance is good (about 4 millisecs when it has >= 5 chars, and
>>> >>>>> ~250 millisecs when it has 2 chars),
>>> >>>>> but when you search for "term1 term2*", where term2 is a single
>>> >>>>> char, performance degrades badly.
>>> >>>>> The query "term1 term2*" slows down 50 times (~200 millisecs)
>>> >>>>> compared to the "term1*" case when term1 has > 5 chars and term2 is
>>> >>>>> still 1 char.
>>> >>>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>>> >>>>> compared to "term1*" when term1 has > 5 chars and term2 is still 1 char.
>>> >>>>> Is there any suggestion to speed it up?
>>> >>>>>
>>> >>>>> Best regards
>>> >>>>>
>>> >>>>> ---------------------------------------------------------------------
>>> >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >>
>>> >> --
>>> >> Sincerely yours
>>> >> Mikhail Khludnev
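On Mikhail's edge n-gram suggestion above: the idea is to index the prefixes of each term so that a trailing-wildcard query like "te*" becomes a cheap exact-term lookup instead of a term expansion. Lucene ships this as EdgeNGramTokenFilter in analyzers-common; the core of it in plain Java, for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {

    // Emits the leading prefixes ("edge n-grams") of a term, from minGram to
    // maxGram characters. Indexing these alongside the original terms trades
    // index size for prefix-query speed: "te*" matches the stored gram "te"
    // exactly, with no positions-heavy expansion at search time.
    public static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }
}
```

This would go in a separate indexed field, which does mean re-indexing — the trade-off the thread is weighing against search-time tricks.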