org.apache.lucene.search.PhraseWildcardQuery looks very good; I hope it makes it into a Lucene release soon. Thanks
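Since no query parser produces this query yet, a minimal sketch of building it programmatically might look like the following. This is based on the sandbox API around the time of this thread (Lucene 8.5); the Builder signature, the max-expansions argument, and the field/term values here are assumptions for illustration — check the current javadoc before relying on them:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseWildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.util.BytesRef;

public class PhraseWildcardDemo {

    // Builds a query roughly equivalent to the phrase "term1 te*" in the
    // given field: an exact first term followed by a prefix-expanded term,
    // with multi-term expansion capped at maxExpansions terms.
    public static PhraseWildcardQuery buildQuery(String field, int maxExpansions) {
        return new PhraseWildcardQuery.Builder(field, maxExpansions)
                .addTerm(new BytesRef("term1"))                       // exact phrase term
                .addMultiTerm(new PrefixQuery(new Term(field, "te"))) // wildcard-like term
                .build();
    }
}
```

The expansion cap is the interesting knob for the slowdown discussed below: it bounds how many terms a short prefix like "t*" is allowed to expand into.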
> On Feb 12, 2020, at 10:01 PM, baris.ka...@oracle.com wrote:
>
> Thanks, David. Can I look at the source code?
> I think ComplexPhraseQueryParser uses something similar.
> I will check the differences, but do you know them offhand, for quick reference?
> Thanks
>
>> On Feb 12, 2020, at 6:41 PM, David Smiley <david.w.smi...@gmail.com> wrote:
>>
>> Hi,
>>
>> See org.apache.lucene.search.PhraseWildcardQuery in Lucene's sandbox module.
>> It was recently added by my amazing colleague Bruno. At this time there is
>> no query parser in Lucene that uses it, unfortunately, but you can rectify
>> this for your own purposes. I hope this query "graduates" to Lucene core
>> some day. Its placement in the sandbox is why it can't be added to any of
>> Lucene's query parsers, such as the complex phrase parser.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>> On Wed, Feb 12, 2020 at 11:07 AM <baris.ka...@oracle.com> wrote:
>>> Hi,-
>>>
>>> Regarding the mechanisms I mentioned below:
>>>
>>> does this class offer any embedded shingling capability?
>>>
>>> I could not find any API within ComplexPhraseQueryParser for that purpose.
>>>
>>> For instance, does this class offer a most-commonly-used-words API?
>>>
>>> I could then use one of those words to get a second character for the search, like
>>>
>>> term1 term2FirstCharTerm2SecondChar* (where I would look up
>>> term2FirstChar in my dictionary hashmap to find the most common word
>>> starting with it, and put that word's second character into the search query).
>>>
>>> Having the second character in the search query reduces search time about 20-fold.
>>>
>>> Otherwise, do I have to use the following at index time? I already have
>>> a TextField index with my custom analyzer.
>>>
>>> How should I embed the shingle filter into my current custom analyzer?
>>> I don't want to disturb my current indexing.
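The dictionary-hashmap idea above (look up the most common word by its first character, then borrow that word's second character to lengthen the prefix) can be precomputed once over the vocabulary. A plain-Java sketch — the word counts here are hypothetical stand-ins for real corpus statistics:

```java
import java.util.HashMap;
import java.util.Map;

public class CommonWordDictionary {

    // Maps each initial letter to the most frequent word starting with it.
    public static Map<Character, String> build(Map<String, Integer> wordCounts) {
        Map<Character, String> best = new HashMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            String word = e.getKey();
            if (word.isEmpty()) continue;
            char first = Character.toLowerCase(word.charAt(0));
            String current = best.get(first);
            if (current == null || wordCounts.get(current) < e.getValue()) {
                best.put(first, word);
            }
        }
        return best;
    }

    // Expands a one-char prefix into a two-char prefix using the dictionary,
    // falling back to the single character when no expansion is known.
    public static String expandPrefix(char firstChar, Map<Character, String> best) {
        String word = best.get(Character.toLowerCase(firstChar));
        return (word != null && word.length() >= 2)
                ? word.substring(0, 2)
                : String.valueOf(firstChar);
    }
}
```

For example, if "the" is the most common t-word in the data, the query prefix "t" would become "th", which is what produces the roughly 20x speedup mentioned above.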
>>>
>>> All I want to do is find the most common word in my data for each letter
>>> of the alphabet.
>>>
>>> Should I do this at search time? That would be costly, right?
>>>
>>> view-source:http://www.philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/
>>>
>>> If you need to parse the token n-grams of a string, you may use the
>>> facilities offered by Lucene analyzers.
>>>
>>> What you simply have to do is build your own analyzer using a
>>> ShingleMatrixFilter with the parameters that suit your needs. For
>>> instance, here are the few lines of code to build a token bi-gram analyzer:
>>>
>>> public class NGramAnalyzer extends Analyzer {
>>>     @Override
>>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>>         return new StopFilter(
>>>             new LowerCaseFilter(
>>>                 new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
>>>             StopAnalyzer.ENGLISH_STOP_WORDS);
>>>     }
>>> }
>>>
>>> The parameters of the ShingleMatrixFilter simply state the minimum and
>>> maximum shingle size. "Shingle" is just another name for token n-grams,
>>> and shingles are popular as the basic units for solving problems in
>>> spell checking, near-duplicate detection, and others.
>>> Note also the use of a StandardTokenizer to deal with basic special
>>> characters like hyphens and other "disturbers".
>>>
>>> To use the analyzer, you can for instance do:
>>>
>>> public static void main(String[] args) {
>>>     try {
>>>         String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
>>>         Analyzer analyzer = new NGramAnalyzer();
>>>
>>>         TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
>>>         Token token = new Token();
>>>         while ((token = stream.next(token)) != null) {
>>>             System.out.println(token.term());
>>>         }
>>>     } catch (IOException ie) {
>>>         System.out.println("IO Error " + ie.getMessage());
>>>     }
>>> }
>>>
>>> The output will print:
>>>
>>> an easy
>>> easy way
>>> way to
>>> to write
>>> write an
>>> an analyzer
>>> analyzer for
>>> for tokens
>>> tokens bi
>>> bi gram
>>> gram or
>>> or even
>>> even tokens
>>> tokens n
>>> n grams
>>> grams with
>>> with lucene
>>>
>>> Note that the text "bi-gram" was treated as two different tokens, a
>>> desired consequence of using a StandardTokenizer in the
>>> ShingleMatrixFilter initialization.
>>>
>>> Best regards
>>>
>>> On 2/4/20 11:14 AM, baris.ka...@oracle.com wrote:
>>> >
>>> > Thanks, but I thought this class would have a mechanism to fix this issue.
>>> > Thanks
>>> >
>>> >> On Feb 4, 2020, at 4:14 AM, Mikhail Khludnev <m...@apache.org> wrote:
>>> >>
>>> >> It's slow per se, since it loads term positions. The usual advice is
>>> >> shingling or edge n-grams. Note, if this is not text but a string or
>>> >> an enum, that probably lets you apply other tricks. Another idea is that
>>> >> IntervalQueries can perhaps be smarter and faster in certain cases,
>>> >> although they are backed by the same slow positions.
>>> >>
>>> >>> On Tue, Feb 4, 2020 at 7:25 AM <baris.ka...@oracle.com> wrote:
>>> >>>
>>> >>> How can this slowdown be resolved?
>>> >>> Is this another limitation of this class?
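A caveat on the quoted blog post: it targets a much older Lucene, and both its `tokenStream(String, Reader)` override and `ShingleMatrixFilter` were later deprecated and removed (the analyzers-common module's `ShingleFilter` is the surviving counterpart). The bi-gram idea itself is simple enough to illustrate without Lucene at all; a stdlib-only sketch that reproduces the same output, where the regex tokenizer is a rough stand-in for StandardTokenizer:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenBigrams {

    // Lowercases, splits on non-alphanumeric runs (so "bi-gram" becomes two
    // tokens, as in the blog post), and emits adjacent-token bi-grams.
    public static List<String> bigrams(String text) {
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].isEmpty() || tokens[i + 1].isEmpty()) continue;
            result.add(tokens[i] + " " + tokens[i + 1]);
        }
        return result;
    }
}
```

Running this on the blog's sample sentence yields the same 17 bi-grams listed above, from "an easy" through "with lucene".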
>>> >>> Thanks
>>> >>>
>>> >>>> On Feb 3, 2020, at 4:14 PM, baris.ka...@oracle.com wrote:
>>> >>>> Please ignore the first comparison there. I was comparing {term1
>>> >>>> with 2 chars} vs. {term1 with >= 5 chars + term2 with 1 char}.
>>> >>>>
>>> >>>> The slowdown is:
>>> >>>>
>>> >>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>>> >>>> compared to "term1*" when term1 has > 5 chars and term2 is still 1 char.
>>> >>>>
>>> >>>> Best regards
>>> >>>>
>>> >>>>> On 2/3/20 4:13 PM, baris.ka...@oracle.com wrote:
>>> >>>>> Hi,-
>>> >>>>>
>>> >>>>> I hope everyone is doing great.
>>> >>>>>
>>> >>>>> I saw this issue with this class: if you search for "term1*",
>>> >>>>> performance is good (about 4 millisecs when it has >= 5 chars, and
>>> >>>>> ~250 millisecs when it has 2 chars),
>>> >>>>> but when you search for "term1 term2*", where term2 is a single
>>> >>>>> char, performance degrades badly.
>>> >>>>> The query "term1 term2*" slows down 50 times (~200 millisecs)
>>> >>>>> compared to the "term1*" case when term1 has > 5 chars and term2 is
>>> >>>>> still 1 char.
>>> >>>>> The query "term1 term2*" slows down 400 times (~1500 millisecs)
>>> >>>>> compared to "term1*" when term1 has > 5 chars and term2 is still 1 char.
>>> >>>>> Is there any suggestion to speed it up?
>>> >>>>>
>>> >>>>> Best regards
>>> >>>>>
>>> >>>>> ---------------------------------------------------------------------
>>> >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >>
>>> >> --
>>> >> Sincerely yours
>>> >> Mikhail Khludnev
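On Mikhail's edge n-gram suggestion above: the idea is to index the prefixes of each term so that a trailing-wildcard query like "te*" becomes a cheap exact-term lookup instead of a term expansion. Lucene ships this as EdgeNGramTokenFilter in analyzers-common; the core of it in plain Java, for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {

    // Emits the leading prefixes ("edge n-grams") of a term, from minGram to
    // maxGram characters. Indexing these alongside the original terms trades
    // index size for prefix-query speed: "te*" matches the stored gram "te"
    // exactly, with no positions-heavy expansion at search time.
    public static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }
}
```

This would go in a separate indexed field, which does mean re-indexing — the trade-off the thread is weighing against search-time tricks.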