Thanks Uwe for the detailed & clear response. I will definitely go further into details of 4.0 improvements.
Thanks, Arun On Fri, Mar 29, 2013 at 4:21 PM, Uwe Schindler <[email protected]> wrote: > Hi Arun, > > this is hard to explain with one single reason - To conclude: > - A wildcard query that consists only of a prefix ("ab*") is internally > implemented by iterating all terms starting with "ab" and ending with the > one before "ac"). This is not different to Lucene 3.x and there is no room > for optimization. If you look at > o.a.l.util.automaton.CompiledAutomaton#getTermsEnum(), you see that prefix > terms are implemented by PrefixTermsEnum, that uses the above algorithm and > is mostly a copy of the Lucene 3.x PrefixTermEnum. > - A more complex wildcard or regex query that does not solely consist of a > prefix but more restrictions after the prefix is faster in 4.x than in > Lucene 3.x, because Lucene 3.x was only able to scan all terms starting > with the common prefix ("ab*xy" would have approx. same speed as a prefix > "ab*", it just removes all terms not matching *xy in a second step). This > is also the improvement in FuzzyQuery: In Lucene 3.x a FuzzyQuery has no > prefix at all (unless you specify one, but that limits the fuzzy use), so > it has to scan all terms in the term dictionary and filter them against the > Levenshtein distance. In Lucene 4.x, it can use an automaton, to > effectively seek inside the term enum to find all terms that match the > (non-prefix) regex, wildcard or Levensthein distance. In short: Lucene 4 is > able to step over terms that cannot match the automaton. A cool example is > the following regex: "(ht|f)tp://.*". In Lucene 3.x this would scan all > terms because there is no common prefix, in 4.x this expands to something > similar to the intersection of 2 prefix queries, so the automaton > intersection between the regex and the term dictionary will seek to term > "ftp://" first, iterate until the last term of this prefix and then seeks > to "http://", doing the same. > > If you only have a prefix query (wildcard "ab*" or regex "ab.*"), then the > query is executed in the same way like in Lucene 3.x, no automatons are > involved at all! The reason why you see a speed improvement is not caused > by the automaton implementation (it's not used at all), but it is caused by > a complete rewrite of Lucene's algorithms regarding the term dictionary and > the posting lists (Lucene Codecs). E.g. Lucene 4.0 uses a new term > dictionary based on FST, Lucene 4.0 no longer creates Strings all the time > (all terms are just slices inside a large byte[]),... and many more > improvements. Also when scanning the posting lists to find all documents > related to the terms found, Lucene uses different data structures (e.g., > block postings in 4.1). So it’s a number of improvements that speed up the > index (also while indexing!). Just review the numerous presentations on > conferences or the blog posts by Mike McCandless ( > http://blog.mikemccandless.com/). > > > - Can we do something to improve speed for such queries ? > > You cannot improve speed of pure prefix queries. Maybe another data > structure is helpful? E.g. if you scan for dates as string (queries on > dates like "201301*"), it might be better to use a numeric field and use > NumericRangeQuery(20130101, 20130131)? Another approach for improving > prefix queries is indexing additional terms: If you are always searching > for a 2-char prefix for "ab*", then simply index an additional term in a > separate field with 2 chars (e.g., "ab") in your documents while indexing > and just search for "ab"? This is very fast! This also applies to Lucene > 3.x! > > Uwe > > ----- > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: [email protected] > > > > -----Original Message----- > > From: Arun Kumar K [mailto:[email protected]] > > Sent: Friday, March 29, 2013 11:13 AM > > To: java-user > > Subject: Re: Wild Card Query Performance > > > > Hi Uwe, > > > > Thanks for the info. > > You were mentioning about term dictionary and and other index > > components. I didn't get this. > > - What could be the other factors that improve the speed of such query ? > > Can u explain or give some pointers to this ? > > - Can we do something to improve speed for such queries ? > > > > Also other observation is indexing time has increased by around 6% in > 4.0. > > > > Arun > > > > On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <[email protected]> wrote: > > > > > Hi, > > > > > > It depends on the type of wildcard query. If you only have a prefix > > > (ab*), they rewrite to a simple PrefixQuery and this one is > > > implemented exactly like in 3.x, so you only see the speed > > > improvements of Lucene 4.0 in the term dictionary and and other index > > > components, not related to the query itsself. > > > > > > If you have wildcards like ab?xy, then this query will be multiple > > > times faster than in 3.x, because the "?" wildcard can only expand to > > > a limited set of terms, while in Lucene 3.x, it still scans all terms > > > with prefix "ab". The same applies to other wildcard constructs, if > > > they limit more than just prefix. > > > > > > Uwe > > > > > > ----- > > > Uwe Schindler > > > H.-H.-Meier-Allee 63, D-28213 Bremen > > > http://www.thetaphi.de > > > eMail: [email protected] > > > > > > > > > > -----Original Message----- > > > > From: Arun Kumar K [mailto:[email protected]] > > > > Sent: Friday, March 29, 2013 10:38 AM > > > > To: java-user > > > > Subject: Wild Card Query Performance > > > > > > > > Hi Guys, > > > > > > > > I have been testing the search time improvement in Lucene 4.0 from > > > > Lucene > > > > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*). > > > > > > > > For a 2GB size index with 4000000 docs, the following observations > > > > were > > > > made: > > > > > > > > Around 3X improvement with and without STRING sort on a sortable > > field. > > > > > > > > I guess this improvement is because of the Automation Query by > > > > Robert which is used in WildCard Queries. > > > > > > > > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but > > > > these wildcard queries are not that faster comparatively. > > > > > > > > I have used default codecs and postings format. > > > > > > > > Did i miss something or is it the max improvement that we can expect > > > > currently for WildCard Queries? > > > > > > > > > > > > Arun > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [email protected] > > > For additional commands, e-mail: [email protected] > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
