Re: BooleanQuery TooManyClauses in wildcard search

Erick Erickson Mon, 03 Dec 2007 05:19:17 -0800

First time I tried this I made it WAAAAAY more complex than it is <G>....


WARNING: this is from an older code base so you may have to tweak
it. Might be 1.9 code....

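import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.WildcardTermEnum;

// Marks every document that contains a term matching field:value,
// where value may contain the wildcards * and ?.
public class WildcardTermFilter
        extends Filter {

    private static final long serialVersionUID = 1L;

    private final String field;
    private final String value;

    public WildcardTermFilter(String field, String value) {
        this.field = field;
        this.value = value;
    }

    public BitSet bits(IndexReader reader)
            throws IOException {
        // Local rather than a shared field, so concurrent searches
        // cannot overwrite each other's results.
        BitSet bits = new BitSet(reader.maxDoc());

        TermDocs         termDocs = reader.termDocs();
        WildcardTermEnum wildEnum =
                new WildcardTermEnum(reader, new Term(field, value));

        try {
            // term() returns null once the enumeration is exhausted.
            for (Term term; (term = wildEnum.term()) != null; wildEnum.next()) {
                termDocs.seek(term);

                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            wildEnum.close();
            termDocs.close();
        }

        return bits;
    }
}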
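
To see why a filter like the one above cannot hit TooManyClauses, here is a minimal sketch in plain Java with no Lucene dependency. ToyIndex, its postings map, and the prefix-only matching are all assumptions made for illustration; the real filter walks WildcardTermEnum and TermDocs instead.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an index: term text -> IDs of documents containing it.
class ToyIndex {
    private final TreeMap<String, int[]> postings = new TreeMap<>();
    private final int maxDoc;

    ToyIndex(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    void add(String term, int... docIds) {
        postings.put(term, docIds);
    }

    // Same idea as bits(): walk every term that starts with the prefix and
    // OR its documents into one BitSet. No per-term query clause is created,
    // so the number of matching terms never runs into a clause limit.
    BitSet docsWithPrefix(String prefix) {
        BitSet bits = new BitSet(maxDoc);
        // tailMap starts at the first term >= prefix in sorted order.
        for (Map.Entry<String, int[]> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted order: we are past the last matching term
            }
            for (int doc : e.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(5);
        index.add("state", 0, 2);
        index.add("states", 2, 3);
        index.add("stating", 4);
        index.add("tree", 1);
        System.out.println(index.docsWithPrefix("stat")); // prints {0, 2, 3, 4}
    }
}
```

The cost is what the thread discusses: the result is a set of document IDs, so the wildcard part contributes nothing to the score.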


On Dec 2, 2007 8:34 AM, Ruchi Thakur <[EMAIL PROTECTED]> wrote:

> Erick, can you please point me to an example of creating a filtered
> wildcard query? I have not used filters before. I tried reading about them
> but still don't understand how filters actually work and how they will
> help me get rid of the TooManyClauses exception.
>
>  Regards,
>  Ruchika
>
> Erick Erickson <[EMAIL PROTECTED]> wrote:
>  See below:
>
> On Dec 1, 2007 1:16 AM, Ruchi Thakur wrote:
>
> >
> > Erick/John, thank you so much for the reply. I have gone through the
> > mailing list you redirected me to. I know I need to read more, but I have
> > some quick questions. Please bear with me if they appear too simple.
> > Below is the code snippet of my current search. I also need the score
> > of each document returned by the search, as I display the search
> > results in order of score.
> > {
> > Directory fsDir = FSDirectory.getDirectory(aIndexDir, false);
> > IndexSearcher is = new IndexSearcher(fsDir);
> > ELSAnalyser elsAnalyser = new ELSStopAnalyser();
> > Analyzer analyzer = elsAnalyser.getAnalyzer();
> > QueryParser parser = new QueryParser(aIndexField, analyzer);
> > Query query = parser.parse(aSearchStr);
> > hits = is.search(query);
> > }
> >
>
> EOE: Minor point that you probably already know, but opening a searcher is
> expensive. I'm assuming you put it in here for clarity, but in case not, be
> aware that you should open a reader and re-use it as much as possible.
>
> Also, it looks like you're using an older version of Lucene, since
> getDirectory(dir, bool) is deprecated.....
>
>
> >
> > As I understand it from the mail archives you suggested, below is what
> > we need to do.
> > 1) The second was to build a *Filter* that uses WildcardTermEnum -- not a
> > Query. Because it's a filter, the scoring aspects of each document are
> > taken out of the equation. (I am worried about this, as I need scoring
> > info.)
> >
>
> This is true *for the wildcard clause*. It's a legitimate question to ask
> what scoring means for a wildcard clause. Rather, it's legitimate to ask
> whether that adds much value. I managed to convince my product manager that
> the end user experience didn't suffer enough to matter, but it can be
> argued either way.
>
> That said, I'm pretty sure that if you make this a sub-clause of a boolean
> query, you still get scoring for the *other* parts of the query. That is,
> BooleanQuery bq = ....
> bq.add(regular query);
> bq.add(filtered wildcard query);
>
> search (bq);
>
> (note, really sloppy pseudo code there) will give you scoring for
> the "regular query" part of the bq. Of course that requires you to
> break up the incoming query into the wildcard parts and the
> non-wildcard parts.......
>
>
> >
> > 2) Once you have a "WildcardFilter", wrapping it in a ConstantScoreQuery
> > would give you a drop-in replacement for WildcardQuery that would
> > sacrifice the TF/IDF scoring factors for speed and guaranteed execution on
> > any pattern in any index regardless of size. (Does that mean it will solve
> > my scoring issue and I will get scoring info?)
> >
>
> I'm pretty sure that you don't get scoring here. ConstantScoreQuery is
> named that way on purpose: every matching document gets the same score.
>
>
> >
> > It also suggests "SpanNearQuery on a wildcard". I am kind of confused
> > about which approach should actually be used. Please suggest. In the
> > meantime I am studying more about it. Thanks a lot for your help on this.
> >
>
> I think I was looking at this for a method of highlighting, but span
> queries won't fix up wildcard queries.
>
> Handling arbitrary wildcard queries, that is, queries with, say, only
> one or two leading letters, is an area of Lucene that requires one to
> really dig into the guts of querying and do some custom work.
> We've had quite reasonable results by imposing the restriction that
> wildcard queries MUST have three leading characters. That is,
> a*, ab* are illegal, but abc* is legal.
>
> What I'd suggest is to start by imposing that restriction, bumping the
> maximum clause count (BooleanQuery.setMaxClauseCount) and just letting
> Lucene do its tricks. Perhaps catch the TooManyClauses exception and
> inform the user with some message like "query too general" or some such.
> Then, if the product manager says that is unacceptable, let them
> know what the cost is and go from there. You'll get push-back that
> "of course we must allow one character wildcards", but in my
> experience that's often just a knee-jerk reaction and not as
> much of a requirement as people think. Your situation may vary....
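
The three-leading-characters rule is easy to check before the query ever reaches Lucene. A minimal sketch in plain Java follows; WildcardGuard and its method names are hypothetical, it looks at a single term rather than full query syntax, and it assumes no escaped wildcard characters.

```java
// Hypothetical helper: rejects wildcard terms with too few leading literal characters.
class WildcardGuard {

    // Number of characters before the first '*' or '?', or the full length if there is none.
    static int literalPrefixLength(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return term.length();
    }

    // Terms without wildcards always pass; wildcard terms need at least minPrefix leading literals.
    static boolean isAllowed(String term, int minPrefix) {
        int prefix = literalPrefixLength(term);
        boolean hasWildcard = prefix < term.length();
        return !hasWildcard || prefix >= minPrefix;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("a*", 3));   // false
        System.out.println(isAllowed("ab*", 3));  // false
        System.out.println(isAllowed("abc*", 3)); // true
    }
}
```

A rejected term is where the "query too general" message would go, instead of waiting for the exception.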
>
> Yet another possibility, depending upon the size and space
> requirements, is to play interesting games with your index. For
> instance, for every word, also index special one- and two-letter tokens,
> e.g. abcde gets a$, ab$ and abcde all indexed. Then you
> pre-process your query and turn a* into a$, ab* into ab$
> and just use these terms (you have to take some care
> that your query analyzer doesn't strip these out). Now you've
> turned arbitrary wildcards into either your three-letter-prefix rule
> or simple term queries, preserving relevance etc. You have
> to take some care to index your special tokens with a position
> increment of 0 so phrase and span queries work, but that's not very
> hard, especially since Lucene In Action demonstrates the
> SynonymAnalyzer which does the same thing.
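
The token scheme above can be sketched in plain Java, leaving out the Lucene analyzer plumbing. PrefixTokens and its method names are hypothetical, and the sketch assumes '$' never occurs in real terms. In a real analyzer the extra tokens would be emitted at the same position as the word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the "special short-prefix token" indexing trick.
class PrefixTokens {
    static final char MARKER = '$';

    // Tokens to index for one word: abcde -> [a$, ab$, abcde].
    // A one- or two-letter word also gets markers, since a* and ab* match it too.
    static List<String> indexTokens(String word) {
        List<String> tokens = new ArrayList<>();
        for (int len = 1; len <= Math.min(2, word.length()); len++) {
            tokens.add(word.substring(0, len) + MARKER);
        }
        tokens.add(word);
        return tokens;
    }

    // Rewrites a* -> a$ and ab* -> ab$; every other term is returned unchanged.
    static String rewriteQueryTerm(String term) {
        int n = term.length();
        boolean shortPrefix = n >= 2 && n <= 3
                && term.indexOf('*') == n - 1   // exactly one star, at the end
                && term.indexOf('?') < 0;
        return shortPrefix ? term.substring(0, n - 1) + MARKER : term;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("abcde"));     // [a$, ab$, abcde]
        System.out.println(rewriteQueryTerm("ab*"));  // ab$
        System.out.println(rewriteQueryTerm("abc*")); // abc*
    }
}
```

Terms that come back unchanged, such as abc*, still go through the normal wildcard path under the three-letter rule.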
>
> But note the implicit restriction here that you really
> don't search for ab*cd*e* with this scheme, but
> neither does Google.
>
> Whew! enough for Saturday morning ....
>
> Best
> Erick
>
>
> >
> > Best Regards,
> > Ruchika
> >
> > Erick Erickson wrote:
> > John's answer is spot-on. There's a wealth of information in the user
> > group archives that you should be able to search on discussing ways of
> > providing the functionality. One thread titled "I just don't get wildcards
> > at all" is one where the folks who know generously helped me out.
> >
> > Once you find out how to search for that you'll know you're in the right
> > place.
> > Here's the searchable archive.....
> >
> >
> >
> http://www.gossamer-threads.com/lists/engine?do=search;search_forum=forum_2;;list=lucene
> >
> > Make sure you select the "java user" from the top drop-down labeled
> > "Search".
> >
> > Best
> > Erick
> >
> > On Nov 30, 2007 2:07 PM, John Byrne wrote:
> >
> > > Hi,
> > >
> > > Your problem is that when you do a wildcard search, Lucene expands the
> > > wildcard term into all possible terms. So, searching for "stat*"
> > > produces a list of terms like "state", "states", "stating" etc. (It only
> > > uses terms that actually occur in your index, however.) These terms are
> > > all added as OR clauses of a boolean query.
> > >
> > > The thing is, by default, there is a limit of 1024 clauses for a boolean
> > > query. If your wildcard term expands into more than this (which happens
> > > very easily), you get the exception you described. You can solve the
> > > issue by setting the maximum clause count yourself, using
> > >
> > > BooleanQuery.setMaxClauseCount(int maxClauseCount)
> > >
> > > See
> > >
> > >
> >
> > > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/core/index.html
> > > for more info.
> > >
> > > Bear in mind that putting a wildcard near the start of the term results
> > > in a large number of boolean clauses, which increases memory usage. This
> > > is the reason for the default limit. This limit will also affect fuzzy
> > > queries, because they are expanded in the same way.
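
The expansion described above can be imitated in plain Java to show where the limit bites. This is only a sketch: ToyExpansion is hypothetical, a sorted set stands in for the index's term dictionary, only trailing-star (prefix) patterns are handled, and IllegalStateException stands in for TooManyClauses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical imitation of wildcard expansion into OR clauses.
class ToyExpansion {
    static final int MAX_CLAUSES = 1024; // the default limit mentioned above

    // Collects every indexed term that starts with the prefix, one "clause" per term,
    // and fails as soon as adding another clause would exceed the limit.
    static List<String> expand(TreeSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no further term can match
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("too many clauses: limit is " + maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        terms.add("state");
        terms.add("states");
        terms.add("stating");
        for (int i = 0; i < 2000; i++) {
            terms.add(String.format("t%04d", i)); // 2000 made-up terms starting with "t"
        }
        System.out.println(expand(terms, "stat", MAX_CLAUSES)); // [state, states, stating]
        try {
            expand(terms, "t", MAX_CLAUSES);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // too many clauses: limit is 1024
        }
    }
}
```

Raising the limit makes the second call succeed, at the price of building one clause per matching term.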
> > >
> > > Regards,
> > > JB
> > >
> > > Ruchi Thakur wrote:
> > > >
> > > > Hi there.
> > > > I am a new Lucene user and I have been searching the group archives
> > > > but couldn't solve the problem. I have just joined a project that uses
> > > > Lucene. We use the StandardAnalyzer for indexing our documents, and our
> > > > query is as follows when we issue a search string of t*, for example:
> > > > +t* +cont_type:pa
> > > >
> > > > When we issue some of our wildcard text searches we get the following
> > > > exception:
> > > > org.apache.lucene.search.BooleanQuery$TooManyClauses Exception : Max
> > > > clause is 1024
> > > >
> > > > Please suggest.
> > > >
> > > > Regards,
> > > > Ruchi
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ---------------------------------
> > > > Never miss a thing. Make Yahoo your homepage.
> > > >
> > > >
> > ------------------------------------------------------------------------
> > > >
> > > > No virus found in this incoming message.
> > > > Checked by AVG Free Edition.
> > > > Version: 7.5.503 / Virus Database: 269.16.11/1161 - Release Date:
> > > 30/11/2007 12:12
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
