I didn't code it, so I'm speaking at least second hand....

It's a valid question whether having larger clauses is useful to
the user. Having a 1024 term OR clause isn't narrowing that much.
Plus, I think, it was a number that says, in effect, "you should know
that this is getting to be an expensive query, don't be surprised
if it takes a while". I suspect it's probably semi-arbitrary way to
define "big and expensive" <G>.

<<If some user thought of just inputting random words then the system
will be brought to its knees and eventually die.>>>

Well, yes. If you allowed this pathological query through. Which you did
by bumping the max clauses parameter. In effect, you've been forced to
consider this issue in your design rather than be left wondering,
months from now when the system is in production, "why are some
queries slow?"...

One solution is to reject prefix queries that have fewer than, say, three
leading characters. Similarly for single-character wildcard characters.
The number of terms  that have to be ORed together reduces drastically
with more characters.

I'd recommend you think first about rejecting pathological queries rather
than wasting a lot of time making them work. I had to get over thinking
that "any query should work, no matter how bizarre", which I did by
thinking about how much use to a user a query that returned 100,000
results would actually be.


You could limit pathology by either just dropping the max booleans
back to a "reasonable" number as defined by you and responding
with an appropriate message or pre-processing the query and
rejecting it up front.

Best
Erick

On Thu, Apr 2, 2009 at 9:55 AM, Lebiram <lebi...@ymail.com> wrote:

> Hi Erick
>
> The query was a test data basically in anticipation of searches on all
> indices (4 index) with 12 million docs
> that should yield very small results. Obviously that query does not happen
> in real life but it did break the system.
> If some user thought of just inputting random words then the system will be
> brought to its knees and eventually die.
>
> Essentially, all our lucene index has about 8 fields; 1 field is being used
> as a filter (timestamp)
> the rest are normal fields which can accept wildcards.
>
> You have a point in Filters being useful for a few other fields we do have.
> I'll apply that.
> So that leaves about 5 fields that allows fuzzy search.
>
> Which goes back to the max clause problem. Lucene's default Max Clause is
> 1024, is there any reason behind this max?
>
> Thanks,
>
> M
>
>
> ________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, April 2, 2009 2:34:47 PM
> Subject: Re: Search using MultiSearcher generates OOM on a 1GB total
>  Partitioned indeces
>
> Ah, I get it now. Given that you bumped your max clause up, it makes
> sense. I'm pretty sure that the wildcard expansion is the root or your
> memory problems. The folks on the list helped me out a lot understanding
> what wildcards were about, see the thread titled "I just don't get
> wildcards
> at all" in the searchable archives from several years ago...
>
> Why do you want to generate queries of the form you showed? I'm
> wondering if this is an XY problem and if you gave us a higher level
> description of the problem you're trying to solve we'd be able to
> suggest other approaches. I have a really hard time imagining a use
> case where a user is well served by a clause that says
> "any document that has word beginning with g and h and d and s.....",
> so I'm assuming you're trying to solve something specific to your
> domain.....
>
> But if you really, truly do require this form, consider Filters. If your
> problem really requires single-letter starts, consider creating 26
> Filters at start up time and use those (see ConstantScoreQuery)
> That'll chew up about 1.5M each of memory, faaaaar less than
> you're consuming presently and will be blazingly fast. If you're
> not limited to single-characters, *still* consider filters. They'll
> consume little memory and are quite speedy to construct.
>
> Best
> Erick
>
>
> On Thu, Apr 2, 2009 at 5:04 AM, Lebiram <lebi...@ymail.com> wrote:
>
> > Hi Erick,
> >
> > I did a search just as JVM started... so I'm thinking that the JVM is
> busy
> > with some startup stuff... and that this search required more memory than
> > what is available at that time.
> >
> > Had I done this search a while after the JVM has started, then this query
> > succeeds.
> > I then pump in several similar queries running on a different thread and
> it
> > takes a long time but still runs to completion until one of them
> generates
> > OOM.But still, queries like this is just using too much memory.
> >
> > As for clauses, the BooleanQuery was set to max clause of... 9,000,000
> > I'm guessing that might have caused the usage of too much memory?
> >
> > I'll try the explain on you've suggested.
> >
> > Thanks,
> >
> > M
> >
> >
> >
> >
> > ________________________________
> > From: Erick Erickson <erickerick...@gmail.com>
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, April 1, 2009 6:51:13 PM
> > Subject: Re: Search using MultiSearcher generates OOM on a 1GB total
> >  Partitioned indeces
> >
> > Think about putting this query in Luke and doing an "explain" for
> details,
> > but....
> >
> > I'm surprised this is working at all without throwing TooManyClauses
> > errors.
> > Under the covers, Lucene expands your wildcards to all terms in the field
> > that match. For instance, assume your document field has the following:
> > aa
> > ab
> > ac
> > ad
> > ae
> >
> > Now, searching for a* produces a clause like:
> > (aa OR ab OR ac OR ad OR ae) in place of the a*
> >
> > So your query is generating ginormous OR clauses, one that
> > contains every term in your content field starting with 'g'. Another
> > with every term in your content field starting with 'h' etc. So I suspect
> > that your content field doesn't have very many distinct terms in it....
> >
> > As for why it's returning few entries, what does this part of your
> > query return by itself? Since it's anded with your wildcard query,
> > it might be what's limiting your results.
> >
> > ((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
> > (+viewers:cpuser9 +viewers:cpuser4))
> >
> > But I'm puzzled, because saying that you're getting OOM errors
> > doesn't square very well with getting *any* results at all, so is
> > there something else going on?
> >
> > Best
> > er...@morequestionsthananswers.
> >
> >
> > On Wed, Apr 1, 2009 at 1:31 PM, Lebiram <lebi...@ymail.com> wrote:
> >
> > > Hi All,
> > >
> > > I have the following query on a 1GB index with about 12 million docs :
> > > As you can see the terms consist of wildcards...
> > >
> > > query.toString()=+(+content:g* +content:h* +content:d* +content:s*
> > > +content:a* +content:w* +content:b* +content:c* +content:m*
> +content:e*)
> > > +((+sender:cpuser9 +viewers:cpuser4) (+sender:cpuser4 +viewers:cpuser9)
> > > (+viewers:cpuser9 +viewers:cpuser4))
> > >
> > > The Searcher is a MultiSearcher with 4 IndexSearchers pointing to 4
> > > different Lucene Index.
> > > This search returns very few records, several ten thousand hits.
> > >
> > > Java is assigned with 1GB max memory.
> > >
> > > Somehow this search eats the entire java heap space and causes OOM.
> > > Looking at jProfiler, i see org.apache.lucene package generating a lot
> of
> > > objects which I believe is taking all this space.
> > >
> > > Can anyone explain the reason why this particular search might take so
> > much
> > > memory?
> > > Is there anything I am doing wrong here?
> > > More importantly, is there anything I can do to improve this?
> > >
> > > -M
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>

Reply via email to