Re: How to edit / compile the SOLR source code

Erick Erickson Thu, 11 Mar 2010 16:50:58 -0800

Leaving aside some historical reasons, the root of
the issue is that any search has to identify all the
terms in a field that satisfy it. Let's take a normal
non-leading wildcard case first.

Finding all the terms like 'some*' will have to
deal with many fewer terms than 's*'. Just dealing with
that many terms will decrease performance, regardless
of the underlying mechanisms used. Imagine you're
searching down an ordered list of all the terms for
a field, assembling a list of matches, and then comparing
all the terms in that field with your list.....

So, pure wildcard searches, i.e. just *, would have to
handle all the terms in the index for the field.

The situation with leading wildcards is worse than
trailing, since all the terms in the index have to be
examined. Even doing something as bad as
a* will examine only terms starting in a. But looking
for *a has to examine each and every term in the index
because australia and zebra both qualify. There aren't
any good shortcuts if you think of having an ordered
list of terms in a field.
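
To make the difference concrete, here is a rough Python sketch. It
is only an illustration under the "ordered list of terms" mental
model above, not how Lucene actually stores or walks its term
dictionary; the function names and the tiny term list are made up
for the example.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Trailing wildcard (prefix*): jump to the first candidate with a
    binary search, then walk forward only while the prefix still matches."""
    matches = []
    examined = 0
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms):
        examined += 1
        if not sorted_terms[i].startswith(prefix):
            break  # ordered list: nothing after this can match
        matches.append(sorted_terms[i])
        i += 1
    return matches, examined

def suffix_matches(sorted_terms, suffix):
    """Leading wildcard (*suffix): the ordering is no help, so every
    term in the list has to be examined."""
    matches = [t for t in sorted_terms if t.endswith(suffix)]
    return matches, len(sorted_terms)

# Hypothetical terms for one field, kept in sorted order.
terms = sorted(["apple", "australia", "banana", "some", "something", "zebra"])

print(prefix_matches(terms, "a"))  # a*  stops at the first non-'a' term
print(suffix_matches(terms, "a"))  # *a  has to look at all six terms
```

With a real corpus the gap between the two "examined" counts is
what hurts: the prefix case grows with the number of matching
terms, the suffix case with the total number of terms in the field.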

So performance can degrade pretty dramatically when
you allow this kind of thing and the original writers
(my opinion here, I wasn't one of them) decided it was
much better to disallow it by default and require users
to dig around for the why rather than have them
crash and burn a lot by something that seems innocent
if you aren't familiar with the issues involved.

A better approach, and this isn't very obvious,
is to index your terms reversed, and do leading wildcard
searches on the *reversed* field as trailing wildcards.
E.g. 'some' gets indexed as 'emos' and the wildcard
search '*me' gets searched in the reversed field as
'em*'.
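
The reversal trick can be sketched in a few lines of Python. This
is a toy model only, assuming a plain sorted list stands in for the
reversed field; the helper names are hypothetical and this is not
the SOLR implementation.

```python
from bisect import bisect_left

def build_reversed_index(terms):
    """Index every term reversed, kept in sorted order ('some' -> 'emos')."""
    return sorted(t[::-1] for t in terms)

def leading_wildcard(reversed_index, suffix):
    """Answer '*suffix' by running the reversed suffix as a trailing
    wildcard (prefix search) against the reversed terms."""
    prefix = suffix[::-1]  # '*me' becomes 'em*'
    i = bisect_left(reversed_index, prefix)
    matches = []
    while i < len(reversed_index) and reversed_index[i].startswith(prefix):
        matches.append(reversed_index[i][::-1])  # un-reverse for display
        i += 1
    return matches

# Hypothetical terms; 'zebra' should not match '*me'.
reversed_index = build_reversed_index(["some", "same", "zebra", "time"])
print(leading_wildcard(reversed_index, "me"))  # ['same', 'time', 'some']
```

The results come back in reversed-term order ('emas', 'emit',
'emos'), and the cost is a second copy of the terms in the index.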

There may still be performance issues if you allow
single-letter wildcards, e.g. s* or *s, although a lot of
work has been done in this area in the last few years.
You'll have to measure in your situation. And beware
that a really common problem when deciding how many
real letters to allow is that it all works fine in your test
data, but when you load your real corpus and suddenly
SOLR/Lucene has to deal with 100,000 terms that
might match rather than the 1,000 in your test set, response
time changes....for the worse.

So I'd look around for the reversed idea (See SOLR-1321
in the JIRA), and at least one of the schema examples
has it.

One hurdle for me was asking the question "does it
really help the user to allow one or two leading
characters in a wildcard search?". Surprisingly often,
that's of no use to real users because so many
terms match that it's overwhelming. YMMV, but it's
a good question to ask if you find yourself in a
quagmire because you allow a* type of queries.

There are other strategies too, but that seems easiest....

Now, all that said, SOLR has done significant work
to make wildcards work well. These are just general
things to look out for when thinking about wildcards...

I really think hacking the parser will come back to bite
you as both a maintenance and a performance issue.
I wouldn't go there without a pretty exhaustive look at
other options.

HTH
Erick

On Thu, Mar 11, 2010 at 6:29 PM, JavaGuy84 <bbar...@gmail.com> wrote:

>
> Eric,
>
> Thanks a lot for your reply.
>
> I was able to successfully hack the query parser and enabled the leading
> wild card search.
>
> As of today I hacked the code for this reason only, I am not sure how to
> make the leading wild card search to work without hacking the code and this
> type of search is the preferred type of search in our organization.
>
> I had previously searched all over the web to find out 'why' that feature
> was disabled by default but couldn't find any solid answer stating the
> reason. In one of the postings on nabble it was mentioned that it might take
> a performance hit if we enable the leading wild card search, can you please
> let me know your comments on that?
>
> But I am very much interested in contributing some new stuff to SOLR group
> so I consider this as a starting point..
>
>
> Thanks,
> Barani
>
> Erick Erickson wrote:
> >
> > See Trey's comment, but before you go there.....
> >
> > What about SOLR's wildcard searching capabilities aren't
> > working for you now? There are a couple of tricks for making
> > leading wildcard searches work quickly, but this is a solved
> > problem. Although whether the existing solutions work in
> > your situation may be an open question...
> >
> > Or do you have to hack into the parser for other reasons?
> >
> > Best
> > Erick
> >
> > On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 <bbar...@gmail.com> wrote:
> >
> >>
> >> Hi,
> >>
> >> Sorry for asking this very simple question but I am very new to SOLR
> >> and I want to play with its source code.
> >>
> >> As an initial step I have a requirement to enable wildcard search
> >> (*text) in SOLR. I am trying to figure out a way to import the complete
> >> SOLR build to Eclipse and edit QueryParsing.java file but I am not able
> >> to import (I tried to import with ant project in Eclipse and selected
> >> the build.xml file and got an error stating javac is not present in the
> >> build.xml file).
> >>
> >> Can someone help me out with the initial steps on how to import / edit /
> >> compile / test the SOLR source?
> >>
> >> Thanks a lot for your help!!!
> >>
> >> Thanks,
> >> B
> >> --
> >> View this message in context:
> >> http://old.nabble.com/How-to-edit---compile-the-SOLR-source-code-tp27866410p27866410.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/How-to-edit---compile-the-SOLR-source-code-tp27866410p27871470.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
