Re: Handling wildcard search containing special characters (unicode)

2011-04-20 Thread Chris Hostetter

: Facing a Solr issue, I have been told that queries with a term like:
: Kiinteistösih*
: will not match the Finnish word Kiinteistösihteeri and that it's a
: known limitation of Lucene.

that is a misleading statement -- that type of query *can* match that 
word in a document, if the schema is configured in a way that preserves 
the raw term.

where people run into trouble is if they use stemming, or lowercasing, or 
ASCII folding, or any other form of analysis at indexing time, because 
at query time the query parser does not use analysis for prefix and 
wildcard searches (if it did, a search for something like dogs* might 
stem to dog*, which is not what the user asked for).
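
To illustrate the mismatch, here is a minimal Python sketch (it is not 
Lucene or Solr code). The analyze() function is a hypothetical stand-in 
for an index-time chain that lowercases and strips accents; the real 
filters may map some characters differently. The prefix taken from the 
wildcard query is compared against the indexed term with no analysis 
applied, which is the behaviour described above.

```python
import unicodedata


def analyze(text):
    """Hypothetical index-time analysis: lowercase, then strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prefix_matches(indexed_term, wildcard_query):
    """Compare the raw query prefix with the indexed term (no query analysis)."""
    prefix = wildcard_query.rstrip("*")
    return indexed_term.startswith(prefix)


# The term as it would be stored after the assumed analysis chain.
indexed = analyze("Kiinteistösihteeri")

print(indexed)                                    # kiinteistosihteeri
print(prefix_matches(indexed, "Kiinteistösih*"))  # False: case and accent differ
print(prefix_matches(indexed, "kiinteistösih*"))  # False: accent still differs
print(prefix_matches(indexed, "kiinteistosih*"))  # True: matches the stored form
```

Only the prefix that already looks like the stored term matches, because 
nothing transforms the wildcard term on its way to the index lookup.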


PS...

http://people.apache.org/~hossman/#solr-user
Please Use solr-user@lucene Not dev@lucene

Your question is better suited for the solr-user@lucene mailing list ...
not the dev@lucene list.  The dev list is for discussing development of
the internals of Solr and the Lucene Java library ... it is *not* the 
appropriate place to ask questions about how to use Solr or the Lucene 
Java library when developing your own applications.  



-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Handling wildcard search containing special characters (unicode)

2011-03-31 Thread Patrick ALLAERT
Hello,

Facing a Solr issue, I have been told that queries with a term like:
Kiinteistösih*
will not match the Finnish word Kiinteistösihteeri and that it's a
known limitation of Lucene.
Searching for the full word directly, without a wildcard, works.

Can you confirm whether this is a known limitation/bug?
If so, is there a registered issue for it?

Searching the mailing list archive and the issue trackers of both the
SOLR and LUCENE projects did not turn up any pointer to this problem.

One of the references I found on the web that discusses this problem is:
http://forum.compass-project.org/message.jspa?messageID=227709
But again, no pointer to a discussion or issue.

Thanks in advance for your help,
Patrick




Re: Handling wildcard search containing special characters (unicode)

2011-03-31 Thread Robert Muir
On Thu, Mar 31, 2011 at 9:51 AM, Patrick ALLAERT
<patrick.alla...@gmail.com> wrote:
> Hello,
>
> Facing a Solr issue, I have been told that queries with a term like:
> Kiinteistösih*
> will not match the Finnish word Kiinteistösihteeri and that it's a
> known limitation of Lucene.
> Searching for the full word directly, without a wildcard, works.
>
> Can you confirm whether this is a known limitation/bug?
> If so, is there a registered issue for it?

this isn't the case, there's no unicode limitation here.

more likely, your analyzer is configured to lowercase text, so in the
index Kiinteistösihteeri is really kiinteistösihteeri.
in other words, try kiinteistösih* and see how that works.




Re: Handling wildcard search containing special characters (unicode)

2011-03-31 Thread Patrick ALLAERT
2011/3/31 Robert Muir <rcm...@gmail.com>:
> On Thu, Mar 31, 2011 at 9:51 AM, Patrick ALLAERT
> <patrick.alla...@gmail.com> wrote:
>> Hello,
>>
>> Facing a Solr issue, I have been told that queries with a term like:
>> Kiinteistösih*
>> will not match the Finnish word Kiinteistösihteeri and that it's a
>> known limitation of Lucene.
>> Searching for the full word directly, without a wildcard, works.
>>
>> Can you confirm whether this is a known limitation/bug?
>> If so, is there a registered issue for it?
>
> this isn't the case, there's no unicode limitation here.
>
> more likely, your analyzer is configured to lowercase text, so in the
> index Kiinteistösihteeri is really kiinteistösihteeri.
> in other words, try kiinteistösih* and see how that works.

Following your suggestion, I tested with:
kiinteistösih*

but it did not return the expected result.

I found the reason: the ISOLatin1AccentFilterFactory filter, which is
present in both the index and query analyzers, strips the accent from
the indexed term.
Searching with:
kiinteistosih*
did the trick.
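
As a client-side workaround, the wildcard term can be folded before it
is sent to Solr. The following is a minimal Python sketch under stated
assumptions: fold() and normalize_wildcard() are hypothetical helper
names, and the folding only approximates what LowerCaseFilterFactory and
ISOLatin1AccentFilterFactory do (the real filters may expand some
characters differently).

```python
import unicodedata


def fold(text):
    """Approximate the assumed index-time chain: lowercase, then strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_wildcard(query):
    """Fold a wildcard term; the * and ? characters pass through unchanged."""
    return fold(query)


print(normalize_wildcard("Kiinteistösih*"))  # kiinteistosih*
```

The folded string is what would be placed in the q parameter instead of
the term as typed by the user.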

One question remains: why do I have to lowercase terms containing a
wildcard and do the ISO Latin1 accent conversion myself when I have:
<analyzer type="query">
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
...
for the corresponding fieldType?
I would have guessed the query analyzer would do that for me.

Your reply helped me a lot in understanding what's going on.
Thank you very much for your participation!

Patrick
