Peter:

Very interesting. To take care of the issue you mention, could you add
multiple "synonyms" with progressively fewer accents?

E.g. you'd index "préférence" as 4 tokens:
 préférence (unchanged)
 preférence (stripped one accent)
 préference (stripped the other accent)
 preference (stripped both accents)

Or does it yield too many tokens to be useful?
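
To get a feel for the token count, here is a minimal Python sketch of the
expansion described above. It is an illustration only, not Lucene or Solr
code: the helper names strip_accents and accent_variants are hypothetical,
and it assumes an "accent" is anything that Unicode NFD decomposition splits
off as a combining mark.

```python
import itertools
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFD decomposition (assumed definition of 'accent')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_variants(token):
    """Return every variant of token with some subset of its accented characters stripped."""
    token = unicodedata.normalize("NFC", token)
    # Positions of the characters that change when their accents are removed.
    accented = [i for i, ch in enumerate(token) if strip_accents(ch) != ch]
    variants = []
    # One True/False choice per accented character: strip it or keep it.
    for mask in itertools.product([False, True], repeat=len(accented)):
        chars = list(token)
        for strip, i in zip(mask, accented):
            if strip:
                chars[i] = strip_accents(chars[i])
        variants.append("".join(chars))
    return variants
```

For "préférence" this returns the 4 variants listed above. In general a token
with n accented characters yields 2^n variants, so a word with 5 accented
characters would already be indexed as 32 tokens.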

And how does this take care of scoring? Do you get a higher score with a
closer match?


 

-----Original Message-----
From: Binkley, Peter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 11, 2008 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Accented search

We've done this in a pre-Solr Lucene context by using the position
increment: when a token contains accented characters, you add a stripped
version of that token with a zero increment, so that for matching purposes
the original and the stripped version are at the same position. Accents are
not stripped from queries. The effect is that an accented search matches
your Doc A, and an unaccented search matches Docs A and B. We do that after
lower-casing the token.
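
A rough sketch of that idea, written in Python rather than as an actual
Lucene TokenFilter: analyze and positions are hypothetical names, whitespace
splitting stands in for a real tokenizer, and "accent" is assumed to mean a
combining mark under Unicode NFD decomposition.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFD decomposition (assumed definition of 'accent')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Yield (term, position_increment) pairs for index-time analysis."""
    for word in text.split():
        # Lower-case first, then look for accents.
        term = unicodedata.normalize("NFC", word.lower())
        yield term, 1  # the original token advances the position
        stripped = strip_accents(term)
        if stripped != term:
            yield stripped, 0  # the stripped copy stays at the same position


def positions(text):
    """Map each term to the set of positions at which it would be indexed."""
    index = {}
    pos = -1
    for term, increment in analyze(text):
        pos += increment
        index.setdefault(term, set()).add(pos)
    return index
```

With this sketch, each accented word in Doc A's title is indexed twice at the
same position (accented and stripped), while Doc B's title produces only the
unaccented terms, which is why phrase queries still line up.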

There are some limitations: users might start to expect that they can freely
add accents to restrict their search to accented hits, but if they don't
match the accents exactly they won't get any hits: e.g. if a word contains
two accented characters and the user only accents one of them in their
query, they won't match the accented or the unaccented version. 
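
The limitation can be reproduced with a small Python sketch. This is an
illustration under assumptions, not our actual code: indexed_terms and
matches are hypothetical helpers, and "accent" again means a combining mark
under Unicode NFD decomposition.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFD decomposition (assumed definition of 'accent')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def indexed_terms(token):
    """Terms stored for one token: the lower-cased original plus its fully stripped form."""
    term = unicodedata.normalize("NFC", token.lower())
    return {term, strip_accents(term)}


def matches(query, token):
    """Queries are lower-cased but keep exactly the accents the user typed."""
    return unicodedata.normalize("NFC", query.lower()) in indexed_terms(token)
```

Against an indexed "préférence", the queries "préférence" and "preference"
both match, but the partially accented "preférence" matches nothing, because
only the fully accented and fully stripped forms are in the index.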

Peter

Peter Binkley
Digital Initiatives Technology Librarian Information Technology Services
4-30 Cameron Library University of Alberta Libraries Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]

~ The code is willing, but the data is weak. ~


-----Original Message-----
From: climbingrose [mailto:[EMAIL PROTECTED]
Sent: Monday, March 10, 2008 10:01 PM
To: solr-user@lucene.apache.org
Subject: Accented search

Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google does with UTF-8 languages.

My requirements include:
1) Accent-insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
Trình Viên" is highlighted.
  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters "Lập Trình Viên", then Doc A should be given a higher
score than Doc B.
  if the query is "Lap Trinh Vien", Doc A should be given a higher score.

Any ideas guys? Thanks in advance!

--
Regards,

Cuong Hoang

