[Dspace-tech] DSpace/Lucene removing suffixes when searching?

2008-12-01 Thread Rafael Henkin
Hi,

 

We're using DSpace 1.4.2 with the Lucene that ships with it.

We don't know whether it's a matter of configuration (apparently not), 
but DSpace (or Lucene) is removing common suffixes from words before 
searching.

For example (removing "ações"):

2008-12-01 17:44:04,120 INFO  org.dspace.search.DSQuery @ Final query string: 
+(((title:alterações))) +location:m72

2008-12-01 17:44:04,129 INFO  org.dspace.search.DSQuery @ Search[+title:alter 
+location:m72], sort by (sorttitle), Result: [EMAIL PROTECTED]

 

Or (again "ações"):

 

2008-12-01 17:44:58,857 INFO  org.dspace.search.DSQuery @ Final query string: 
+(((title:modificações))) +location:m72

2008-12-01 17:44:58,887 INFO  org.dspace.search.DSQuery @ Search[+title:modific 
+location:m72], sort by (sorttitle), Result: [EMAIL PROTECTED]

 

With these words it wouldn't matter (sometimes you don't know the exact 
title you are searching for), but when you search for author Laura, it also 
returns results for Lauro, even though there ARE results for Laura (if there 
weren't any, I would understand).

 

2008-12-01 17:49:37,522 INFO  org.dspace.search.DSQuery @ Final query string: 
+(((tdautor:laura))) +location:m72

2008-12-01 17:49:37,530 INFO  org.dspace.search.DSQuery @ Search[+tdautor:laur 
+location:m72], sort by (sorttitle), Result: [EMAIL PROTECTED]

 

Is there any way to disable this through configuration or 
through DSpace itself, or is it inherent to Lucene?

 

Thanks,

 

Rafael

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] DSpace/Lucene removing suffixes when searching?

2008-12-01 Thread Kim Shepherd
Hi Rafael,

 

Look for the “search.analyzer =” line in your dspace.cfg file. The 
default is to use a standard English analyzer:

 

search.analyzer = org.dspace.search.DSAnalyzer

 

You’ll probably find that a Brazilian Portuguese analyzer is being used, 
and that it’s “helpfully” doing stemming, gender analysis on words, etc., which 
perhaps you don’t want?
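
As a rough illustration of why stemming makes "Laura" and "Lauro" collide, here is a toy Java sketch. It is not the Lucene Brazilian stemmer: the class name, the tiny suffix list, and the minimum stem length are all hypothetical, chosen only so that the output matches the terms shown in the DSQuery log lines from the original message. Non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Locale;

public class ToyStemmerDemo {
    // Hypothetical suffix list, longest first. "a\u00e7\u00f5es" is "acoes"
    // written with c-cedilla and o-tilde. A real stemmer has many more rules.
    private static final String[] SUFFIXES = {"a\u00e7\u00f5es", "a", "o"};

    // Assumed minimum number of characters that must remain after stripping.
    private static final int MIN_STEM_LENGTH = 3;

    // Strip the first matching suffix, if enough of the word is left over.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        String[] words = {"altera\u00e7\u00f5es", "modifica\u00e7\u00f5es", "laura", "lauro"};
        for (String w : words) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because both names reduce to the same term, a query for one matches documents indexed under the other, which is the behaviour described in the question.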

The only downside you’ll encounter going to the default DSAnalyzer is that 
characters like ç and õ won’t be automatically filtered to ‘c’ and ‘o’, which 
might make searching difficult for people whose keyboard layouts lack 
those characters.

 

It is easy to hack DSAnalyzer.java to include just the filters you 
want, without doing full stemming, stop-words, etc. I’ve done this on our 
repositories to filter out macronised vowels, which are common in Māori 
words. If you really don’t want the full Portuguese search analyzer but do want 
the extended Latin characters filtered, I suggest sticking with DSAnalyzer and 
adding ISOLatin1AccentFilter.java (available from Lucene) to the list of 
filters used.
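
To show what that kind of accent filtering does to a search term, here is a minimal sketch in plain Java. It does not use Lucene or DSAnalyzer at all: it stands in for the filter with java.text.Normalizer from the JDK, so its results are not guaranteed to be identical to ISOLatin1AccentFilter. The class and method names are hypothetical, and non-ASCII letters are written as Unicode escapes.

```java
import java.text.Normalizer;

public class AccentFoldDemo {
    // Decompose each accented letter into base letter + combining mark (NFD),
    // then delete the combining marks, leaving only the base letters.
    static String foldAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "alteracoes" with c-cedilla and o-tilde, and "Maori" with a-macron.
        String[] words = {"altera\u00e7\u00f5es", "M\u0101ori", "laura"};
        for (String w : words) {
            System.out.println(w + " -> " + foldAccents(w));
        }
    }
}
```

Note that this only removes accents; the whole word is kept, so "laura" and "lauro" stay distinct, unlike with stemming.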

 

It may be just as easy to hack org.apache.lucene.analysis.br 
(http://www.docjar.com/docs/api/org/apache/lucene/analysis/br/package-index.html) 
to only stem the words you want it to.

 

Cheers,

 

Kim

 
