Re: [Dspace-tech] Validation and accents

David Hjelm Mon, 11 Jun 2007 02:09:12 -0700

luis jose miralls wrote:

- DSpace does not find words with accents. Example: if I enter a word like
Camión and then try to find it ("Find: Camion"), it is not found.

The following worked for us (University Library in Gothenburg) with
dspace-1.4.1 & tomcat4. You could try it out here:
https://gupea.ub.gu.se/dspace

You can, for example, search for "thoren" and get authors named "Thorén",
or search for "skold" and get authors named "Sköld".

---------------------

Short answer:

- Write a new Lucene analyzer class that uses a filter which removes
accents.
- Edit dspace.cfg so that it uses the new class instead of the default.
- Rebuild, reindex and restart.

----------------------

Long answer:

org/dspace/search/DSAnalyzer.java is the default dspace search analyzer,
used for indexing the contents of the documents in the repository. If
you look at the method

public final TokenStream tokenStream(String fieldName, final Reader reader)

you can see that various filters are applied: one that converts all
words to lowercase, one that filters out stop words, and so on.

Here you need to make it apply a new filter that removes accents. I
found this one:

http://www.google.com/codesearch?q=isolatin1accentfilter&hl=en&btnG=Search+Code

apparently included in Lucene 2.1.0 (we have 2.0.0). It seems to exist
under both GPL and Apache licenses.
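
To make the effect of such a filter concrete, here is a minimal,
self-contained sketch of accent folding on a single token. It does not use
Lucene: the class name AccentFoldSketch and the method fold are made up for
illustration, and java.text.Normalizer (Java 6 or later) stands in for the
filter's character table. Unlike the Lucene filter, this approach leaves
characters that have no canonical decomposition (such as ø, æ and ß)
untouched.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration class; not part of DSpace or Lucene.
class AccentFoldSketch {

    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Cami\u00F3n" is "Camion" written with an acute accent on the o.
        // The indexed term and the query term meet once both sides are
        // folded and lowercased, as an analyzer does at index and query time.
        String indexed = fold("Cami\u00F3n").toLowerCase(Locale.ROOT);
        String query = fold("Camion").toLowerCase(Locale.ROOT);
        System.out.println(indexed.equals(query)); // prints true
    }
}
```

The same folding has to happen at both index time and query time, which is
why a reindex is needed after switching analyzers.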

I created two classes:
- se.gu.ub.lucene.analysis.sv.SwedishEnglishAnalyzer.java
( it contains Swedish stop words in addition to the English ones)
- se.gu.ub.lucene.analysis.sv.SwedishISOLatin1AccentFilter.java
( which is a copy of the IsoLatin1AccentFilter.java; I may later make it
keep the Swedish letters å, ä and ö unstripped, hence the renaming).

... put them in a folder $DSPACE_SRC/src/se/gu/ub/lucene/analysis/sv/

... added the following line to $DSPACE_SRC/config/dspace.cfg:
search.analyzer = se.gu.ub.lucene.analysis.sv.SwedishEnglishAnalyzer

... rebuilt and restarted dspace

... did $DSPACE_INSTALLDIR/bin/index-all

... restarted dspace

... done
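
For orientation, the search.analyzer line is an ordinary Java properties
entry naming a class. The sketch below shows one way such a setting could be
read and turned into an object. It is an assumption about the general
mechanism, not DSpace's actual code: the class AnalyzerConfigSketch and its
methods are hypothetical, the fallback name org.dspace.search.DSAnalyzer is
simply the default analyzer mentioned above, and java.lang.StringBuilder
stands in for an analyzer class so that the example runs without DSpace or
Lucene on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration class; not part of DSpace.
class AnalyzerConfigSketch {

    // Assumed fallback when the property is absent.
    static final String DEFAULT_ANALYZER = "org.dspace.search.DSAnalyzer";

    // Read the analyzer class name from properties-style configuration text.
    static String analyzerClassName(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
        String name = props.getProperty("search.analyzer");
        return (name == null) ? DEFAULT_ANALYZER : name.trim();
    }

    // Create an instance of the named class through its no-argument constructor.
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        String cfg = "search.analyzer = se.gu.ub.lucene.analysis.sv.SwedishEnglishAnalyzer\n";
        System.out.println(analyzerClassName(cfg));
        System.out.println(analyzerClassName("# no analyzer configured\n"));
        // Stand-in class, since the real analyzer is not on the classpath here.
        System.out.println(instantiate("java.lang.StringBuilder").getClass().getName());
    }
}
```

A misspelled class name in the setting would typically surface as a
ClassNotFoundException when the class is loaded, so the package and class
name in dspace.cfg must match the source files exactly.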

package se.gu.ub.lucene.analysis.sv;

/**
 * SwedishEnglishAnalyzer.java 
 * David Hjelm 2007
 * License: GNU General Public License, no particular version.
 * 
 * Builds loosely on the DSAnalyzer shipped with dspace 1.4.1, but apart from
 * the English stop words that class can hardly be considered original enough
 * for copyright when compared with the examples bundled with Lucene. If you
 * remove the English stop words, or restore Lucene's original stop words, you
 * can probably also safely remove the copyright notice below:
 *
 ** Copyright (c) 2002-2005, Hewlett-Packard Company and Massachusetts
 ** Institute of Technology.  All rights reserved.
 *
 * The Swedish stop words are taken from 
 * http://snowball.tartarus.org/algorithms/swedish/stop.txt under the
 * following license:
 * 
 ** All the software given out on this Snowball site is covered by the BSD
 ** License (see http://www.opensource.org/licenses/bsd-license.html ), 
 ** with Copyright (c) 2001, Dr Martin Porter, and (for the Java developments)
 ** Copyright (c) 2002, Richard Boulton.
 */
 
import java.io.Reader;
import java.util.Set;
import org.dspace.search.DSTokenizer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;

public class SwedishEnglishAnalyzer extends Analyzer{

    private static final String[] STOP_WORDS =
    {   
	// new stopwords (per MargretB)
	"a", "am", "and", "are", "as", "at", "be", "but", "by", "for",
	"if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
	"the", "to", "was",
	// Swedish stop words
	// (http://snowball.tartarus.org/algorithms/swedish/stop.txt)
	"och","det","att","i","en","jag","hon","som","han","på","den","med",
	"var","sig","för","så","till","är","men","ett","om","hade","de","av",
	"icke","mig","du","henne","då","sin","nu","har","inte","hans","honom",
	"skulle","hennes","där","min","man","ej","vid","kunde","något","från",
	"ut","när","efter","upp","vi","dem","vara","vad","över","än","dig",
	"kan","sina","här","ha","mot","alla","under","någon","eller","allt",
	"mycket","sedan","ju","denna","själv","detta","åt","utan","varit",
	"hur","ingen","mitt","ni","bli","blev","oss","din","dessa","några",
	"deras","blir","mina","samma","vilken","er","sådan","vår","blivit",
	"dess","inom","mellan","sådant","varför","varje","vilka","ditt","vem",
	"vilket","sitta","sådana","vart","dina","vars","vårt","våra","ert",
	"era","vilkas"
    };

    final static private Set stopSet = StopFilter.makeStopSet(STOP_WORDS);
    
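    public final TokenStream tokenStream(String fieldName, final Reader reader)
    {
        // Split the input into tokens with DSpace's own tokenizer.
        TokenStream result = new DSTokenizer(reader);

        result = new StandardFilter(result);
        // Strip accents. This runs before the StopFilter, so accented stop
        // words (e.g. "på") reach it as "pa" and are not removed.
        result = new SwedishISOLatin1AccentFilter(result);
        result = new LowerCaseFilter(result);

        result = new StopFilter(result, stopSet);
        //result = new PorterStemFilter(result);

        return result;
    }
}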
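
One thing worth knowing about the order of the filters in tokenStream above:
accents are stripped before the stop-word filter runs, so an accented stop
word such as "på" arrives at the stop filter as "pa" and no longer matches
the list. The following sketch imitates the two possible orders with plain
Java collections instead of Lucene classes. The class FilterOrderSketch, its
method names, and the use of java.text.Normalizer for folding are all
illustrative assumptions.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of DSpace or Lucene.
class FilterOrderSketch {

    // Two entries from the stop-word list: "p\u00E5" (pa with a ring) and "och".
    static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("p\u00E5", "och"));

    // Strip combining marks after canonical decomposition.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Same order as the analyzer above: fold, lowercase, then stop.
    static List<String> foldThenStop(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            String term = fold(t).toLowerCase(Locale.ROOT);
            if (!STOP.contains(term)) {
                out.add(term);
            }
        }
        return out;
    }

    // Alternative order: lowercase, stop, then fold.
    static List<String> stopThenFold(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!STOP.contains(lower)) {
                out.add(fold(lower));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("Hus", "p\u00E5", "landet", "och", "Sk\u00F6ld");
        System.out.println(foldThenStop(tokens)); // prints [hus, pa, landet, skold]
        System.out.println(stopThenFold(tokens)); // prints [hus, landet, skold]
    }
}
```

Whether this matters depends on how much the accented stop words inflate the
index. If it does, one option is to store the stop words in their
accent-free form, another is to move the accent filter after the stop filter.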

package se.gu.ub.lucene.analysis.sv;

/**
 * SwedishISOLatin1AccentFilter.java
 * David Hjelm 2007
 * License: GNU General Public License, no particular version.
 * 
 * This is a copy of 
 * fr.gouv.culture.sdx.search.lucene.analysis.filter.ISOLatin1Filter.java
 * which is published under the GPL. Only the class name has been changed. 
 * 
 * Removes all kinds of accents from characters when searching and indexing. 
 * It is one big switch statement that can be edited as needed (for example, 
 * if you do not want ö to be indexed as o).
 *
 */

import org.apache.lucene.analysis.*;

/**
 * A filter that replaces accented characters in the ISO Latin 1 character set
 * (ISO-8859-1) by their unaccented equivalent. The case will not be altered.
 * <p>
 * For instance, '&agrave;' will be replaced by 'a'.
 * <p>
 */
public class SwedishISOLatin1AccentFilter extends TokenFilter {
	public SwedishISOLatin1AccentFilter(TokenStream input) {
		super(input);
	}

	public final Token next() throws java.io.IOException {
		final Token t = input.next();
		if (t == null)
			return null;
		// Return a token with filtered characters.
		return new Token(removeAccents(t.termText()), t.startOffset(), t.endOffset(), t.type());
	}

	/**
	 * To replace accented characters in a String by unaccented equivalents.
	 */
	public final static String removeAccents(String input) {
		final StringBuffer output = new StringBuffer();
		for (int i = 0; i < input.length(); i++) {
			switch (input.charAt(i)) {

			case '\u00C0' : // À
			case '\u00C1' : // Á
			case '\u00C2' : // Â
			case '\u00C3' : // Ã
			case '\u00C4' : // Ä
			case '\u00C5' : // Å
				output.append("A");
				break;
			case '\u00C6' : // Æ
				output.append("AE");
				break;
			case '\u00C7' : // Ç
				output.append("C");
				break;
			case '\u00C8' : // È
			case '\u00C9' : // É
			case '\u00CA' : // Ê
			case '\u00CB' : // Ë
				output.append("E");
				break;
			case '\u00CC' : // Ì
			case '\u00CD' : // Í
			case '\u00CE' : // Î
			case '\u00CF' : // Ï
				output.append("I");
				break;
			case '\u00D0' : // Ð
				output.append("D");
				break;
			case '\u00D1' : // Ñ
				output.append("N");
				break;
			case '\u00D2' : // Ò
			case '\u00D3' : // Ó
			case '\u00D4' : // Ô
			case '\u00D5' : // Õ
			case '\u00D6' : // Ö
			case '\u00D8' : // Ø
				output.append("O");
				break;
			case '\u0152' : // Œ
				output.append("OE");
				break;
			case '\u00DE' : // Þ
				output.append("TH");
				break;
			case '\u00D9' : // Ù
			case '\u00DA' : // Ú
			case '\u00DB' : // Û
			case '\u00DC' : // Ü
				output.append("U");
				break;
			case '\u00DD' : // Ý
			case '\u0178' : // Ÿ
				output.append("Y");
				break;
			case '\u00E0' : // à
			case '\u00E1' : // á
			case '\u00E2' : // â
			case '\u00E3' : // ã
			case '\u00E4' : // ä
			case '\u00E5' : // å
				output.append("a");
				break;
			case '\u00E6' : // æ
				output.append("ae");
				break;
			case '\u00E7' : // ç
				output.append("c");
				break;
			case '\u00E8' : // è
			case '\u00E9' : // é
			case '\u00EA' : // ê
			case '\u00EB' : // ë
				output.append("e");
				break;
			case '\u00EC' : // ì
			case '\u00ED' : // í
			case '\u00EE' : // î
			case '\u00EF' : // ï
				output.append("i");
				break;
			case '\u00F0' : // ð
				output.append("d");
				break;
			case '\u00F1' : // ñ
				output.append("n");
				break;
			case '\u00F2' : // ò
			case '\u00F3' : // ó
			case '\u00F4' : // ô
			case '\u00F5' : // õ
			case '\u00F6' : // ö
			case '\u00F8' : // ø
				output.append("o");
				break;
			case '\u0153' : // œ
				output.append("oe");
				break;
			case '\u00DF' : // ß
				output.append("ss");
				break;
			case '\u00FE' : // þ
				output.append("th");
				break;
			case '\u00F9' : // ù
			case '\u00FA' : // ú
			case '\u00FB' : // û
			case '\u00FC' : // ü
				output.append("u");
				break;
			case '\u00FD' : // ý
			case '\u00FF' : // ÿ
				output.append("y");
				break;
			default :
				output.append(input.charAt(i));
				break;
			}
		}
		return output.toString();
	}
}
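
As the class comment notes, the folding can be adapted if some letters
should survive. The sketch below shows one way to keep the Swedish letters
å, ä and ö (and their capitals) while still folding other accents. It is not
the code running at Gothenburg: the class SwedishKeepingFoldSketch and its
method are hypothetical, and java.text.Normalizer is used instead of the long
switch to keep the example short, so characters without a canonical
decomposition (ø, æ, ß and so on) pass through unchanged here.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of DSpace or Lucene.
class SwedishKeepingFoldSketch {

    // A-ring, A-umlaut and O-umlaut in upper and lower case.
    private static final String KEEP = "\u00C5\u00C4\u00D6\u00E5\u00E4\u00F6";

    static String fold(String input) {
        // Compose first so that the protected letters are single characters.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (KEEP.indexOf(c) >= 0) {
                // Protected Swedish letter: copy unchanged.
                out.append(c);
            } else {
                // Everything else: decompose and drop the combining marks.
                String d = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                out.append(d.replaceAll("\\p{M}+", ""));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Sk\u00F6ld" keeps its o-umlaut; "Thor\u00E9n" loses its acute accent.
        System.out.println(fold("Sk\u00F6ld").equals("Sk\u00F6ld")); // prints true
        System.out.println(fold("Thor\u00E9n"));                     // prints Thoren
    }
}
```

With such a variant a search for "skold" would no longer match "Sköld", so
the choice is a trade-off between Swedish spelling distinctions and
convenience for users without a Swedish keyboard.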
