Date: 2004-07-08T06:30:01
Editor: 128.230.38.21 <>
Wiki: Jakarta Lucene Wiki
Page: IndexingOtherLanguages
URL: http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages

no comment

Change Log:

------------------------------------------------------------------------------
@@ -10,7 +10,7 @@
  1. Know the encoding of the documents you wish to index. Java assumes the native encoding when reading in files unless you tell it otherwise. To create a Reader that supports reading in other encodings, see [http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStreamReader.html InputStreamReader]. I find it easiest to convert all of my files to UTF-8 before indexing, and then I read them in by doing:[[BR]]
 `Reader reader = new InputStreamReader(new FileInputStream("path to file"), "UTF-8");`
-Note: The demo supplied with Lucene does not support UTF-8 out of the box. You will have to modify it.
+
  2. Identify the Analyzer you will use or write your own if none exists. There are many great analyzers available that will index a wide variety of languages. See [http://jakarta.apache.org/lucene/docs/lucene-sandbox/ Sandbox] for some. Otherwise, look around the web. If you are writing your own, consider donating it to the Lucene Sandbox so that others can benefit from your brilliance. See item 3. below for what is needed in a custom analyzer.
 'Put example of writing an Analyzer here'
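
For anyone following item 1 of the page, here is a minimal sketch of reading a file with an explicit encoding rather than the platform default. It is only an illustration: the class name `EncodingAwareReader` and the methods `open` and `readAll` are hypothetical and not part of Lucene or the wiki page, and the try-with-resources syntax assumes a newer Java than the 1.4.2 release the page links to.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only; not part of Lucene.
public class EncodingAwareReader {

    // Opens a file using the named charset instead of the platform default.
    // The charset name is resolved first, so an unknown name fails before
    // the file is opened.
    public static Reader open(String path, String charsetName) throws IOException {
        Charset charset = Charset.forName(charsetName);
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset));
    }

    // Reads the whole file into a String, decoding with the named charset.
    public static String readAll(String path, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = open(path, charsetName)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

A Reader obtained from something like `EncodingAwareReader.open("path to file", "UTF-8")` could then be passed wherever the indexing code expects a Reader. Decoding the same bytes with the wrong charset name does not raise an error; it silently yields different characters, which is why knowing the source encoding matters.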
