Indexing multiple languages

2005-05-31 Thread Tansley, Robert
Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing. I've looked through the mailing list archives etc. and

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Interesting topic. I thought about this as well. I wanted to index Chinese text with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input, and

Re: Indexing multiple languages

2005-05-31 Thread Erik Hatcher
Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will keep English as-is (removing stop words, lowercasing, and such) and separate CJK characters into separate tokens also. Erik On May 31, 2005, at 5:49 PM, jian chen wrote: Hi, Interesting topic. I thought about this
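
The behaviour Erik describes (English words kept whole, lowercased and stop-filtered, with each CJK character emitted as its own token) can be illustrated with a small standalone sketch. This is not Lucene's StandardAnalyzer and makes no claim to match its grammar; the function names, the stop-word list, and the restriction to the main CJK Unified Ideographs block are all assumptions made for illustration.

```python
# Illustrative sketch only: mimics the described tokenization, not Lucene itself.

# Hypothetical stop-word list (NOT the list shipped with Lucene).
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}


def is_cjk(ch):
    # Assumption: only the main CJK Unified Ideographs block is treated as CJK.
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    tokens = []
    word = []  # characters of the non-CJK word currently being built

    def flush():
        # Emit the pending word, lowercased, unless it is a stop word.
        if word:
            w = "".join(word).lower()
            if w not in STOP_WORDS:
                tokens.append(w)
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)  # each CJK character becomes its own token
        elif ch.isalnum():
            word.append(ch)
        else:
            flush()  # whitespace or punctuation ends the current word
    flush()
    return tokens
```

Under these assumptions, `tokenize("The Lucene 索引 is fast")` yields the English words as whole tokens and the two Chinese characters as two separate tokens.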

Re: Indexing multiple languages

2005-05-31 Thread Erik Hatcher
Robert, I'm very likely going to be using DSpace and some related technologies from the SIMILE project very soon :) On May 31, 2005, at 5:08 PM, Tansley, Robert wrote: Hi all, The DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Erik, Thanks for your info. No, I haven't tried it yet. I will give it a try and maybe produce some Chinese/English text search demo online. I currently use Lucene as the indexing engine for Velocity mailing list search. I have a demo at www.jhsystems.net. It is yet another mailing list s

Re: Indexing multiple languages

2005-06-01 Thread Paul Libbrecht
On 1 June 05, at 01:12, Erik Hatcher wrote: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each language? I would vote for option #2 as it gives the most flexibility - y
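
Option 2 (a single index plus a language field) can be pictured with a toy in-memory search. This is a sketch of the idea only, not Lucene code: the document layout, the field names `id`, `language` and `text`, and the `search` function are hypothetical, and matching is done by naive whitespace splitting.

```python
# Toy illustration of "one index, extra language field"; not a Lucene API.

def search(docs, term, language=None):
    """Return ids of docs containing term, optionally limited to one language."""
    term = term.lower()
    hits = []
    for doc in docs:
        # The language field acts as an optional filter on top of the text match.
        if language is not None and doc["language"] != language:
            continue
        if term in doc["text"].lower().split():
            hits.append(doc["id"])
    return hits


# Hypothetical sample data: the same surface word in two languages.
DOCS = [
    {"id": 1, "language": "en", "text": "chat with the library staff"},
    {"id": 2, "language": "fr", "text": "le chat noir"},
]
```

With this data, an unconstrained search for `chat` matches both documents, while passing `language="fr"` narrows the result to the French one. That is the flexibility the option is credited with: the same index serves both cross-language and single-language queries.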

RE: Indexing multiple languages

2005-06-02 Thread Tansley, Robert
> From: Paul Libbrecht [mailto:[EMAIL PROTECTED] > Sent: 01 June 2005 04:10 > To: java-user@lucene.apache.org > Subject: Re: Indexing multiple languages > > On 1 June 05, at 01:12, Erik Hatcher wrote: > >> 1/ one index for all languages > >> 2/ one in

RE: Indexing multiple languages

2005-06-02 Thread Bob Cheung
Hi Erik, I am a newcomer to this list, so please allow me to ask a dumb question. Will the StandardAnalyzer have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong. Chinese data may come in three different encodings: Big5, GB and UTF8.

Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote: > For the StandardAnalyzer, will it have to be modified to accept > different character encodings. > > We have customers in China, Taiwan and Hong Kong. Chinese data may come > in 3 different encoding: Big5, GB and UTF8. > > What is the default encod
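
On the encoding question: Lucene analyzers consume character streams (a `java.io.Reader`), not raw bytes, so the byte-to-character decoding happens before any analyzer sees the text. The sketch below shows that general idea in Python rather than Java. It assumes the source encoding of each input is already known, and the helper name `decode_to_text` and the codec name `gb2312` (standing in for "GB") are choices made for this illustration.

```python
# Sketch: normalize differently encoded byte streams to Unicode text
# before handing them to any tokenizer or analyzer.

def decode_to_text(raw, encoding):
    # Assumption: the caller knows the encoding; no auto-detection is attempted.
    return raw.decode(encoding)


# The same two characters encoded three ways (sample data for illustration).
SAMPLES = {
    "big5": "中文".encode("big5"),
    "gb2312": "中文".encode("gb2312"),
    "utf-8": "中文".encode("utf-8"),
}

# After decoding, all three inputs are the same Unicode string.
DECODED = {name: decode_to_text(raw, name) for name, raw in SAMPLES.items()}
```

The byte sequences differ between the three encodings, but once decoded with the right codec they are identical text, which is why an analyzer that works on characters needs no per-encoding changes.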

Re: Indexing multiple languages

2005-06-03 Thread Erik Hatcher
On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote: Btw, I did try running the lucene demo (web template) to index the HTML files after I added one including English and Chinese characters. I was not able to search for any Chinese in that HTML file (returned no hits). I wonder whether I need to

Re: Indexing multiple languages

2005-06-03 Thread Grant Ingersoll
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages >>> [EMAIL PROTECTED] 6/3/2005 6:03:31 AM >>> On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote: > Btw, I did try running the lucene demo (web template) to index the > HTML > files after I added one including English and Chinese characters

Re: Indexing multiple languages

2005-06-03 Thread Paul Libbrecht
Robert, On 2 June 05, at 21:42, Tansley, Robert wrote: It seems that there are even more options -- 4/ One index, with a separate Lucene document for each (item,language) combination, with one field that specifies the language 5/ One index, one Lucene document per item, with field names that

RE: Indexing multiple languages

2005-06-03 Thread Max Pfingsthorn
Sent: Friday, June 03, 2005 14:23 To: java-user@lucene.apache.org Subject: Re: Indexing multiple languages Robert, On 2 June 05, at 21:42, Tansley, Robert wrote: > It seems that there are even more options -- > 4/ One index, with a separate Lucene document for each (item,language) > co

Re: Indexing multiple languages

2005-06-03 Thread Doug Cutting
Tansley, Robert wrote: What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each language

RE: Indexing multiple languages

2005-06-03 Thread Bruce Ritchie
> Tansley, Robert wrote: > > What if we're trying to index multiple languages in the > same site? Is > > it best to have: > > > > 1/ one index for all languages > > 2/ one index for all languages, with an extra language field so > > searches can be constrained to a particular language 3/ separ

Re: Indexing multiple languages

2005-06-07 Thread sergiu gordea
Tansley, Robert wrote: Hi all, The DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text content of documents stored in it. Now the system is being used globally, it needs to support multi-language indexing. I've looked through the mail