You could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem?

Would that mean I would actually have to conduct two searches (one in English and one in French), then merge the results and display them to the user?
That sounds like a long way around, so perhaps writing an analyzer with a built-in language guesser would be the better solution in the long run?

It's no problem to guess the language based on the document corpus. But how would you guess the language of a simple Term Query? What if your users are searching for names like "George Bush"? You can't guess the language of such a query, so you have to expand it into both languages. I don't see an easier way to solve that problem.
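Something along these lines might do, as a rough sketch, assuming a single content field and the FrenchAnalyzer from the contrib analyzers package (exact constructor signatures vary between Lucene versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;   // from the contrib analyzers jar
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class BilingualQueryExpander {

    // Parses the user input once per language and ORs the results together,
    // so the query matches regardless of which analyzer was used at index time.
    public static Query expand(String userInput, String field) throws Exception {
        Query english = new QueryParser(field, new StandardAnalyzer()).parse(userInput);
        Query french  = new QueryParser(field, new FrenchAnalyzer()).parse(userInput);

        BooleanQuery combined = new BooleanQuery();
        combined.add(english, BooleanClause.Occur.SHOULD);  // match the English-analyzed form
        combined.add(french,  BooleanClause.Occur.SHOULD);  // or the French-analyzed form
        return combined;
    }
}

Since both clauses are optional (SHOULD), a name like "George Bush" is found whichever analysis happened to be applied when the document was indexed.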




This behaviour is implemented in the StandardTokenizer used by StandardAnalyzer. Look at the documentation of StandardTokenizer:


Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
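A quick way to check what the tokenizer actually does with your text is to dump the tokens it produces, e.g. (a small sketch using the older TokenStream API where next() returns a Token; recent versions use incrementToken() and term attributes instead):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {
    // Prints the tokens an analyzer produces for a sample string,
    // which shows exactly how StandardTokenizer splits the input.
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("George Bush visite l'école"));
        Token token;
        while ((token = ts.next()) != null) {
            System.out.println(token.termText());
        }
    }
}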


Hmm, writing my own tokenizer feels beyond my abilities at the moment without more in-depth knowledge of everything else.
Perhaps I'll try taking the StandardTokenizer and expanding or changing it based on other tokenizers available in Lucene, such as WhitespaceTokenizer.
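That does not necessarily mean writing a grammar from scratch: an analyzer can often be assembled from the tokenizers and filters that already ship with Lucene. A minimal sketch, assuming the older Analyzer API where tokenStream() is overridden directly (newer versions override createComponents() instead):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Analyzer assembled from existing building blocks instead of a hand-written
// grammar: split on whitespace, then lowercase the tokens.
public class SimpleWhitespaceAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}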

What about using the WhitespaceAnalyzer directly? Maybe this fits your requirements better, and you could use it for both languages.
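For example (a rough sketch against the pre-3.0 API, where IndexWriter takes an Analyzer directly and search() returns Hits; newer versions use IndexWriterConfig and TopDocs) — the important point is simply that the same analyzer is used for indexing and for parsing queries:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class WhitespaceExample {
    public static void main(String[] args) throws Exception {
        // Same analyzer at index time and at query time so the terms line up.
        Analyzer analyzer = new WhitespaceAnalyzer();
        RAMDirectory dir = new RAMDirectory();

        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("content", "George Bush visite l'école",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        Query query = new QueryParser("content", analyzer).parse("Bush");
        IndexSearcher searcher = new IndexSearcher(dir);
        System.out.println("hits: " + searcher.search(query).length());
        searcher.close();
    }
}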


Bernhard

