Re: Lucene and multi-lingual Unicode - advice needed

Robert Muir Mon, 15 Jun 2009 14:56:11 -0700

It's not too bad. Here is a simple one that only breaks words on
whitespace and lowercases them:


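import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class Example extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Split the input on whitespace only, then lowercase each token.
    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new LowerCaseFilter(ts);
    return ts;
  }
}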
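
To get a feel for the terms such a chain would put in the index, here is a rough plain-JDK sketch that imitates "split on whitespace, then lowercase" without needing Lucene on the classpath. The class and method names (WhitespaceLowercaseSketch, tokens) are made up for illustration, and the logic is only an approximation of what the real tokenizer and filter do.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK approximation of "whitespace tokenizer + lowercase filter".
// Hypothetical helper for illustration only; it is not part of Lucene.
class WhitespaceLowercaseSketch {

    // Break the text at Character.isWhitespace boundaries and
    // lowercase every remaining character.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // End of a token: emit it if it is non-empty.
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(Character.toLowerCase(c));
            }
        }
        // Emit the trailing token, if any.
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighbouring word.
        System.out.println(tokens("Hello, World!  Привет МИР"));
        // Text written without spaces comes out as a single token.
        System.out.println(tokens("日本語のテキスト"));
    }
}
```

Two things stand out in this sketch: punctuation is kept as part of the token ("hello,"), and text written without spaces (much CJK text, for instance) ends up as one long token. That is why the questions below matter.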

Can you give a better idea of what languages you have and what your
search requirements are (accent marks, punctuation, etc.)?
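
To make the accent-mark question concrete: if searches should ignore accents, one common plain-JDK approach is to decompose the text and drop the combining marks. The sketch below assumes that requirement; the names AccentFoldSketch and fold are hypothetical, and this is not a Lucene filter, just an illustration of the kind of decision involved.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; it is not part of Lucene.
class AccentFoldSketch {

    // Decompose to NFD so that accented letters become base letter +
    // combining mark, then remove the combining marks (category M).
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("café"));   // accent removed
        System.out.println(fold("Ärger"));  // umlaut removed
        System.out.println(fold("smørrebrød")); // "ø" has no decomposition, so it is kept
    }
}
```

Whether this kind of folding is desirable depends on the language: in some languages the accented and unaccented forms are different words, so it is a per-language search requirement and not a safe default.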

On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail<[email protected]> wrote:
> I've looked over SolR quickly; it is a bit too heavy for my project.
> So what is required (at a minimum) to build an analyzer? The sandbox has a few of
> them, varying in complexity.
>
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Monday, June 15, 2009 4:51 PM
> To: [email protected]
> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>
> Well just reply back if SolR is inappropriate for your needs.
>
> In that case, you will need to build a custom analyzer (it's not too
> bad) so that you can use Compass.
>
> On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail<[email protected]> 
> wrote:
>> Hi,
>>
>> My goal is to find a framework that encapsulates as much low level 
>> indexing/search technology as possible and have it integrate nicely with 
>> Spring.
>> It looked like Compass was/is a good encapsulation of the functionality. 
>> I'll take a look at SolR though, thanks for the pointer.
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:[email protected]]
>> Sent: Monday, June 15, 2009 1:14 PM
>> To: [email protected]
>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>
>> Hi,
>>
>> (Since this is an issue you brought up on the Compass forums)
>>
>> I wonder what stage of the development process you are at.
>> Have you considered SolR, or does Compass provide some other
>> functionality that you need?
>>
>> The reason I ask is that the easiest solution might be to use
>> a nightly SolR for your application.
>>
>> I'm not personally biased one way or the other for any particular
>> framework, but recently there have been some improvements added to SolR
>> so that the default type 'text' is pretty good for multilingual
>> processing.
>>
>> In fact, I hope this will be improved in Lucene in the future, so that
>> your decision is really based upon other application needs...
>>
>> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<[email protected]> 
>> wrote:
>>> Hi All!
>>>
>>>
>>>
>>> I'm new to Lucene so forgive me if this question was asked before.
>>>
>>> I have a database with records in the same table in many different languages
>>> (up to 70), including Western European, Arabic, Eastern, CJK, Cyrillic, etc.;
>>> you name it.
>>> I've looked at what people say about Lucene, and it looks like for the most
>>> part standard analyzers should do fine with most Unicode languages, but there
>>> are quite a few exceptions.
>>> Here is a recently updated Lucene Jira issue:
>>> https://issues.apache.org/jira/browse/LUCENE-1488
>>>
>>> My question is, what would be the safest bet for me in terms of
>>> analyzers/tokenizers?
>>> Do I really have to write my own for the languages that are
>>> not supported?
>>> Has anyone already solved a problem similar to mine? I'm sure someone
>>> already has :)
>>>
>>> And yes, I looked at the Lucene sandbox analyzers. They just add more
>>> confusion. For example, why are there analyzers for DE and FR? Wouldn't the
>>> standard analyzer (which is Unicode compliant, as I understood) deal with EU
>>> languages just fine?
>>>
>>> Thanks in advance for any advice :)
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> [email protected]
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
>
> --
> Robert Muir
> [email protected]
>
>



-- 
Robert Muir
[email protected]

