These are familiar. Any other approaches that people use? I guess I'm hoping ...
On 4/6/2014 7:37 AM, Benson Margulies wrote:
On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <> wrote:

Just curious, what are some of the things that people do to properly
tokenize queries against mixed-language collections?  What do you do with
mixed-language queries?

You can either force the user to tell you the language, or ...

you can run a language detector. They are less accurate for short
strings, or ...

you can process it in _all_ of the languages and OR up the results.
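
A minimal sketch of the third option, assuming the index uses one field per language as described in the quoted message below (text_eng, text_spa, text_jpn). The class and method names are hypothetical. It only builds a query string in Lucene's classic query syntax, where field:(...) scopes a group of terms to one field; parsing it with an analyzer suited to each field is left out, and special characters in the user's input are not escaped here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: repeats the raw user query against every
// per-language field and ORs the clauses together.
final class AllLanguagesQuery {

    // Language codes follow the text_eng / text_spa / text_jpn field names.
    static final List<String> LANGS = Arrays.asList("eng", "spa", "jpn");

    // Assumption: the caller has already escaped query-syntax characters.
    static String orAcrossLanguages(String userQuery, List<String> langs) {
        return langs.stream()
                .map(lang -> "text_" + lang + ":(" + userQuery + ")")
                .collect(Collectors.joining(" OR "));
    }
}
```

The resulting string is meant to be handed to a query parser whose analyzer differs per field, so the same input is tokenized once per language.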

On 4/6/2014 4:51 AM, Benson Margulies wrote:

You must know what language each text is in, and use an appropriate
analyzer. Some people do this by using a separate field (text_eng,
text_spa, text_jpn). Other people put some extra information at the
beginning of the field, and then make an analyzer that peeks in order to
dispatch to the correct tokenizer.
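
A rough sketch of the "peek and dispatch" idea, in plain Java rather than against the Lucene Analyzer API. Everything here is an assumption made for illustration: the "lang|" prefix format, the class name, and the stand-in tokenizers (lowercased whitespace splitting for English and Spanish, one token per character for Japanese, which is not what Kuromoji does). For the separate-field variant, Lucene's PerFieldAnalyzerWrapper is the usual way to attach a different analyzer to each field.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical dispatcher: reads a language tag at the start of the
// field value and routes the rest to a language-specific tokenizer.
final class LanguageDispatch {

    static final Map<String, Function<String, List<String>>> TOKENIZERS = new HashMap<>();

    static {
        // Stand-in for a Western-language analyzer: lowercase, split on whitespace.
        Function<String, List<String>> whitespace =
                text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        TOKENIZERS.put("eng", whitespace);
        TOKENIZERS.put("spa", whitespace);
        // Stand-in for a Japanese analyzer: one token per non-space code point.
        TOKENIZERS.put("jpn", text -> text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList()));
    }

    // Field name for the separate-field layout (text_eng, text_spa, text_jpn).
    static String fieldFor(String lang) {
        return "text_" + lang;
    }

    // Expects input shaped like "jpn|..." (assumed prefix format).
    static List<String> tokenize(String tagged) {
        int sep = tagged.indexOf('|');
        if (sep < 0) {
            throw new IllegalArgumentException("missing language prefix");
        }
        String lang = tagged.substring(0, sep);
        Function<String, List<String>> tokenizer = TOKENIZERS.get(lang);
        if (tokenizer == null) {
            throw new IllegalArgumentException("unsupported language: " + lang);
        }
        return tokenizer.apply(tagged.substring(sep + 1));
    }
}
```

Either way, the language has to be known at index time; the prefix just moves that knowledge from the field name into the field value.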

On Sat, Apr 5, 2014 at 9:59 PM, <> wrote:

I am pretty new to Lucene; however, I have no problem understanding what it
is about.
My big problem is trying to understand how Kuromoji works. I need to
implement search functionality that initially supports English, Spanish
and Japanese. The first two don't seem to be a big deal, as I can
just use analyzers-common to index content in both languages, but
Japanese has its own analyzer. I couldn't find any clues about
combining analyzers, so I still don't know whether I can combine all languages
in the same index (which would be ideal, as I expect mixed searches in the
context of my project) or whether I have to detect the language first and then
index Japanese texts separately (which would be a big disadvantage when it
comes to mixed searches and future localization expansion).
I found out about Lucene through Kuromoji; it would be great to find a
solution that lets me use all the greatness that Lucene offers.
