RE: Choosing tokenizer based on language of document

2012-04-10 Thread Prakashganesh, Prabhu
...difficult to get the query language. Any thoughts/ideas on turning stemming on/off?
Thanks
Prabhu
-----Original Message-----
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
Sent: 06 April 2012 10:58
To: solr-user@lucene.apache.org
Subject: Re: Choosing tokenizer based on language of document

Re: Choosing tokenizer based on language of document

2012-04-06 Thread Dominique Bejean
Hi, Yes, I agree it is not an easy issue. Indexing all languages with the appropriate char filter, tokenizer and filters for each language is not possible without a new text type and new analyzer development. If you plan to index up to 10 different languages, I suggest one text field per language
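
A minimal sketch of the "one text field per language" idea on the indexing client side, assuming the document's language code is already known (as in the original question). The field names (text_en, text_fr, ..., text_general), the language list and both helper functions are hypothetical and not taken from the thread. The schema would need to declare each such field with a field type whose analysis chain suits that language.

```python
# Hypothetical naming convention: one text field per language, e.g. "text_en".
# The language list and the fallback field are assumptions for illustration.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "es"}
FALLBACK_FIELD = "text_general"  # assumed catch-all field for other languages


def text_field_for(language):
    """Return the name of the text field to use for a given language code."""
    code = language.strip().lower()
    if code in SUPPORTED_LANGUAGES:
        return "text_" + code
    return FALLBACK_FIELD


def build_solr_doc(doc_id, language, text):
    """Build a document dict that puts the text into the per-language field."""
    return {
        "id": doc_id,
        "language": language,
        text_field_for(language): text,
    }


if __name__ == "__main__":
    print(build_solr_doc("doc-1", "fr", "Bonjour le monde"))
```

With this layout the choice of tokenizer is made by the schema, per field, rather than at index time per document. The query side still has to decide which of these fields to search.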

Re: Choosing tokenizer based on language of document

2012-04-05 Thread Erick Erickson
This is really difficult to imagine working well. Even if you do choose the appropriate analysis chain (and it must be a chain here), and manage to appropriately tokenize for each language, what happens at query time? How do you expect to get matches on, say, Ukrainian when the tokens of the query

Choosing tokenizer based on language of document

2012-04-04 Thread Prakashganesh, Prabhu
Hi, I have documents in different languages and I want to choose the tokenizer to use for a document based on the language of the document. The language of the document is already known and is indexed in a field. What I want to do is when I index the text in the document, I want to choose