Re: Dynamic analyzer settings change

2013-09-11 Thread Erick Erickson
I wouldn't :). Here's the problem. Say you do this successfully at
index time. How do you then search reasonably? A query often carries
nowhere near enough information to identify the search language;
there's little or no context.

If the number of languages is limited, people often index into separate
language-specific fields, say title_fr and title_en, and use edismax
to automatically distribute queries across all the fields.
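
A minimal schema sketch of that per-language layout, assuming just English
and French. The type names text_en and text_fr are illustrative assumptions,
and the analysis chain simply reuses the one from the question quoted below,
with only the stemmer language changed:

```xml
<!-- Sketch only: type and field names are assumptions, not from the thread. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- English stemming for the English-only field -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same chain, French stemmer -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<!-- One field per language; each document fills only the one that applies. -->
<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_fr" type="text_fr" indexed="true" stored="true"/>
```

Each document then populates only the field matching its language, and an
edismax query lists both fields in qf.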

Others index families of languages in separate fields: one field using
things like the folding filters for Western languages, another for, say,
CJK languages, another for Middle Eastern languages, etc.

FWIW,
Erick


On Wed, Sep 11, 2013 at 6:55 AM, maephisto my_sky...@yahoo.com wrote:

 Let's take the following type definition and schema (borrowed from Rafal
 Kuc's Solr 4 cookbook):

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
   </analyzer>
 </fieldType>

 and schema:

 <field name="id" type="string" indexed="true" stored="true"
        required="true" />
 <field name="title" type="text" indexed="true" stored="true" />

 The above analyzer applies the SnowballPorterFilter with the English
 language setting. Would it be possible to change the language to French
 during indexing for some documents? If not, what would be the best
 solution for having the same analyzer but with different languages, with
 the language being determined at index time?

 Thanks!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dynamic analyzer settings change

2013-09-11 Thread maephisto
Thanks, Erick!

I might have missed mentioning something relevant. When querying Solr, I
wouldn't actually need to query all fields, but only the one corresponding
to the language picked by the user on the website. If he's using DE, then
the search should only apply to the text_de field.

What if I need to work with 50 different languages?
Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
text_de, ...). Won't this affect performance? Bigger documents mean
slower queries.





RE: Dynamic analyzer settings change

2013-09-11 Thread Markus Jelsma


 
 

Yes, that will affect performance greatly! With (e)dismax, the problem is
not searching 50 languages but constructing the entire query across all
those fields. When debugging, you will see good performance in the `process`
part of a search but poor performance in the `prepare` part.

 
 
 
 


Re: Dynamic analyzer settings change

2013-09-11 Thread Jack Krupansky
Yes, supporting multiple languages will be a performance hit, but maybe it 
won't be so bad since all but one of these language-specific fields will be 
empty for each document and Lucene text search should handle empty field 
values just fine. If you can't accept that performance hit, don't support 
multiple languages! It is completely your choice.


There are index-time update processors that can do language detection and 
then automatically direct the text to the proper text_xx field.


See:
https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing

Although my e-book has a lot better examples, especially for the field 
redirection aspect.
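
For orientation, a minimal solrconfig.xml sketch of such a chain, using the
language identification processor that ships in Solr's langid contrib. The
chain name, the field names (title, language), and the language list are
assumptions for illustration, and the contrib jars must be on the classpath:

```xml
<!-- Sketch only: chain name, field names and language list are assumptions. -->
<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Field(s) whose content is used for detection -->
    <str name="langid.fl">title</str>
    <!-- Field that receives the detected language code -->
    <str name="langid.langField">language</str>
    <!-- Rename the detected fields, e.g. title becomes title_fr -->
    <str name="langid.map">true</str>
    <!-- Only these languages are accepted; anything else uses the fallback -->
    <str name="langid.whitelist">en,fr,de</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The schema still needs a matching title_xx field (or a dynamic field) for
every language the chain can emit.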


-- Jack Krupansky




Re: Dynamic analyzer settings change

2013-09-11 Thread maephisto
Thanks Jack! Indeed, very nice examples in your book.

Inspired by that, here's a crazy idea: would it be possible to build a
custom processor chain that would detect the language and use it to apply
filters, like the aforementioned SnowballPorterFilter?
That would leave at the end a document with these fields: text (with the
filtered content) and language (the one determined by the processor).
And at search time, always append language=<user-selected language>.

Does this make sense? If so, would it affect the performance at index time?
Thanks!





Re: Dynamic analyzer settings change

2013-09-11 Thread Erick Erickson
You're still in danger of overly broad hits. When you
stem differently into the _same_ underlying field, you
get terms that make sense in one language but are
totally bogus in another, and those bogus terms can
still match the query.

As far as lots and lots of fields are concerned, if you
want to restrict your searches to only one language,
you have a couple of choices here.

Consider a different core per language. Solr easily
handles many cores per server. Now you have no
'wasted' space; the analysis chain for the German core
simply uses the DE-specific stemmer, which you can
extend with German decompounding etc.
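
As a rough sketch of what the field type in a German-only core's schema
might look like: the dictionary file name german-words.txt and the size
settings are assumptions, and the filter order is one plausible choice
rather than a recommendation from this thread:

```xml
<!-- Sketch only: dictionary file and size settings are assumptions. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Splits compound words using a word list you supply -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="german-words.txt"
            minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <!-- German stemming instead of English -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
```

Because every core has its own schema, the field can keep the same name
(title, text) in all cores, and the client just picks the core by language.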

Alternatively, you can form your queries with some
care. There's nothing that requires, say, edismax to
be specified in solrconfig.xml. Anything you would
put in the defaults section of the config you can
override with request parameters. So, for instance,
if you knew you were querying in French, you could
form something like (going from memory)
defType=edismax&qf=title_fr text_fr
or, for German,
qf=title_de text_de

and so completely avoid cross-language searching.
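
For reference, the defaults section in question looks roughly like the
sketch below. The handler name and the English fallback fields are
assumptions; the point is only that a request sending its own qf replaces
the value configured here:

```xml
<!-- Sketch only: handler name and fallback fields are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Used only when the request does not supply its own qf -->
    <str name="qf">title_en text_en</str>
  </lst>
</requestHandler>
```

Parameters placed under defaults are overridable per request, unlike those
placed under invariants, so one handler can serve every language.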

Or you could simply include a field that holds the
language and tack on an fq clause like fq=language:de
(assuming the field is named language).

But you haven't told us how big your problem is. I wouldn't
worry at all about efficiency at this stage if you have, say,
10M documents; I'd just try the simplest thing first and
measure.

500M documents is probably another story.

FWIW
Erick

