Re: Dynamic analizer settings change
You're still in danger of overly broad hits. When you stem differently into the _same_ underlying field, you get tokens that make sense in one language but are totally bogus matches for a query in another language.

As far as lots and lots of fields is concerned, if you want to restrict your searches to only one language, you have a couple of choices here.

Consider a different core per language. Solr easily handles many cores per server. Now you have no 'wasted' space; it just happens that the analysis chain for the German core uses the DE-specific stemmers, which you can extend to German de-compounding etc.

Alternatively, you can form your queries with some care. There's nothing that requires, say, edismax to be specified in solrconfig.xml. Anything you would put in the defaults section of the config you can override in the request parameters. So, for instance, if you knew you were querying in French, you could form something like (going from memory) defType=edismax&qf=title_fr+text_fr, or &qf=title_de+text_de for German, and so completely avoid cross-language searching.

Or you could simply include a field that holds the language and tack on an fq clause like fq=language:de.

But you haven't told us how big your problem is. I wouldn't worry at all about efficiency at this stage. If you have, say, 10M documents, I'd just try the simplest thing first and measure. 500M documents is probably another story.

FWIW,
Erick

On Wed, Sep 11, 2013 at 9:50 AM, maephisto wrote:
> Inspired from there, here's a crazy idea: would it be possible to build a
> custom processor chain that would detect the language and use it to apply
> filters, like the aforementioned SnowballPorterFilter.
> [...]
> Does this make sense? If so, would it affect the performance at index time?
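A minimal sketch of the per-request approach Erick describes, written in Python purely for illustration. It only builds the query string and does not contact Solr. The function name and the `language` filter field are assumptions (the field is the one maephisto proposes, not something confirmed to exist in any schema); the title_xx/text_xx names follow the pattern in the message above.

```python
from urllib.parse import urlencode


def build_language_query(user_query, lang, base_fields=("title", "text")):
    """Return a URL-encoded Solr query string that searches only `lang` fields."""
    # Build e.g. "title_fr text_fr"; qf entries are separated by whitespace.
    qf = " ".join(f"{field}_{lang}" for field in base_fields)
    params = [
        ("q", user_query),
        ("defType", "edismax"),      # chosen per request, not in solrconfig.xml
        ("qf", qf),
        ("fq", f"language:{lang}"),  # assumes a hypothetical `language` field
    ]
    return urlencode(params)
```

Calling `build_language_query("maison", "fr")` gives a string that can be appended to a select handler URL. Whether the fq clause is needed on top of the language-specific qf depends on how the documents were indexed.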
Re: Dynamic analizer settings change
Thanks Jack! Indeed, very nice examples in your book.

Inspired by them, here's a crazy idea: would it be possible to build a custom processor chain that would detect the language and use it to apply filters, like the aforementioned SnowballPorterFilter? That would leave, at the end, a document with two fields: text (with the filtered content) and language (the one determined by the processor). And at search time, always append the language=.

Does this make sense? If so, would it affect the performance at index time?

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dynamic analizer settings change
Yes, supporting multiple languages will be a performance hit, but maybe it won't be so bad, since all but one of these language-specific fields will be empty for each document, and Lucene text search should handle empty field values just fine. If you can't accept that performance hit, don't support multiple languages! It is completely your choice.

There are index-time update processors that can do language detection and then automatically direct the text to the proper text_xx field. See:
https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing

My e-book has a lot better examples, though, especially for the field redirection aspect.

-- Jack Krupansky

-Original Message-
From: maephisto
Sent: Wednesday, September 11, 2013 8:33 AM
To: solr-user@lucene.apache.org
Subject: Re: Dynamic analizer settings change

[...] What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance? Bigger documents -> slower queries.
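To make the field redirection Jack mentions concrete, here is a small client-side sketch in Python. It is not Solr's update processor and does no language detection itself: the detected language code is passed in, and the function name, the supported-language list, and the fallback are all assumptions made for illustration.

```python
def route_text_field(doc, detected_lang, supported=("en", "fr", "de"), fallback="en"):
    """Return a copy of `doc` with `text` moved to a language-specific field."""
    # Unsupported or unknown languages fall back to a default field.
    lang = detected_lang if detected_lang in supported else fallback
    # Copy every field except the generic `text` field.
    routed = {name: value for name, value in doc.items() if name != "text"}
    routed[f"text_{lang}"] = doc["text"]  # e.g. text_de
    routed["language"] = lang             # records the language actually used
    return routed
```

For example, `route_text_field({"id": "1", "text": "Guten Tag"}, "de")` yields a document with `text_de` and `language` fields and no generic `text` field, so only one of the text_xx fields is populated per document.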
RE: Dynamic analizer settings change
-Original message-
> From: maephisto
> Sent: Wednesday 11th September 2013 14:34
> To: solr-user@lucene.apache.org
> Subject: Re: Dynamic analizer settings change
>
> [...]
> What if I need to work with 50 different languages?
> Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
> text_de, ...): won't this affect the performance? Bigger documents ->
> slower queries.

Yes, that will affect performance greatly! The problem is not searching 50 languages. With (e)dismax, the problem is building the entire query. When debugging, you will see good performance in the `process` part of a search but poor performance in the `prepare` part.
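One way to check the prepare/process split described above is to read the timing section of a debug response. The sketch below, in Python, assumes a response requested with debugQuery=true and wt=json and already parsed into a dict. The helper name is hypothetical, and the numbers in the sample are invented for illustration, not measurements.

```python
def timing_summary(response):
    """Extract total, prepare, and process times (ms) from a Solr debug response."""
    timing = response["debug"]["timing"]
    return {
        "total_ms": timing["time"],
        "prepare_ms": timing["prepare"]["time"],  # query construction phase
        "process_ms": timing["process"]["time"],  # actual search phase
    }


# Invented sample shaped like the debug section of a JSON response.
sample_response = {
    "debug": {
        "timing": {
            "time": 130.0,
            "prepare": {"time": 120.0, "query": {"time": 120.0}},
            "process": {"time": 10.0, "query": {"time": 10.0}},
        }
    }
}

summary = timing_summary(sample_response)
```

If `prepare_ms` dominates `process_ms` as the number of qf fields grows, that would match the behaviour described in the reply.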
Re: Dynamic analizer settings change
Thanks, Erick!

I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If the user picks DE, then the search should only apply to the text_de field.

What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance? Bigger documents -> slower queries.

--
View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html
Re: Dynamic analizer settings change
I wouldn't :). Here's the problem. Say you do this successfully at index time. How do you then search reasonably? There's often not nearly enough information to know what the search language is; there's little or no context.

If the number of languages is limited, people often index into separate language-specific fields, say title_fr and title_en, and use edismax to automatically distribute queries against all the fields.

Others index "families" of languages in separate fields: one field using things like the folding filters for Western languages, another field for, say, CJK languages, another for Middle Eastern languages, etc.

FWIW,
Erick

On Wed, Sep 11, 2013 at 6:55 AM, maephisto wrote:
> [...]
> The above analizer will apply SnowballPorterFilter english language filter.
> But would it be possible to change the language to french during indexing
> for some documents. is this possible? If not, what would be the best
> solution for having the same analizer but with different languages, which
> languange being determined at index time ?
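The "families" idea in Erick's reply can be sketched as a simple lookup, shown here in Python. Everything named below is hypothetical: the thread gives no concrete field names, and the language-to-family table is a deliberately incomplete illustration rather than a recommended grouping.

```python
# Hypothetical field names; the thread names no concrete "family" fields.
FAMILY_FIELDS = {
    "western": "text_western",
    "cjk": "text_cjk",
    "middle_eastern": "text_me",
}

# Illustrative, incomplete mapping from ISO 639-1 codes to a family.
LANGUAGE_FAMILY = {
    "en": "western", "fr": "western", "de": "western",
    "zh": "cjk", "ja": "cjk", "ko": "cjk",
    "ar": "middle_eastern", "he": "middle_eastern", "fa": "middle_eastern",
}


def family_field(lang, default_family="western"):
    """Return the shared field that documents in `lang` would be indexed into."""
    family = LANGUAGE_FAMILY.get(lang, default_family)
    return FAMILY_FIELDS[family]
```

With this layout a query only needs to cover a handful of family fields instead of one field per language, at the cost of less precise, language-neutral analysis within each family.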
Dynamic analizer settings change
Let's take the following type definition and schema (borrowed from Rafal Kuc's Solr 4 cookbook):

[type definition XML not preserved in the archive]

and schema:

[schema XML not preserved in the archive]

The above analyzer will apply the SnowballPorterFilter with the English language setting. But would it be possible to change the language to French during indexing for some documents? If not, what would be the best solution for having the same analyzer but with different languages, with the language being determined at index time?

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html