Re: How to implement multilingual word components fields schema?
Hi Ilia,

I see that Trey answered your question about how you might stack language-specific filters in one field. If I remember correctly, his approach assumes you have identified the language of the query. That is not the same as detecting the script of the query, and it is much harder. Trying to do language-specific processing on many languages, especially a large number such as the 200 you mention or the 400 in HathiTrust, is a very difficult problem. Detecting language (rather than script) in short queries is an open problem in the research literature.

As others have suggested, you might want to start with something less ambitious that meets most of your business needs. You also might want to consider whether the errors a stemmer makes on some queries are worth the increase in recall you will get on others. Concern about returning results that confuse users is one of the main reasons we haven't seriously pursued stemming in HathiTrust full-text search.

Regarding the papers listed in my previous e-mail: you can get the first paper at the link I gave, and the second paper (although on re-reading it, I don't think it will be very useful) is available if you go to the link for the code and follow the link on that page for the paper.

I suspect you might want to think about the differences between scripts and languages. Most of the Solr/Lucene stemmers either assume you are only giving them the language they are designed for, or work on the basis of script. That works well when there is only one language per script, but breaks down when many languages share the same script, as the languages written in the Latin script do.

(All of the URLs below, with context, are also collected in this gist: https://gist.github.com/anonymous/2e1233d80f37354001a3)

That PolyGlotStemming filter uses the ICUTokenizer's script identification, but there are at least 12 different languages that use the Arabic script (http://www.omniglot.com/writing/arabic.htm) and over 100 that use the Latin script. See the lists of languages and scripts at http://aspell.net/man-html/Supported.html or http://www.omniglot.com/writing/langalph.htm#latin

As a simple example, German and English are both written in the Latin script. Using an English stemmer for German, or a German stemmer for English, is unlikely to work very well. If you try to use stop words for multiple languages, you will run into cases where a stop word in one language is a content word in another. For example, if you use German stop words such as "die", you will eliminate the English content word "die".

Identifying languages in short texts such as queries is a hard problem. About half the papers on query language identification cheat and look at signals such as the language of the pages a user has clicked on. If all you have to go on is the text of the query, language identification is very difficult, and I suspect that mixed-script queries are even harder (see http://www.transacl.org/wp-content/uploads/2014/02/38.pdf).

See the papers by Marco Lui and Tim Baldwin on Marco's web page: http://ww2.cs.mu.oz.au/~mlui/ In this paper they explain why short-text language identification is a hard problem: "Language Identification: the Long and the Short of the Matter" http://www.aclweb.org/anthology/N10-1027 Other papers on Marco's page describe the design and implementation of langid.py, which is a state-of-the-art language identification program.

I've tried several language guessers designed for short texts and, at least on queries from our query logs, the results were unusable. Both langid.py with the defaults (noEnglish.langid.gz, pipe delimited) and ldig with the most recent latin.model (NonEnglish.ldig.gz, tab delimited) did not work well for our queries. However, both have parameters that can be tweaked, and both have facilities for training if you have labeled data.

ldig is specifically designed to run on short text such as queries or tweets. It can be configured to output the scores for each language instead of only the highest score (the default). Also, we didn't try limiting the list of languages it looks for, which might give better results. https://github.com/shuyo/ldig

langdetect looks like it's by the same programmer and is in Java, but I haven't tried it: https://code.google.com/p/language-detection/

langid.py is designed by language-identification researchers, but may need to be trained on short queries. https://github.com/saffsd/langid.py

There is also Mike McCandless' port of the Google Compact Language Detector: http://blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html and https://code.google.com/p/chromium-compact-language-detector/source/browse/README However, note this comment there: "Indeed I see the same results as you; I think CLD2 is just not designed for short text." A similar comment was made in this talk: videolectures.net/rus
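Tom's "die" example is easy to check against the stop word lists that ship with Lucene's analyzers. A minimal sketch (the class name is mine; the package paths follow the Lucene 4.x layout referenced elsewhere in the thread):

import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

public class StopWordCollision {
    public static void main(String[] args) {
        // Default stop word sets shipped with the German and English analyzers.
        CharArraySet germanStops = GermanAnalyzer.getDefaultStopSet();
        CharArraySet englishStops = EnglishAnalyzer.getDefaultStopSet();

        // "die" is a German article (stop word) but an English content word.
        System.out.println("German list contains 'die':  " + germanStops.contains("die"));   // true
        System.out.println("English list contains 'die': " + englishStops.contains("die"));  // false
    }
}

Stacking both stop lists in a single field would therefore drop the English content word "die" along with the German article.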
Re: How to implement multilingual word components fields schema?
Ilia,

One aspect you surely lose with a single-field approach is the differentiation of semantic fields across languages for words that sound the same. The words "sitting" and "directions" are easy examples that have completely different semantics in French and English, at least: "directions" would commonly appear alongside, say, teacher advice in English, but not in French.

I disagree that storage should be an issue in your case; most Solr installations do not suffer from that, as far as I can tell from the list. Generally, you do not need all of these stemmed fields to be stored. They are just indexed, and that takes very little storage. Using separate fields also has advantages in terms of IDF, I think.

I do not understand the last question to Tom; he provides URLs to at least one of the papers.

Also, if you can get hold of it, the book by Peters, Braschler, and Clough is probably relevant: http://link.springer.com/book/10.1007%2F978-3-642-23008-0 But, as the first article referenced by Tom says, the CLIR approach there relies on parallel corpora, e.g. created by automatic translation.

Paul
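Paul's point that per-language copies are cheap as long as they are indexed but not stored can be illustrated at the Lucene level. A minimal sketch (field names such as text_en and text_de are hypothetical; in Solr the same thing is expressed with stored="false" on the per-language fields):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class MultiFieldDoc {
    // Index the same source text into several per-language fields.
    // Only one stored copy is kept for display; the analyzed copies
    // contribute postings (and per-field IDF statistics) but no stored bytes.
    static Document build(String text) {
        Document doc = new Document();
        doc.add(new StoredField("text_display", text));           // stored once, not analyzed
        doc.add(new TextField("text_general", text, Field.Store.NO));
        doc.add(new TextField("text_en", text, Field.Store.NO));  // analyzed by the chain mapped to this field
        doc.add(new TextField("text_de", text, Field.Store.NO));  // e.g. via PerFieldAnalyzerWrapper or a Solr fieldType
        return doc;
    }
}

The extra fields add postings but no stored copies of the text, which is why index growth stays modest.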
Re: How to implement multilingual word components fields schema?
Hi Ilia,

When writing *Solr in Action*, I implemented a feature which can do what you're asking (allow multiple, dynamic analyzers to be used in a single text field). This would allow you to use the same field and dynamically change the analyzers (for example, you could do language identification on documents and only stem in the identified languages). It also supports more than one Analyzer per field (i.e. if you have single documents or queries containing multiple languages).

This seems to be a feature request which comes up regularly, so I just opened a JIRA issue to add this capability to Solr and track the progress: https://issues.apache.org/jira/browse/SOLR-6492 I included a comment showing how to use the functionality as currently described in *Solr in Action*, but I plan to make it easier to use over the next two months before calling it done. I'm going to be talking about multilingual search in November at Lucene/Solr Revolution, so I'd ideally like to finish before then so I can demonstrate it there.

Thanks,

-Trey Grainger
Director of Engineering, Search & Analytics @ CareerBuilder
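The following is not Trey's implementation from *Solr in Action*, just a minimal sketch of the general idea of routing text to a per-language analyzer once a language has been identified. The detected language code is assumed to come from whatever identifier you use (langid.py, langdetect, or Solr's langid update processor), and the no-argument analyzer constructors assume a recent Lucene release (older 4.x versions required a Version argument):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LanguageRoutedAnalysis {
    private final Map<String, Analyzer> byLanguage = new HashMap<>();
    private final Analyzer fallback = new StandardAnalyzer();

    public LanguageRoutedAnalysis() {
        byLanguage.put("en", new EnglishAnalyzer());
        byLanguage.put("de", new GermanAnalyzer());
    }

    /** Analyze text with the chain registered for the identified language. */
    public void analyze(String text, String detectedLanguage) throws IOException {
        Analyzer chosen = byLanguage.getOrDefault(detectedLanguage, fallback);
        try (TokenStream ts = chosen.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}

Per the message above, Trey's feature does this switching inside a single Solr text field; the sketch only shows the selection step outside Solr.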
Re: How to implement multilingual word components fields schema?
In one of his talks, Trey Grainger (author of Solr in Action) touches on how CareerBuilder is dealing with multilingual content using payloads. It's a little more work, but I think it would pay off.
Re: How to implement multilingual word components fields schema?
You also need to take a stance on whether you wish to auto-detect the language at query time, have a UI selection of language, or attempt to perform the same query for each available language and then "determine" which has the best "relevancy". The latter two options are very sensitive to short queries. Keep in mind that auto-detection for indexing full documents is a different problem than auto-detection for very short queries.

-- Jack Krupansky
Re: How to implement multilingual word components fields schema?
Thank you for the replies, guys!

Using a field-per-language approach for multilingual content is the last thing I would try, since my actual task is to implement search functionality that provides roughly the same capabilities for every known world language. The closest references are the popular web search engines; they seem to serve worldwide users with their different languages, and even cross-language queries as well. Thus, a field-per-language approach would be a sure waste of storage resources due to the high number of duplicates, since there are over 200 known languages. I really would like to keep a single field for cross-language searchable text content, without splitting it into language-specific fields or language-specific cores.

So my current choice will be to stay with just the ICUTokenizer and ICUFoldingFilter as they are, without any language-specific stemmers/lemmatizers at all yet.

Probably I will put the stop word filters and stemmers for the most popular languages into the same searchable text field to give it a try and see if it works correctly in a stack. Does stacking language-specific filters work correctly in one field?

Further development will most likely involve some advanced custom analyzers like the "SimplePolyGlotStemmingTokenFilter" which utilize the ICU-generated ScriptAttribute.
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java

So I would like to know more about those "academic papers on this issue of how best to deal with mixed language/mixed script queries and documents". Tom, could you please share them?
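For reference, the baseline Ilia describes (ICUTokenizer plus ICUFoldingFilter, no stemming) can be sketched as a Lucene Analyzer. This is a minimal sketch using the Lucene 4.x Analyzer API (createComponents still takes a Reader in that line); in Solr the same chain would normally be declared as a fieldType in schema.xml:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

/** Script-agnostic baseline: ICU segmentation plus ICU folding, no stemming. */
public class IcuBaselineAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new ICUTokenizer(reader);      // splits on UAX#29 boundaries and script changes
        TokenStream sink = new ICUFoldingFilter(source);  // case folding, diacritics, width normalization
        return new TokenStreamComponents(source, sink);
    }
}

Any language-specific stop filters or stemmers would be appended to the sink end of this chain, which is exactly where the stacking question comes in: every appended filter sees every token, whatever its language.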
Re: How to implement multilingual word components fields schema?
Hi Ilia,

I don't know if it will be helpful, but below I've listed some academic papers on this issue of how best to deal with mixed-language/mixed-script queries and documents. They probably take a more complex approach than you will want to use, but perhaps they will help in thinking about the various ways of approaching the problem.

We haven't tackled this problem yet. We have over 200 languages. Currently we are using the ICUTokenizer and ICUFoldingFilter, but we don't do any stemming due to a concern with overstemming (we have very high recall, so we don't want to hurt precision by stemming) and the difficulty of correct language identification for short queries.

If you have scripts that are used by only one language, however, you might be able to do much more. I'm not sure I'm remembering correctly, but I believe some of the stemmers, such as the Greek stemmer, will pass through any strings that don't contain characters in the Greek script. So it might be possible to at least do stemming on some of your languages/scripts.

I'll be very interested to learn what approach you end up using.

Tom

--
Some papers:

Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and weighting of multilingual and mixed documents. In *Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA, 161-170. DOI=10.1145/2072221.2072240 http://doi.acm.org/10.1145/2072221.2072240

That paper and some others are here: http://www.husseinsspace.com/research/students/mohammedmustafaali.html

There is also some code from this article:

Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. 2014. Query expansion for mixed-script information retrieval. In *Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval* (SIGIR '14). ACM, New York, NY, USA, 677-686. DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622
Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search
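Tom's point about the Greek stemmer effectively ignoring out-of-script tokens can also be made explicit. The following is a toy sketch (mine, not the SimplePolyGlotStemmingTokenFilter mentioned elsewhere in the thread), assuming lucene-analyzers-icu and ICU4J are on the classpath: it reads the ScriptAttribute that ICUTokenizer populates and marks every non-Greek token as a keyword, so a downstream GreekStemFilter leaves those tokens untouched.

import java.io.IOException;
import com.ibm.icu.lang.UScript;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

/**
 * Marks every token whose script (as detected by ICUTokenizer) is not Greek
 * as a keyword. Stem filters such as GreekStemFilter honor KeywordAttribute
 * and pass such tokens through unchanged.
 */
public final class GreekScriptGateFilter extends TokenFilter {
    private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
    private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

    public GreekScriptGateFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (scriptAtt.getCode() != UScript.GREEK) {
            keywordAtt.setKeyword(true); // downstream stemmer will skip this token
        }
        return true;
    }
}

A chain would then be ICUTokenizer, GreekLowerCaseFilter, this gate, then GreekStemFilter. The gate-per-script idea only helps for scripts used by a single language; for the Latin script it cannot tell you which stemmer to apply, which is Tom's larger point.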
RE: How to implement multilingual word components fields schema?
Agree with the approach Jack suggested: index the same source text in multiple fields, one per language, and then do a dismax query across them. Would love to hear whether it works for you.

Thanks,
Susheel
Re: How to implement multilingual word components fields schema?
It comes down to how you personally want to value compromises between conflicting requirements, such as the relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases you care most about, for example field values that have snippets in one language embedded within larger values in a different language. Also consider whether your fields are always long or sometimes short: the former can work well for language detection, but the latter cannot, unless all fields of a given document are always in the same language.

Otherwise, simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields.

-- Jack Krupansky
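A minimal SolrJ sketch of the multi-field plus dismax idea Jack describes (the field names text_en, text_de, text_fr and text_general are hypothetical per-language copies of the same source text, and the boosts are arbitrary):

import org.apache.solr.client.solrj.SolrQuery;

public class MultiLangQuery {
    static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        // Search every per-language copy; the best-matching analysis chain
        // contributes the highest-scoring match for each document.
        q.set("qf", "text_en^1.0 text_de^1.0 text_fr^1.0 text_general^0.5");
        return q;
    }
}

Each per-language field also keeps its own term statistics, which is the IDF advantage Paul mentions elsewhere in the thread.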
How to implement multilingual word components fields schema?
Hello.

We have documents with multilingual words that consist of parts from different languages, and search queries of the same complexity. It is an online application used worldwide, so users generate content in all possible world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers. Luckily, the ICUTokenizer and ICUFoldingFilter exist for that. But then it requires stemming and lemmatization.

How can we implement a schema with universal stemming/lemmatization which would probably utilize the ICU-generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the schema of Basistech's commercial plugins, and it defines the tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.
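To make the mixed-script examples concrete, here is a small sketch (mine, assuming the lucene-analyzers-icu module and the same Lucene 4.x-style Analyzer API used in the sketches above) that prints each token of "言語-aware" together with the script name ICUTokenizer assigns to it. The exact segmentation of the CJK part is not verified in this thread, so treat the expected output as approximate:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MixedScriptDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer tok = new ICUTokenizer(reader); // splits at script boundaries and the hyphen
                return new TokenStreamComponents(tok);
            }
        };
        try (TokenStream ts = analyzer.tokenStream("text", "言語-aware")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ScriptAttribute script = ts.addAttribute(ScriptAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Expected output, roughly: the CJK token(s) tagged "Han" and "aware" tagged "Latin"
                System.out.println(term.toString() + " " + script.getName());
            }
            ts.end();
        }
    }
}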