Thanks Will,

This does look like a bug, and I couldn't find a Jira issue for it either. Feel free to create one.

Tomás

On Mon, Jan 23, 2017 at 10:37 PM, Will Martin <[email protected]> wrote:

> Hello,
>
> While using Solr 6.0.4 I noticed that the
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> has a bug in it where it does not respect the "langid.map.individual"
> parameter in solrconfig.xml. The documentation for langid.map.individual
> <https://wiki.apache.org/solr/LanguageDetection#langid.map.individual>
> specifies:
>
>> If you require detecting languages separately for each field, supply
>> langid.map.individual=true. The supplied fields will then be renamed
>> according to detected language on an individual field basis.
>
> However, when this parameter is set to "true" the fields are still mapped
> to the language code of the entire document. For example, with the
> following snippet from solrconfig.xml
>
> <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
>   <lst name="defaults">
>     <str name="langid.fl">title,text</str>
>     <str name="langid.langField">language_s</str>
>     <bool name="langid.map">true</bool>
>     <bool name="langid.map.individual">true</bool>
>   </lst>
> </processor>
>
> a document that takes the form
>
> {
>   "title": "This is an English title",
>   "text": "Pero el texto de este documento está en español."
> }
>
> will be turned into
>
> {
>   "title_es": "This is an English title",
>   "text_es": "Pero el texto de este documento está en español.",
>   "language_s": ["es"]
> }
>
> rather than
>
> {
>   "title_en": "This is an English title",
>   "text_es": "Pero el texto de este documento está en español.",
>   "language_s": ["es","en"]
> }
>
> during processing.
>
> This bug seems to have been introduced in SOLR-3881
> <https://issues.apache.org/jira/browse/SOLR-3881> when the abstract
> method (LangDetectLanguageIdentifierUpdateProcessor.java:52)
>
> protected List<DetectedLanguage> detectLanguage(String content)
>
> was changed to the signature
>
> protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)
>
> which does not allow one to recognize individual fields while performing
> language detection. As it stands, the entire document is analysed once per
> individual field (those included in the "langid.fl" or
> "langid.map.individual.fl" parameters) and each field is mapped to the
> language of the entire document.
>
> I searched the Apache Jira for a ticket tracking this bug but did not find
> anything that seemed related. Before filing a new ticket, I thought I would
> ping this mailing list to see if anyone knows about work relating to this
> issue or if there is already a ticket for it (perhaps not directly related
> to the term "langid.map.individual"). If not, I can go ahead and file the
> ticket.
>
> Thanks,
>
> -William Martin
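
For anyone following the thread, here is a rough sketch of the difference between mapping by document language and mapping by individual field language, as described in the report above. It does not use any Solr, langdetect, or Tika classes: `LangMapSketch`, `toyDetect`, `mapByDocument`, and `mapByField` are hypothetical names made up for this illustration, the "detector" is a toy word counter, and the Spanish sample text is shortened to ASCII. It is only meant to show why a detector that sees the whole document cannot produce per-field suffixes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Solr.
class LangMapSketch {

    // Toy stand-in for a language detector: counts a few common
    // Spanish and English words and returns "es" or "en".
    static String toyDetect(String content) {
        int es = 0;
        int en = 0;
        for (String w : content.toLowerCase().split("\\s+")) {
            if (w.equals("pero") || w.equals("el") || w.equals("texto")
                    || w.equals("de") || w.equals("este")) {
                es++;
            }
            if (w.equals("this") || w.equals("is") || w.equals("an")
                    || w.equals("the")) {
                en++;
            }
        }
        return es > en ? "es" : "en";
    }

    // Reported behavior: detection runs over the whole document, so every
    // field is renamed with the same document-level language code.
    static Map<String, String> mapByDocument(Map<String, String> doc) {
        String lang = toyDetect(String.join(" ", doc.values()));
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey() + "_" + lang, e.getValue());
        }
        return out;
    }

    // Documented behavior for langid.map.individual=true: detection runs on
    // each field's own content, so each field gets its own language code.
    static Map<String, String> mapByField(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey() + "_" + toyDetect(e.getValue()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        doc.put("text", "Pero el texto de este documento");

        System.out.println(mapByDocument(doc).keySet()); // [title_es, text_es]
        System.out.println(mapByField(doc).keySet());    // [title_en, text_es]
    }
}
```

The point of the sketch is that `mapByField` needs a detector that accepts a single field's content (as the old `detectLanguage(String content)` signature did), whereas a detector that only receives the whole document can only ever yield the first result.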
