Re: Language Detection Individual Field Mapping Bug

2017-01-27 Thread Tomás Fernández Löbbe
Thanks Will,
This does look like a bug and I also couldn't find a Jira issue for it.
Feel free to create one.

Tomás

On Mon, Jan 23, 2017 at 10:37 PM, Will Martin wrote:

> Hello,
>
> While using Solr 6.0.4 I noticed that
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> has a bug in it where it does not respect the "langid.map.individual"
> parameter in solrconfig.xml. The documentation for langid.map.individual
> specifies:
>
>> If you require detecting languages separately for each field, supply
>> langid.map.individual=true. The supplied fields will then be renamed
>> according to detected language on an individual field basis.
>
> However, when this parameter is set to "true" the fields are still mapped
> to the language code of the entire document. For example, with the
> following snippet from solrconfig.xml
>
>  class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
>
>  title,text
>  language_s
>  true
>  true
>
>
> a document that takes the form
>
> {
>   "title": "This is an English title",
>   "text": "Pero el texto de este documento está en español."
> }
>
> will be turned into
>
> {
>   "title_es": "This is an English title",
>   "text_es": "Pero el texto de este documento está en español.",
>   "language_s": ["es"]
> }
>
> rather than
>
> {
>   "title_en": "This is an English title",
>   "text_es": "Pero el texto de este documento está en español.",
>   "language_s": ["es","en"]
> }
>
> during processing.
>
> This bug seems to have been introduced in SOLR-3881, when the abstract
> method (LangDetectLanguageIdentifierUpdateProcessor.java:52)
>
> protected List<DetectedLanguage> detectLanguage(String content)
>
> was changed to the signature
>
> protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)
>
> which does not allow one to recognize individual fields while performing
> language detection. As it stands, the entire document is analysed once per
> individual field (listed in the "langid.fl" or "langid.map.individual.fl"
> parameters) and each field is mapped to the language of the entire
> document.
>
> I searched the Apache Jira for a ticket tracking this bug but did not find
> anything that seemed related. I thought before filing a new ticket I would
> ping this mailing list to see if anyone knows about work relating to this
> issue or if there is already a ticket for it (not directly related to the
> term "langid.map.individual" perhaps). If not I can go ahead and file the
> ticket.
>
>
> Thanks,
>
> -William Martin
>


Language Detection Individual Field Mapping Bug

2017-01-23 Thread Will Martin
Hello,

While using Solr 6.0.4 I noticed that the
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
has a bug in it where it does not respect the "langid.map.individual"
parameter in solrconfig.xml. The documentation for langid.map.individual
specifies:

> If you require detecting languages separately for each field, supply
> langid.map.individual=true. The supplied fields will then be renamed
> according to detected language on an individual field basis.

However, when this parameter is set to "true" the fields are still mapped to
the language code of the entire document. For example, with the following
snippet from solrconfig.xml


   
 title,text
 language_s
 true
 true
   

a document that takes the form

{
  "title": "This is an English title",
  "text": "Pero el texto de este documento está en español."
}

will be turned into

{
  "title_es": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es"]
}

rather than

{
  "title_en": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es","en"]
}

during processing.

This bug seems to have been introduced in SOLR-3881, when the abstract method
(LangDetectLanguageIdentifierUpdateProcessor.java:52)

protected List<DetectedLanguage> detectLanguage(String content)

was changed to the signature

protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)

which does not allow one to recognize individual fields while performing
language detection. As it stands, the entire document is analysed once per
individual field (listed in the "langid.fl" or "langid.map.individual.fl"
parameters) and each field is mapped to the language of the entire document.
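
To make the difference concrete, here is a minimal, self-contained sketch of
the two behaviours. It is only an illustration and does not use any Solr
classes: the names IndividualFieldMappingSketch, toyDetect, mapByDocument and
mapByField are hypothetical, and the keyword-based toyDetect is just a
stand-in for a real language detector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustration only: none of these names exist in Solr.
class IndividualFieldMappingSketch {

    // Hypothetical stand-in for a real detector: returns "es" if the text
    // contains one of a few common Spanish words, otherwise "en".
    static String toyDetect(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        for (String marker : new String[] {" el ", " de ", " este "}) {
            if (padded.contains(marker)) {
                return "es";
            }
        }
        return "en";
    }

    // Behaviour reported in this thread: one language is detected for the
    // concatenated fields and every field is renamed with that language.
    static Map<String, Object> mapByDocument(Map<String, String> doc, List<String> fields) {
        StringBuilder all = new StringBuilder();
        for (String field : fields) {
            if (doc.get(field) != null) {
                all.append(doc.get(field)).append(' ');
            }
        }
        String lang = toyDetect(all.toString());
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : fields) {
            if (doc.get(field) != null) {
                out.put(field + "_" + lang, doc.get(field));
            }
        }
        out.put("language_s", List.of(lang));
        return out;
    }

    // Behaviour the documentation describes for langid.map.individual=true:
    // each field is detected on its own and renamed with its own language.
    static Map<String, Object> mapByField(Map<String, String> doc, List<String> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        Set<String> langs = new LinkedHashSet<>();
        for (String field : fields) {
            String value = doc.get(field);
            if (value == null) {
                continue;
            }
            String lang = toyDetect(value);
            out.put(field + "_" + lang, value);
            langs.add(lang);
        }
        out.put("language_s", new ArrayList<>(langs));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        // Unicode escapes keep the source ASCII-only.
        doc.put("text", "Pero el texto de este documento est\u00e1 en espa\u00f1ol.");
        List<String> fields = List.of("title", "text");
        System.out.println("whole document: " + mapByDocument(doc, fields));
        System.out.println("per field:      " + mapByField(doc, fields));
    }
}
```

In this sketch the order of the values in language_s simply follows the
order of the fields; it is not meant to say anything about the order Solr
would produce.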

I searched the Apache Jira for a ticket tracking this bug but did not find
anything that seemed related. I thought before filing a new ticket I would
ping this mailing list to see if anyone knows about work relating to this
issue or if there is already a ticket for it (not directly related to the
term "langid.map.individual" perhaps). If not I can go ahead and file the
ticket.


Thanks,

-William Martin