[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
------------------------------

    Attachment: SOLR-1979.patch

New version. Example of accepted params:

{code}
 <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
   <defaults>
     <str name="langid">true</str>
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <str name="langid.langsField">languages</str>
     <str name="langid.overwrite">false</str>
     <float name="langid.threshold">0.5</float>
     <str name="langid.whitelist">no,en,es,dk</str>
     <str name="langid.map">true</str>
     <str name="langid.map.fl">title,text</str>
     <bool name="langid.map.overwrite">false</bool>
     <bool name="langid.map.keepOrig">false</bool>
     <bool name="langid.map.individual">false</bool>
     <str name="langid.map.individual.fl"></str>
     <str name="langid.fallbackFields">meta_content_language,lang</str>
     <str name="langid.fallback">en</str>
   </defaults>
 </processor>
{code}

The only mandatory parameter is langid.fl.

To enable field name mapping, set langid.map=true. By default this maps every 
field listed in langid.fl. If the set of fields to map differs from langid.fl, 
supply langid.map.fl. The mapped fields are then renamed with a language suffix 
equal to the language detected from the langid.fl fields.
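
As an illustration, a mapping setup might look like the sketch below. It uses 
only parameters from the list above, with field names borrowed from that 
example. Language is detected from title, subject and text, but only title and 
text are renamed. The exact form of the suffix (for instance title_en for 
English) is an assumption, since it is not spelled out in this comment.

{code:xml}
 <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
   <defaults>
     <!-- Fields used to detect the document language (mandatory) -->
     <str name="langid.fl">title,subject,text</str>
     <!-- Turn on field name mapping -->
     <str name="langid.map">true</str>
     <!-- Rename only these fields; subject keeps its name -->
     <!-- Assumption: an English document would yield e.g. title_en, text_en -->
     <str name="langid.map.fl">title,text</str>
   </defaults>
 </processor>
{code}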

If you need to detect the language separately for each field, set 
langid.map.individual=true. The mapped fields are then renamed according to the 
language detected in each field individually. If the set of fields to detect 
individually differs from the already supplied langid.fl or langid.map.fl, 
supply langid.map.individual.fl. The fields listed in langid.map.individual.fl 
are then detected individually, while the remaining mapped fields are mapped 
according to the global document language.
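
A sketch of such a mixed setup, using only the parameters listed in the example 
at the top: title is detected and renamed on its own, while text follows the 
document-level language. The field names and the resulting suffixed names in 
the comments are illustrative assumptions, not output from the patch.

{code:xml}
 <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
   <defaults>
     <!-- Fields used to detect the global document language -->
     <str name="langid.fl">title,subject,text</str>
     <str name="langid.map">true</str>
     <!-- Fields to rename with a language suffix -->
     <str name="langid.map.fl">title,text</str>
     <!-- Detect language per field for the fields listed below -->
     <bool name="langid.map.individual">true</bool>
     <!-- Assumption: a Spanish title in an English document would give
          e.g. title_es and text_en -->
     <str name="langid.map.individual.fl">title</str>
   </defaults>
 </processor>
{code}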

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act 
> upon it, for example by indexing the content into language-aware fields. 
> Another use case is filtering or faceting by language on unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform 
> language identification, and output the ISO code for the detected language in 
> the outputField. If no language is detected, the fallback language is used.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



