[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Yonik Seeley (JIRA) Sun, 05 Dec 2010 12:57:39 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967016#action_12967016
 ]


Yonik Seeley commented on SOLR-1979:
------------------------------------

bq. The new field is made by concatenating the original field name with "_" + 
the ISO 639 code. 

This could be problematic given a large set of language codes, since the 
generated field names could collide with existing dynamic field definitions.
Perhaps include something with "text" in the name as well?

Perhaps fieldName_${langCode}Text

Examples:
name_enText
name_frText

It would probably also be nice to be able to map a number of languages to a 
single field. Say you have a single analyzer that can handle CJK: you may 
then want that whole collection of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle 
it differently, so there should also be an optional catchall that handles all 
languages not specifically mapped.
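
A rough sketch of that mapping, to make the three cases concrete. The class 
LangFieldMapper, its method fieldFor, and the catchall suffix used in the 
example below are all invented for illustration; none of them come from the 
attached patch, and this is only one way the lookup order could go.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of the SOLR-1979 patch.
final class LangFieldMapper {
    // Languages that get their own field: fieldName_<code>Text
    private final Set<String> ownFieldLangs;
    // Language code -> shared suffix, e.g. "ja" -> "cjk"
    private final Map<String, String> sharedSuffix;
    // Suffix for every other language; null means no catchall is configured
    private final String catchAllSuffix;

    LangFieldMapper(Set<String> ownFieldLangs,
                    Map<String, String> sharedSuffix,
                    String catchAllSuffix) {
        this.ownFieldLangs = Set.copyOf(ownFieldLangs);
        this.sharedSuffix = Map.copyOf(sharedSuffix);
        this.catchAllSuffix = catchAllSuffix;
    }

    // Returns the target field, or empty if the language is unmapped
    // and no catchall was configured.
    Optional<String> fieldFor(String fieldName, String langCode) {
        // 1. Several languages may share one field (e.g. one CJK analyzer).
        String shared = sharedSuffix.get(langCode);
        if (shared != null) {
            return Optional.of(fieldName + "_" + shared);
        }
        // 2. Individually handled languages use fieldName_${langCode}Text.
        if (ownFieldLangs.contains(langCode)) {
            return Optional.of(fieldName + "_" + langCode + "Text");
        }
        // 3. Everything else goes to the optional catchall.
        return Optional.ofNullable(catchAllSuffix)
                       .map(suffix -> fieldName + "_" + suffix);
    }
}
```

With en and fr given their own fields, zh/ja/ko grouped under "cjk", and a 
catchall suffix of "general" (an arbitrary name), the field "name" would map 
to name_enText, name_frText, name_cjk, and name_general respectively.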




> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to 
> act upon it, such as indexing the content into language-aware fields. 
> Another use case is to be able to filter/facet on language for arbitrary 
> unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform 
> language identification, and write the ISO code for the detected language to 
> the outputField. If no language is detected, the fallback language is used.
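
The flow described in the issue can be sketched roughly as follows. This is 
only an illustration under stated assumptions: the class LanguageTagger is 
invented, the document is modelled as a plain Map rather than a Solr input 
document, and the detector is a stand-in function rather than the Tika 
LanguageIdentifier. The idField parameter is left out of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for the processor logic, not the patch itself.
final class LanguageTagger {
    private final List<String> inputFields;
    private final String outputField;
    private final String fallback;
    // Placeholder for a real language detector: returns an ISO code,
    // or empty when no language could be identified.
    private final Function<String, Optional<String>> detector;

    LanguageTagger(List<String> inputFields, String outputField, String fallback,
                   Function<String, Optional<String>> detector) {
        this.inputFields = List.copyOf(inputFields);
        this.outputField = outputField;
        this.fallback = fallback;
        this.detector = detector;
    }

    // Returns a copy of the document with the language code added.
    Map<String, Object> process(Map<String, Object> doc) {
        // Concatenate the text of all configured input fields.
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        // Use the detected code, or the fallback if nothing was detected.
        String lang = detector.apply(text.toString().trim()).orElse(fallback);
        Map<String, Object> result = new LinkedHashMap<>(doc);
        result.put(outputField, lang);
        return result;
    }
}
```

Constructed with the values from the configuration above (inputFields 
name and subject, outputField language_s, fallback en), process would add a 
language_s entry to each document, falling back to en whenever the detector 
returns nothing.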

