[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Grant Ingersoll (JIRA) Mon, 06 Dec 2010 17:08:34 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968528#action_12968528
 ]


Grant Ingersoll commented on SOLR-1979:
---------------------------------------

bq. So for all unmapped languages, you may want to map to a single generic 
field, or not map at all (leave field as is).

The current patch leaves content in an unmapped language in its original field.

bq. Also, if there are multiple input fields, the current patch would create 
multiple language field values requiring that field to be multi-valued. Is the 
goal here to identify a single language for a document? Or a separate language 
value for each of the input fields (which seems odd to me)?

The current patch requires a multivalued language field.  I figure the main 
thing you want the language field for is faceting and filtering, but that can 
be changed.  As for the broader goal, I think it makes sense to detect 
languages per field rather than per document.  In other words, a single 
document can contain multiple languages.
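
To illustrate the per-field approach described above, here is a minimal sketch, not the code from the patch. The class name, method names, and the toy detector are all hypothetical; a real processor would presumably delegate to the Tika LanguageIdentifier rather than the stand-in used here. It detects a language for each input field separately and collects the distinct codes as the values of a multivalued language field.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of per-field language detection; not the SOLR-1979 patch.
final class PerFieldLanguageSketch {

    // Runs the detector on each input field separately and returns
    // field name -> language code. Fields with no text or no detected
    // language are skipped, so their content stays as it is.
    static Map<String, String> detectPerField(
            Map<String, String> doc,
            Iterable<String> inputFields,
            Function<String, Optional<String>> detector) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : inputFields) {
            String text = doc.get(field);
            if (text == null || text.isEmpty()) {
                continue;
            }
            detector.apply(text).ifPresent(lang -> result.put(field, lang));
        }
        return result;
    }

    // Distinct language codes, i.e. the values a multivalued language
    // field would hold for faceting and filtering.
    static Set<String> languageFieldValues(Map<String, String> perField) {
        return new LinkedHashSet<>(perField.values());
    }

    // Toy stand-in for a real detector (assumption): it only looks for a
    // few stop words and knows nothing beyond "en" and "de".
    static Optional<String> toyDetector(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT) + " ";
        if (padded.contains(" the ")) {
            return Optional.of("en");
        }
        if (padded.contains(" der ") || padded.contains(" und ")) {
            return Optional.of("de");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "The quick brown fox");
        doc.put("subject", "Der Hund und die Katze");
        Map<String, String> perField = detectPerField(
                doc, Arrays.asList("name", "subject"),
                PerFieldLanguageSketch::toyDetector);
        System.out.println(perField);
        System.out.println(languageFieldValues(perField));
    }
}
```

Keeping the per-field map separate from the set of language values means the same detection pass could drive both field mapping and the faceting field.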

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act 
> upon it, such as indexing the content into language-aware fields. Another 
> use case is to be able to filter/facet on language for unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language is detected, the fallback language is used.
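
The behaviour in the issue description can be sketched roughly as follows. This is an assumption-laden illustration, not the processor itself: the class and method names are hypothetical, the detector is passed in as a plain function instead of calling Tika, and joining the input fields into one text before detection is a guess at one possible reading of the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the described config semantics; not the real processor.
final class DocumentLanguageSketch {

    // Joins the text of the input fields (an assumption about how the
    // fields are combined), runs the detector once, and falls back to
    // the configured language when nothing is detected.
    static String identify(
            Map<String, String> doc,
            List<String> inputFields,
            String fallback,
            Function<String, Optional<String>> detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value != null) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(value);
            }
        }
        return detector.apply(text.toString()).orElse(fallback);
    }

    // Writes the identified language code into the output field.
    static void process(
            Map<String, String> doc,
            List<String> inputFields,
            String outputField,
            String fallback,
            Function<String, Optional<String>> detector) {
        doc.put(outputField, identify(doc, inputFields, fallback, detector));
    }

    public static void main(String[] args) {
        // Stand-in detector (assumption): never detects anything, so the
        // fallback value from the config example is used.
        Function<String, Optional<String>> noDetection = t -> Optional.<String>empty();
        Map<String, String> doc = new HashMap<>();
        doc.put("id", "1");
        doc.put("name", "12345");
        process(doc, Arrays.asList("name", "subject"), "language_s", "en", noDetection);
        System.out.println(doc.get("language_s"));
    }
}
```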

