[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Yonik Seeley (JIRA) Mon, 06 Dec 2010 14:15:37 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968445#action_12968445
 ]


Yonik Seeley commented on SOLR-1979:
------------------------------------

bq. In skimming the current patch, it looks like fields get mapped no matter 
what. What if I just want the language detected and added as another field, but 
no field mapping desired?

Yeah, that's sort of in line with my earlier comment:
bq. And just because you can detect a language doesn't mean you know how to 
handle it differently... so also have an optional catchall that handles all 
languages not specifically mapped.

So for all unmapped languages, you may want to map to a single generic field, 
or not map at all (leave field as is).
I guess it also depends on the general strategy: if we are detecting the 
language on the "body" field, are we using a copyField-type approach (storing 
only the "body" field while indexing its content as body_enText), or are we 
moving the field from "body" to "body_enText"?
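
To make the options above concrete, here is a minimal sketch in plain Java (no Solr classes). It is not taken from the patch: the class name LanguageFieldMapper, the Mode enum, and the catch-all suffix are hypothetical names, and the "<field>_<lang>Text" pattern is only inferred from the body_enText example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the SOLR-1979 patch.
public class LanguageFieldMapper {
    // COPY keeps the source field (copyField-like); MOVE renames it.
    public enum Mode { COPY, MOVE }

    private final Set<String> mappedLanguages; // languages with a dedicated field
    private final String catchAllSuffix;       // null means "leave field as is"
    private final Mode mode;

    public LanguageFieldMapper(Set<String> mappedLanguages, String catchAllSuffix, Mode mode) {
        this.mappedLanguages = mappedLanguages;
        this.catchAllSuffix = catchAllSuffix;
        this.mode = mode;
    }

    // Returns a new document map; the input map is not modified.
    public Map<String, Object> map(Map<String, Object> doc, String field, String lang) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        Object value = out.get(field);
        if (value == null) {
            return out; // nothing to map
        }
        String suffix;
        if (mappedLanguages.contains(lang)) {
            suffix = lang + "Text";        // assumed naming, e.g. body_enText
        } else if (catchAllSuffix != null) {
            suffix = catchAllSuffix;       // single generic field for unmapped languages
        } else {
            return out;                    // unmapped and no catch-all: leave as is
        }
        if (mode == Mode.MOVE) {
            out.remove(field);
        }
        out.put(field + "_" + suffix, value);
        return out;
    }
}
```

With mapped languages {en, de}, no catch-all, and Mode.MOVE, a document whose "body" is detected as "en" ends up with only body_enText, while one detected as "fr" is left untouched.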

bq. Also, if there are multiple input fields, the current patch would create 
multiple language field values requiring that field to be multi-valued. Is the 
goal here to identify a single language for a document?

I could see both making sense.
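
A rough sketch of the two behaviours, again in plain Java and not from the patch. The detector is passed in as a hypothetical Function<String, String> that returns an ISO code or null when nothing is detected (it stands in for whatever identifier is wrapped); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of the SOLR-1979 patch.
public class LanguageDetectionScope {

    // One language value per input field: the output field must be multi-valued.
    public static List<String> detectPerField(Map<String, String> doc,
                                              List<String> inputFields,
                                              Function<String, String> detector,
                                              String fallback) {
        List<String> languages = new ArrayList<>();
        for (String field : inputFields) {
            String text = doc.get(field);
            if (text == null) {
                continue; // skip missing fields
            }
            String lang = detector.apply(text);
            languages.add(lang != null ? lang : fallback);
        }
        return languages;
    }

    // One language per document: detect on the concatenated input fields.
    public static String detectSingle(Map<String, String> doc,
                                      List<String> inputFields,
                                      Function<String, String> detector,
                                      String fallback) {
        StringBuilder combined = new StringBuilder();
        for (String field : inputFields) {
            String text = doc.get(field);
            if (text != null) {
                combined.append(text).append(' ');
            }
        }
        String lang = detector.apply(combined.toString());
        return lang != null ? lang : fallback;
    }
}
```

Concatenating the inputs is only one way to arrive at a single language; picking the most frequent per-field result would be another.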

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act 
> upon it, such as indexing the content into language-aware fields. Another 
> use case is to be able to filter/facet on language for arbitrary unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform 
> language identification, and output the ISO code for the detected language in 
> the outputField. If no language is detected, the fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


