[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Grant Ingersoll (JIRA) Mon, 06 Dec 2010 05:50:39 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967186#action_12967186
 ]


Grant Ingersoll commented on SOLR-1979:
---------------------------------------

bq.  but in solr, when designing up front, i was just saying we shouldn't limit 
any abstract portion to 639-1 when another implementation might support 3066 or 
BCP47... we should make sure we allow that.

Agreed.  The only thing we are doing now is using the language that the language 
detector returns as part of the field name.  Both of these steps are easily 
overridable.  Both also rely on those fields existing.
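
To make that concrete, here is a rough sketch of the mapping, not code from the patch.  The class name LanguageFieldMapper, its methods, the "_" separator, and the fallback behaviour are all assumptions made up for illustration.  The two protected methods are the overridable steps, and the set of declared fields stands in for whatever the schema actually defines.

```java
import java.util.Set;

/**
 * Hypothetical helper (not part of the SOLR-1979 patch): turns a base field
 * name plus a detected language code into a language-specific field name.
 */
class LanguageFieldMapper {
    private final Set<String> declaredFields; // stand-in for fields known to the schema
    private final String fallbackLanguage;    // used when detection yields nothing usable

    LanguageFieldMapper(Set<String> declaredFields, String fallbackLanguage) {
        this.declaredFields = declaredFields;
        this.fallbackLanguage = fallbackLanguage;
    }

    /** Overridable step 1: accept whatever code the detector returns (639-1, BCP 47, ...). */
    protected String normalize(String detected) {
        if (detected == null || detected.trim().isEmpty()) {
            return fallbackLanguage;
        }
        return detected.trim();
    }

    /** Overridable step 2: use the language code as part of the field name. */
    protected String fieldNameFor(String baseField, String language) {
        return baseField + "_" + language;
    }

    /** Returns the per-language field, falling back when that field is not declared. */
    String map(String baseField, String detected) {
        String candidate = fieldNameFor(baseField, normalize(detected));
        if (declaredFields.contains(candidate)) {
            return candidate;
        }
        String fallback = fieldNameFor(baseField, fallbackLanguage);
        if (declaredFields.contains(fallback)) {
            return fallback;
        }
        throw new IllegalStateException("No declared field for " + candidate + " or " + fallback);
    }
}
```

With "name_en" and "name_de" declared and "en" as the fallback, map("name", "de") gives "name_de", while an undeclared language such as "fr" lands in "name_en".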

bq. This could be problematic given a large set of language codes since they 
could collide with existing dynamic field definitions.

Yonik, I wasn't planning on relying on dynamic fields necessarily.  It may make 
sense to have users predeclare the variations.

All in all, I would like to see Solr have better support for languages in both 
the schema and the config.  In my experience, in apps that have to support a 
lot of languages, there is a lot of redundancy in both the schema and the 
config.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act 
> upon it, such as indexing the content into language-aware fields. Another 
> use case is to be able to filter/facet on language for unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml}
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code}
> It will then read the text from the inputFields name and subject, perform 
> language identification, and output the ISO code for the detected language in 
> the outputField. If no language is detected, the fallback language is used.
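
The flow described above can be sketched roughly as follows.  This is an illustration only, not the attached patch: the class LanguageIdSketch is hypothetical, the document is modelled as a plain Map rather than a Solr input document, and the detector is an assumed function that returns a language code or null (it is not the Tika LanguageIdentifier API).  The idField setting is left out because the description does not say how it is used.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the described flow: read input fields, detect, write output field. */
class LanguageIdSketch {
    private final List<String> inputFields;          // e.g. name, subject
    private final String outputField;                // e.g. language_s
    private final String fallback;                   // e.g. en
    private final Function<String, String> detector; // assumed: returns a code, or null if unsure

    LanguageIdSketch(List<String> inputFields, String outputField,
                     String fallback, Function<String, String> detector) {
        this.inputFields = inputFields;
        this.outputField = outputField;
        this.fallback = fallback;
        this.detector = detector;
    }

    /** Concatenates the input fields, runs detection, and stores the result or the fallback. */
    void process(Map<String, Object> doc) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        // Skip detection entirely when none of the input fields had content.
        String detected = text.length() == 0 ? null : detector.apply(text.toString());
        doc.put(outputField, detected != null ? detected : fallback);
    }
}
```

Keeping the detector behind a plain function means a different implementation, including one that returns BCP 47 tags, could be swapped in without touching the rest of the flow.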

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


