[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

Tommaso Teofili (JIRA) Tue, 07 Dec 2010 00:54:38 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968633#action_12968633
 ]


Tommaso Teofili commented on SOLR-1979:
---------------------------------------

bq. However, have you considered extending the document model to allow metadata 
per field? Then @language would be a valid field metadata, mostly as a means 
for later processing to pick up and act on. This can be a valuable mechanism 
for other inter processor communication as well as to pass info between 
document centric processing and Analysis.

I've also considered this option, and it sounds reasonable, but I think it 
would be a very large change to the API. So on the one hand I like the idea, 
but on the other I think it could lead to a proliferation of @metadata.
In the end I don't have a strong opinion on it, though I should add that I've 
seen such customizations in a production environment, made to support 
per-field metadata.

Regarding per-field and per-document language fields, I think a document 
language field could be handled with two fixed strategies/policies (which 
could also be extended):
# restrictive strategy - if the fields map to different languages, set the 
document language to a placeholder value, for example "x-unspecified"
# simple strategy - map all the languages detected (per field) into the 
document language field as separate values (so multiValued="true")
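
To make the two strategies concrete, here is a minimal Java sketch. It is not 
taken from the attached patch: the class name DocumentLanguagePolicy and its 
methods are hypothetical, and it assumes the per-field languages have already 
been detected and are passed in as plain ISO code strings. It also assumes 
that a document with no detected languages is treated as "x-unspecified" by 
the restrictive strategy.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Solr or of the SOLR-1979 patch.
class DocumentLanguagePolicy {

    // Placeholder value used when the per-field languages disagree.
    static final String UNSPECIFIED = "x-unspecified";

    // Restrictive strategy: the document language field gets exactly one value.
    // If all fields agree, that language is used; otherwise (including the
    // assumed case of no detected language) the placeholder is returned.
    static List<String> restrictive(Collection<String> fieldLanguages) {
        Set<String> distinct = new LinkedHashSet<>(fieldLanguages);
        if (distinct.size() == 1) {
            return new ArrayList<>(distinct);
        }
        return Collections.singletonList(UNSPECIFIED);
    }

    // Simple strategy: every distinct per-field language becomes one value of a
    // multivalued document language field, kept in first-seen order.
    static List<String> simple(Collection<String> fieldLanguages) {
        return new ArrayList<>(new LinkedHashSet<>(fieldLanguages));
    }
}
```

With per-field results of "en" for name and "it" for subject, the restrictive 
strategy would yield a single "x-unspecified" value, while the simple strategy 
would yield the two values "en" and "it".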




> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act 
> on it, for example by indexing the content into language-aware fields. Another 
> use case is filtering/faceting by language on arbitrary unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configured like this:
> {code:xml} 
>   <processor 
> class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform 
> language identification, and write the ISO code of the detected language to 
> the outputField. If no language is detected, the fallback language is used.
