[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968582#action_12968582 ]
Erik Hatcher commented on SOLR-1979:
------------------------------------
Oh, and don't get me wrong, I get the need for a multivalued language per
document too, here. Anyway, it'll be easy enough to add support for this to be
controlled through configuration. In single-language-per-doc mode, basically
concatenate all of the specified fields, detect on that, and map the result
into a single-valued language field. Language-per-field I get too, of
course... it just depends on the domain being modeled, and in my experience
I've seen apps designed both ways. Neither way is the one true way; it just
depends.
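To make the two modes concrete, here is a rough sketch in plain Java. It is not taken from the patch: the class name, method names, and the toy detector are all hypothetical stand-ins, and a real implementation would delegate to an actual language identifier rather than a keyword check.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the SOLR-1979 patch.
class LanguageModeSketch {

    // Single-language-per-doc mode: concatenate the specified fields,
    // detect once, and return one language code for the whole document.
    static String detectPerDocument(Map<String, String> doc, List<String> inputFields,
                                    Function<String, String> detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value != null) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(value);
            }
        }
        return detector.apply(text.toString());
    }

    // Language-per-field mode: detect each specified field on its own and
    // return a field-name to language-code mapping.
    static Map<String, String> detectPerField(Map<String, String> doc, List<String> inputFields,
                                              Function<String, String> detector) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value != null) {
                result.put(field, detector.apply(value));
            }
        }
        return result;
    }

    // Toy stand-in for a real detector (assumption): "de" if the text
    // contains the word "und", otherwise "en".
    static String toyDetector(String text) {
        String padded = " " + text.toLowerCase() + " ";
        return padded.contains(" und ") ? "de" : "en";
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "the quick brown fox");
        doc.put("subject", "Hund und Katze");
        List<String> fields = Arrays.asList("name", "subject");

        System.out.println(detectPerDocument(doc, fields, LanguageModeSketch::toyDetector));
        System.out.println(detectPerField(doc, fields, LanguageModeSketch::toyDetector));
    }
}
```

With a mixed-language document like the one in `main`, the per-document mode has to settle on one answer for the concatenated text, while the per-field mode keeps the two fields apart. That is the trade-off a configuration switch would expose.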
And of course Muir is smirking and saying "heck, you have multiple languages
within a field often too, so we need to account for that somehow too". But
probably not here, yet.
> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act
> upon it, such as indexing the content into language-aware fields. Another
> use case is to be able to filter/facet on language for arbitrary unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
> <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>   <str name="inputFields">name,subject</str>
>   <str name="outputField">language_s</str>
>   <str name="idField">id</str>
>   <str name="fallback">en</str>
> </processor>
> {code}
> It will then read the text from the inputFields (name and subject), perform
> language identification, and output the ISO code for the detected language in
> the outputField. If no language is detected, the fallback language is used.
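
The flow in the description above (read the input fields, identify the language, write the code to the output field, fall back if nothing is detected) might look roughly like this, independent of Solr and Tika. The class `LangIdFlowSketch` and the `Function`-based detector are assumptions made for illustration, not the patch's actual API. The `idField` parameter is left out because the description does not say how it is used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the described flow; not the code from the patch.
class LangIdFlowSketch {
    private final String[] inputFields;
    private final String outputField;
    private final String fallback;
    // Stand-in for a real language identifier (assumption): returns an
    // ISO code, or empty when no language could be detected.
    private final Function<String, Optional<String>> detector;

    LangIdFlowSketch(String inputFields, String outputField, String fallback,
                     Function<String, Optional<String>> detector) {
        // Mirrors the comma-separated "inputFields" value from the config.
        this.inputFields = inputFields.split(",");
        this.outputField = outputField;
        this.fallback = fallback;
        this.detector = detector;
    }

    // Reads the input fields, detects the language, and writes the result
    // (or the fallback) into the output field of the same document.
    void process(Map<String, Object> doc) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field.trim());
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        String language = detector.apply(text.toString().trim()).orElse(fallback);
        doc.put(outputField, language);
    }

    public static void main(String[] args) {
        // Toy detector: "no" if the text contains the word "og", else nothing.
        LangIdFlowSketch processor = new LangIdFlowSketch("name,subject", "language_s", "en",
                text -> text.contains(" og ") ? Optional.of("no") : Optional.<String>empty());

        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "1");
        doc.put("name", "Kaffe og kake");
        processor.process(doc);
        System.out.println(doc.get("language_s"));
    }
}
```

Modeling "no language detected" as an empty `Optional` keeps the fallback decision in one place, so a different detector can be swapped in without touching the field handling.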