[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1979:
----------------------------------

    Attachment: SOLR-1979.patch

I took Jan's and Tommaso's patches and reworked them a bit.  It seems to me 
that there isn't much point in merely identifying the language if you aren't 
going to do something about it.  So, this patch builds on what Jan and Tommaso 
did and then remaps the input fields to new per-language fields (note, we 
could make this optional).  I also tried to standardize the input parameters a 
bit.  I dropped the outputField setting and a number of other settings, and I 
made language detection per input field.  The basic gist is that if you input 
two fields, name and subject, it will detect the language of each field and 
then attempt to map each one to a new field.  The new field name is made by 
concatenating the original field name with "_" + the ISO 639 code.  For 
example, if en is the detected language, then the new field for name would be 
name_en.  If that field doesn't exist, it will fall back to the original field 
(i.e. name).
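
To make the mapping rule concrete, here is a minimal sketch of the naming and 
fallback logic only.  It is not the code from the patch: the class name 
LanguageFieldMapper, the method mapFieldName, and the knownFields set (standing 
in for whatever check decides that a field "exists") are all hypothetical, and 
treating a null or empty language code as "keep the original field" is an 
assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the mapping rule; not taken from the patch.
class LanguageFieldMapper {

    // Returns "<field>_<langCode>" when that field is known,
    // otherwise falls back to the original field name.
    static String mapFieldName(String field, String langCode, Set<String> knownFields) {
        // Assumption: with no detected language, keep the original field.
        if (langCode == null || langCode.isEmpty()) {
            return field;
        }
        String candidate = field + "_" + langCode;
        return knownFields.contains(candidate) ? candidate : field;
    }

    public static void main(String[] args) {
        // Stand-in for the set of fields that exist; only name_en is language specific.
        Set<String> knownFields = new HashSet<>(Arrays.asList("name", "subject", "name_en"));

        System.out.println(mapFieldName("name", "en", knownFields));    // prints name_en
        System.out.println(mapFieldName("subject", "en", knownFields)); // prints subject
    }
}
```

Here name is remapped because name_en is present, while subject stays as it is 
because there is no subject_en to map to.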

Left to do:
# Fix the tests.  I don't like how we currently test UpdateProcessorChains.  
It should not require writing your own little piece of update mechanism.  You 
should be able to simply set up the appropriate configuration, hook it into an 
update handler and then hit that update handler.  
# Need to check the license headers, builds, etc.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect the language of some random text in order to 
> act upon it, such as indexing the content into language-aware fields. Another 
> use case is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor 
> class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, the fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

