[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076259#comment-13076259
 ] 

Jan Høydahl commented on SOLR-1979:
-----------------------------------

This has been tested on a real dataset of several hundred thousand docs, 
including HTML, office docs and multiple other formats, and it works well.

I'd like some more pairs of eyes on this, however.

One thing which is less than perfect is that the threshold conversion from Tika 
currently parses the (internal) distance value out of a String, for lack of a 
getDistance() method (TIKA-568). This is a bit of a hack, but I argue it's a 
beneficial one, since we can now configure langid.threshold to something 
meaningful for our own data instead of relying on the preset binary 
isReasonablyCertain(). Since we also normalize to a value between 0 and 1, we 
abstract away the Tika implementation detail and are free to use any improved 
distance measures from Tika in the future, e.g. as a result of TIKA-369, or 
even plug in a non-Tika identifier or a hybrid solution.
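
To make the idea concrete, here is a rough sketch of that conversion. It is 
not the code from the patch. It assumes the identifier's String form looks 
like "en (0.02)", and the class name, the maxDistance scale and the linear 
mapping are all made up for illustration; only the notion of a langid.threshold 
between 0 and 1 comes from the comment above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistanceToCertainty {
    // Assumption: the String form ends with the distance in parentheses,
    // e.g. "en (0.02)" or "no (1.0E-4)".
    private static final Pattern DISTANCE =
        Pattern.compile("\\(([0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\\)\\s*$");

    /** Pulls the distance value out of the identifier's String form. */
    static double parseDistance(String identifierString) {
        Matcher m = DISTANCE.matcher(identifierString);
        if (!m.find()) {
            throw new IllegalArgumentException(
                "No distance found in: " + identifierString);
        }
        return Double.parseDouble(m.group(1));
    }

    /**
     * Maps a distance (smaller is better) to a certainty in [0, 1]
     * (larger is better). maxDistance is a hypothetical tuning value:
     * any distance at or above it yields certainty 0.
     */
    static double certainty(double distance, double maxDistance) {
        double c = 1.0 - distance / maxDistance;
        return Math.max(0.0, Math.min(1.0, c));
    }

    /** True if the detection is certain enough for the given threshold. */
    static boolean accept(String identifierString, double maxDistance,
                          double threshold) {
        return certainty(parseDistance(identifierString), maxDistance)
            >= threshold;
    }

    public static void main(String[] args) {
        String detected = "en (0.02)";   // stand-in for identifier.toString()
        double maxDistance = 0.2;        // hypothetical scale
        double threshold = 0.5;          // would come from langid.threshold
        System.out.println("certainty = "
            + certainty(parseDistance(detected), maxDistance));
        System.out.println("accepted  = "
            + accept(detected, maxDistance, threshold));
    }
}
```

Because callers only ever see the normalized value, the parsing step could 
later be swapped for a real getDistance() call, or for another identifier 
entirely, without touching the configured threshold.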

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: UpdateProcessor
>             Fix For: 3.4
>
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
