[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966972#action_12966972 ]

Robert Muir commented on SOLR-1979:
-----------------------------------

bq. cause that distance measure is kind of an internal value, not very 
normalized and is bound to change in future versions of TIKA.

bq. we can make a new isReasonablyCertain() implementation taking into account 
the relative distance between first and second candidate languages...

I don't follow the logic: if the distance is not very normalized, then it seems 
like this approach doesn't tell you anything. Language 1 could be uncertain and 
language 2 completely uncertain, but the gap between them tells you nothing. 
Isn't it like trying to determine whether a good Lucene search result score is 
"certainly a hit", and not really the right way to go?

For example, consider the case where the language isn't supported at all by 
Tika (I don't see a list of supported languages anywhere, by the way!).
It would be good for us to know that the detection is uncertain in absolute 
terms; how uncertain it is relative to the next language is not very 
important.
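
To make the unsupported-language case concrete, here is a minimal sketch. It 
assumes a distance where lower means a closer match; the class name, method 
names, thresholds and numbers are all made up for illustration and are not 
Tika's API or its real values.

```java
/** Hypothetical sketch: compares a relative and an absolute certainty test. */
final class CertaintyCheck {

    /** Relative test: the best candidate is clearly ahead of the runner-up. */
    static boolean relativelyCertain(double best, double second, double minGap) {
        return second - best >= minGap;
    }

    /** Absolute test: the best candidate is itself a close match. */
    static boolean absolutelyCertain(double best, double maxDistance) {
        return best <= maxDistance;
    }

    public static void main(String[] args) {
        // Made-up distances for a text in a language with no profile at all:
        // both candidates are poor matches, yet they are far apart.
        double best = 0.60;
        double second = 0.90;
        System.out.println("relative: " + relativelyCertain(best, second, 0.10));
        System.out.println("absolute: " + absolutelyCertain(best, 0.05));
    }
}
```

With these numbers the relative test passes even though the best candidate is 
a poor match, which is the problem described above.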

I think it's also important that we be able to get this uncertainty (or 
whatever we call it) in a way that is agnostic of the implementation.
For example, we should be able to somehow think of chaining detectors... 
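
A rough sketch of what chaining could look like, assuming every detector 
reports a confidence normalized to [0, 1]. The Detector interface, the Guess 
class and the threshold parameter are hypothetical names for illustration; 
nothing like this exists in Tika or Solr as far as this issue shows.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of implementation-agnostic detector chaining. */
final class DetectorChain {

    /** A language code plus a confidence normalized to [0, 1]. */
    static final class Guess {
        final String language;
        final double confidence;

        Guess(String language, double confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    /** Any detector: may decline to answer by returning an empty Optional. */
    interface Detector {
        Optional<Guess> detect(String text);
    }

    /**
     * Tries each detector in order and returns the first guess whose
     * confidence reaches minConfidence, or empty if none does.
     */
    static Optional<Guess> detect(String text, List<Detector> detectors,
                                  double minConfidence) {
        for (Detector detector : detectors) {
            Optional<Guess> guess = detector.detect(text);
            if (guess.isPresent() && guess.get().confidence >= minConfidence) {
                return guess;
            }
        }
        return Optional.empty();
    }
}
```

Cheap, exact detectors would go first in the list and statistical ones last, 
so the caller only ever sees a language code and a comparable confidence.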

It's really important to "cheat" and not use heuristics for languages that 
don't need them.
For example, disregarding some strange theoretical/historical cases, you can 
simply look at the Unicode properties of the characters in the document to 
determine that it's in the Greek language, as it's basically the only modern 
language using the Greek alphabet.
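
As a sketch of the Greek shortcut, the JDK's Character.UnicodeScript (Java 7 
and later) can classify each code point by script. The class name, the 0.8 
threshold and the choice to return null when the shortcut does not apply are 
assumptions for illustration, not something taken from the patch.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper, not part of Tika or Solr. */
final class ScriptShortcut {

    /** Fraction of the letters in text that belong to the given script. */
    static double scriptRatio(String text, UnicodeScript script) {
        int letters = 0;
        int matching = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation and whitespace
            }
            letters++;
            if (UnicodeScript.of(cp) == script) {
                matching++;
            }
        }
        return letters == 0 ? 0.0 : (double) matching / letters;
    }

    /** Returns "el" when the text is mostly Greek script, otherwise null. */
    static String detectByScript(String text) {
        // 0.8 is an arbitrary cutoff that tolerates some embedded Latin text.
        return scriptRatio(text, UnicodeScript.GREEK) >= 0.8 ? "el" : null;
    }
}
```

A null result would mean "no shortcut applies", and the text would then go to 
the statistical detector.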


> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor 
> class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
