[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966964#action_12966964 ]
Jan Høydahl commented on SOLR-1979:
-----------------------------------

Simply allowing the threshold for isReasonablyCertain() to be set is probably not enough to get robust detection, because the distance measure is very sensitive to the length of the profiles in use. It is therefore somewhat dangerous to expose getDistance() as in TIKA-568, because that distance is an internal value, not well normalized, and bound to change in future versions of Tika. See TIKA-369 and TIKA-496.

I think the right way to go is to solve these two issues first. Once getDistance() is no longer biased towards profile length, we can write a new isReasonablyCertain() implementation that takes into account the relative distance between the first and second candidate languages...

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act
> upon it, such as indexing the content into language-aware fields. Another
> use case is to be able to filter/facet on language for arbitrary unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
> <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>   <str name="inputFields">name,subject</str>
>   <str name="outputField">language_s</str>
>   <str name="idField">id</str>
>   <str name="fallback">en</str>
> </processor>
> {code}
> It will then read the text from the inputFields name and subject, perform
> language identification, and output the ISO code for the detected language in
> the outputField. If no language was detected, the fallback language is used.
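
To make the relative-distance idea from the comment concrete, here is a minimal sketch in Java. It is not Tika or Solr code: the class name RelativeDistanceCertainty, the method signature, and the minRelativeGap parameter are all hypothetical, and the sketch assumes we already have one distance per candidate language where a smaller value means a closer match. It only shows how comparing the best candidate against the runner-up removes the dependence on the absolute scale of the distance; it does not by itself fix the profile-length bias discussed in TIKA-369 and TIKA-496.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not part of Tika or Solr: decides certainty from the
 * relative gap between the best and second-best candidate distances.
 */
public final class RelativeDistanceCertainty {

    private RelativeDistanceCertainty() {
    }

    /**
     * @param distances      assumed input: one distance per candidate language,
     *                       smaller meaning closer to the text profile
     * @param minRelativeGap required gap between best and runner-up, expressed
     *                       as a fraction of the runner-up distance (e.g. 0.3)
     * @return true if the best candidate is clearly ahead of the runner-up
     */
    public static boolean isReasonablyCertain(double[] distances, double minRelativeGap) {
        // With fewer than two candidates there is nothing to compare against.
        if (distances == null || distances.length < 2) {
            return false;
        }
        // Sort a copy so the caller's array is left untouched.
        double[] sorted = distances.clone();
        Arrays.sort(sorted);
        double best = sorted[0];
        double second = sorted[1];
        // A runner-up distance of zero means an exact tie at zero.
        if (second <= 0.0) {
            return false;
        }
        // The ratio is unchanged if all distances are multiplied by a constant.
        return (second - best) / second >= minRelativeGap;
    }
}
```

With a required gap of 0.3, the distances {0.10, 0.20, 0.25} and {1.0, 2.0, 2.5} both count as certain, since the best candidate is 50% closer than the runner-up in each case, while {0.10, 0.11} does not.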