[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967032#action_12967032
 ] 

Jan Høydahl commented on SOLR-1979:
-----------------------------------

@Robert: Yes, there must be a well-defined way to tell whether or not the 
language even has a profile. HOW we improve detection certainty is not 
important, but comparing the top n distances could help. I'm also in favor of 
including metrics other than profile similarity if they help, although 
languages with unique scripts are already covered by profile similarity. 
Detailed solution discussions should continue in TIKA-369.
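
To illustrate the "top n distances" idea, here is a minimal sketch. It assumes 
the detector can give us a distance per language profile (lower means closer); 
that is an assumption, not something the current Tika LanguageIdentifier is 
known to expose. The class name, method name and both thresholds are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not Tika's or Solr's actual API. */
final class DetectionCertaintySketch {

    /**
     * Returns the closest language, or null when the result is not trustworthy:
     * either no profile is close enough (the language probably has no profile),
     * or the runner-up is nearly as close as the winner (ambiguous result).
     */
    static String detect(Map<String, Double> distances,
                         double maxDistance, double minMargin) {
        // Rank candidate languages by distance, closest first.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(distances.entrySet());
        ranked.sort(Map.Entry.comparingByValue());

        if (ranked.isEmpty()) {
            return null;
        }
        Map.Entry<String, Double> best = ranked.get(0);

        // Assumed rule 1: even the best profile is too far away.
        if (best.getValue() > maxDistance) {
            return null;
        }
        // Assumed rule 2: the top two candidates are too close to call.
        if (ranked.size() > 1
                && ranked.get(1).getValue() - best.getValue() < minMargin) {
            return null;
        }
        return best.getKey();
    }
}
```

A caller would fall back to a default language (or leave the field unset) 
whenever this returns null. Sensible threshold values would have to be found 
by experiment.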

Macro languages: See TIKA-493

It makes sense to allow detection of languages outside 639-1. I believe 
RFC3066 and BCP47 both reuse the 639 codes, so if there is a 2-letter code for 
a language, it will be used. 639-1 is what "everyone" already knows.

In general, improvements should be made in Tika and then used in Solr, so that 
we build one strong language detection library.

@Grant: I actually planned to do the regex-based field name mapping in a 
separate UpdateProcessor, to make things more flexible. Example:
{code:xml} 
  <processor class="org.apache.solr.update.processor.LanguageFieldMapperUpdateProcessor">
    <str name="languageField">language</str>
    <str name="fromRegEx">(.*?)_lang</str>
    <str name="toRegEx">$1_$lang</str>
    <str name="notSupportedLanguageToRegEx">$1_t</str>
    <str name="supportedLanguages">de,en,fr,it,es,nl</str>
  </processor>
{code} 
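
As a rough illustration of how such a mapping could behave, here is a small 
sketch. It is not the proposed processor itself: the class and method names 
are hypothetical, and it assumes that $lang in the target pattern stands for 
the detected language code and that field names which do not match fromRegEx 
are left untouched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the field name mapping; not actual Solr code. */
final class LanguageFieldMapperSketch {
    private final Pattern from;
    private final String to;
    private final String notSupportedTo;
    private final Set<String> supported;

    LanguageFieldMapperSketch(String fromRegEx, String toRegEx,
                              String notSupportedLanguageToRegEx,
                              String supportedLanguages) {
        this.from = Pattern.compile(fromRegEx);
        this.to = toRegEx;
        this.notSupportedTo = notSupportedLanguageToRegEx;
        this.supported = new HashSet<>(Arrays.asList(supportedLanguages.split(",")));
    }

    /** Maps a source field name to its target name for the given language. */
    String map(String fieldName, String language) {
        Matcher m = from.matcher(fieldName);
        if (!m.matches()) {
            return fieldName; // assumption: non-matching fields are not renamed
        }
        // Choose the template, then substitute the literal "$lang" placeholder
        // before the regex engine resolves group references such as $1.
        String template = supported.contains(language) ? to : notSupportedTo;
        template = template.replace("$lang", Matcher.quoteReplacement(language));
        return m.replaceFirst(template);
    }
}
```

With the configuration above, a document with language=de would have 
title_lang mapped to title_de, while an unsupported language such as ja would 
map it to title_t.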

Your idea of detecting the language of individual fields in one pass is also 
interesting. I'd love to see metadata support in SolrInputDocument, so that 
one processor could annotate the fields it analyzed with a @language. The next 
processor could then act on that metadata to rename the fields...

@Yonik: By allowing regex-based naming of fields, we give users a generic tool 
to avoid field name clashes: they simply pick the pattern. Mapping multiple 
languages to the same suffix also makes sense.


> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


