[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124366#comment-13124366 ]

Jan Høydahl commented on SOLR-1979:

Fixed overview.html in branch

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979-branch_3x.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124343#comment-13124343 ]

T Jake Luciani commented on SOLR-1979:

The build on the 3x branch is still failing because solr/contrib/langid/src/java/overview.html was only committed to trunk. This file needs to be added to branch_3x as well.
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121569#comment-13121569 ]

Mark Miller commented on SOLR-1979:

Nice! Great feature to get in - thanks guys.
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107867#comment-13107867 ]

Jan Høydahl commented on SOLR-1979:

Question: Since I plan to commit this for both 3.x and 4.x, I will be adding the CHANGES entry under the 3.5 section, also for TRUNK. I know there has been some discussion around where to log changes, but as long as 4.0 is not released before 3.5, it will always be true that the feature was released in 3.5 and exists in all later revisions, right?
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102723#comment-13102723 ]

Jan Høydahl commented on SOLR-1979:

Any changes you'd like before committing this? Lance, what config param changes did you have in mind?
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102646#comment-13102646 ]

Jan Høydahl commented on SOLR-1979:

Yep, it will skip detection if the field defined in langid.langField is not empty and langid.overwrite==false
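The skip rule confirmed in the comment above can be sketched roughly as follows. This is a hypothetical, self-contained Java illustration and not the processor's actual code: the class name, `resolveLanguage`, `detect` and the Map-based document are invented for the example, and only the two settings (`langid.langField`, `langid.overwrite`) come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the skip rule; not the actual Solr processor code. */
final class LangIdSkipSketch {

    /** Stand-in for an expensive detector; a real one would inspect the text. */
    static String detect(Map<String, String> doc) {
        return "en";
    }

    /**
     * Returns the language to use for the document. Detection is skipped when
     * the language field already holds a value and overwrite is false.
     */
    static String resolveLanguage(Map<String, String> doc, String langField, boolean overwrite) {
        String existing = doc.get(langField);
        boolean hasValue = existing != null && !existing.trim().isEmpty();
        if (hasValue && !overwrite) {
            return existing; // pre-known language wins, the detector is never called
        }
        String detected = detect(doc);
        doc.put(langField, detected);
        return detected;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("text", "Dit is een test");
        doc.put("language_s", "nl"); // language already known upstream, e.g. from a crawler
        System.out.println(resolveLanguage(doc, "language_s", false)); // prints "nl"
        System.out.println(resolveLanguage(doc, "language_s", true));  // prints "en"
    }
}
```

With overwrite left at false, documents that arrive with a language code never reach the detector, which is the cost saving asked about in the thread.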
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102578#comment-13102578 ]

Markus Jelsma commented on SOLR-1979:

Hi. This is not what I understood from reading the wiki doc. Will the update processor skip detection with these settings? It's rather costly on many docs. Anyway, this is great work already, thanks!
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102573#comment-13102573 ]

Jan Høydahl commented on SOLR-1979:

@Markus: Sure. If you put your pre-known language code in the field configured in langid.langField and use langid.overwrite=false, you will obtain that behavior.
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102520#comment-13102520 ]

Markus Jelsma commented on SOLR-1979:

Hi Jan,

Can we also use the mapping feature without detection? Our detection is done in a Nutch cluster so we already identified many millions of docs.

Thanks
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102374#comment-13102374 ]

Jan Høydahl commented on SOLR-1979:

Updated documentation for the processor is now at http://wiki.apache.org/solr/LanguageDetection

@Lance: Which params did you have in mind as candidates for a keyword instead of true/false, and for what potential future reasons?
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101612#comment-13101612 ]

Lance Norskog commented on SOLR-1979:

I'm impressed! This is a lot of work and empirical testing for a difficult problem.

Comments:

There are a few parameters that are true/false, but in the future you might want a third answer. It might be worth making the decision via a keyword so you can add new keywords later.

About the multiple languages in one field problem: you can't solve everything at once. The other document analysis components like UIMA should be able to identify parts of documents, and then you use this on one part at a time. This is the point of a modular toolkit: you combine the tools to solve advanced problems.
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076259#comment-13076259 ]

Jan Høydahl commented on SOLR-1979:

This has been tested on a real dataset of several hundred thousand docs, including HTML, office docs and multiple other formats, and it works well. I'd like some more pairs of eyes on this, however.

One thing which is less than perfect is that the threshold conversion from Tika currently parses out the (internal) distance value from a String, for lack of a getDistance() method (TIKA-568). This is a bit of a hack, but I argue it's a beneficial one since we can now configure langid.threshold to something meaningful for our own data instead of the preset binary isReasonablyCertain(). As we also normalize to a value between 0 and 1, we abstract away the Tika implementation detail, and are free to use any improved distance measures from Tika in the future, e.g. as a result of TIKA-369, or even plug in a non-Tika identifier or a hybrid solution.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
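To make the threshold idea above concrete, here is a minimal Java sketch of parsing a distance out of a string and normalizing it to a certainty between 0 and 1. Everything specific in it is an assumption for illustration: the input format `"en (0.02)"`, the linear mapping with a factor of 5, and the names `parseDistance`, `certainty` and `accept` are hypothetical and are not taken from the patch or from Tika.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of distance parsing and normalization; not the patch's code. */
final class CertaintySketch {

    // Assumed input shape: "<lang> (<distance>)", for example "en (0.02)".
    private static final Pattern DISTANCE = Pattern.compile("\\(([0-9]*\\.?[0-9]+)\\)");

    /** Extracts the number inside the parentheses. */
    static double parseDistance(String identifierText) {
        Matcher m = DISTANCE.matcher(identifierText);
        if (!m.find()) {
            throw new IllegalArgumentException("no distance found in: " + identifierText);
        }
        return Double.parseDouble(m.group(1));
    }

    /** Maps a distance (smaller is better) to a certainty clamped to [0, 1]. */
    static double certainty(double distance) {
        double c = 1.0 - 5.0 * distance; // assumed scale factor, tune for your own data
        return Math.max(0.0, Math.min(1.0, c));
    }

    /** True when the certainty reaches the configured threshold. */
    static boolean accept(String identifierText, double threshold) {
        return certainty(parseDistance(identifierText)) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(certainty(parseDistance("en (0.02)")));
        System.out.println(accept("en (0.02)", 0.5));
        System.out.println(accept("en (0.3)", 0.5));
    }
}
```

Because callers only ever see the normalized value, the parsing step could later be swapped for a proper accessor or a different identifier without changing how the threshold is configured.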
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053227#comment-13053227 ]

Jan Høydahl commented on SOLR-1979:

One question regarding the JUnit test: I now use
{code}
assertU(commit());
{code}
How can I add update request params to this commit? To select another update chain from different tests, I'd like to add update params on the fly, e.g.:
{code}
assertU(commit(), "update.chain=langid2");
{code}
[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043448#comment-13043448 ]

Jan Høydahl commented on SOLR-1979:

Continuing on this, implementing the ideas above...

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>
> {code}
> It will then read the text from inputFields name and subject, perform
> language identification and output the ISO code for the detected language in
> the outputField. If no language was detected, fallback language is used.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971400#action_12971400 ]

Tommaso Teofili commented on SOLR-1979:

bq. Keep it basic in first version. Allow for per-document and per-field detection. Make field-mapping configurable and optional (default off), allowing people to chain in their own mapper downstream if they choose.

I agree, this sounds good for a basic implementation.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971338#action_12971338 ]

Jan Høydahl commented on SOLR-1979:

{quote}
Jan, do you have any updates to the patch? I'd like to move forward with the basic functionality at least, but I still think we need the field mapping stuff, or we should punt all field mapping stuff to another processor. WDYT?
{quote}

I don't have any updates. Keep it basic in the first version. Allow for per-document and per-field detection. Make field-mapping configurable and optional (default off), allowing people to chain in their own mapper downstream if they choose.

Mixed-language per field is a different beast and should be dealt with later. It probably requires analysis changes as well if we want analyzers to pick up language from payloads or something.

My 2 cents
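The "field-mapping configurable and optional (default off)" idea from this comment could look roughly like the following Java sketch. It is only an illustration under stated assumptions: the `_<lang>` suffix naming, the class name, and the `mapFields` signature are hypothetical and not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of optional field mapping; not the patch's code. */
final class FieldMappingSketch {

    /**
     * Returns a copy of the document in which the selected fields are renamed
     * to "<field>_<lang>". With mapEnabled=false the copy is unchanged, so a
     * downstream processor is free to do its own mapping.
     */
    static Map<String, Object> mapFields(Map<String, Object> doc, Set<String> fieldsToMap,
                                         String lang, boolean mapEnabled) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            boolean rename = mapEnabled && fieldsToMap.contains(e.getKey());
            out.put(rename ? e.getKey() + "_" + lang : e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("title", "Hello world");
        Set<String> fields = new HashSet<>(Arrays.asList("title"));
        System.out.println(mapFields(doc, fields, "en", false).keySet()); // prints [id, title]
        System.out.println(mapFields(doc, fields, "en", true).keySet());  // prints [id, title_en]
    }
}
```

Keeping the mapping behind a flag that defaults to off matches the suggestion that a first version should only detect and record the language.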
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971322#action_12971322 ]

Grant Ingersoll commented on SOLR-1979:

bq. What about leveraging payloads (we can output term|payload strings to the payload field type) for associating languages with fields?

Yeah, that could be used with mixed language text (or a marker token).

Jan, do you have any updates to the patch? I'd like to move forward with the basic functionality at least, but I still think we need the field mapping stuff, or we should punt all field mapping stuff to another processor. WDYT?
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969404#action_12969404 ]

Erik Hatcher commented on SOLR-1979:

What about leveraging payloads (we can output term|payload strings to the payload field type) for associating languages with fields?
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969140#action_12969140 ]

Lance Norskog commented on SOLR-1979:

About Thai: there is a lot of South and East Asian language text out there written in phonetic US-ASCII, especially older, pre-Unicode text. Samples of these texts from different languages have ngram profiles just as distinct as those of the European languages.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969138#action_12969138 ]

Lance Norskog commented on SOLR-1979:

A use case for multi-language fields: PDFs with different languages in different columns.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968827#action_12968827 ] Robert Muir commented on SOLR-1979: --- bq. We also need to detect whether a language is part of a macro language, and add both to languages multivalue field, because it should be possible to filter on Norwegian (no) without specifying both nn and nb, and also for sami (smi) without specifying all of the specific languages. macrolangs: http://www.sil.org/iso639-3/iso-639-3-macrolanguages_20100128.tab collections: http://www.loc.gov/standards/iso639-5/iso639-5.tab.txt
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968820#action_12968820 ] Jan Høydahl commented on SOLR-1979: --- >> I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491 TIKA-492 > Did you plan to also upgrade tika from 639-1 for the Sami languages? the only 639-1 code i see is "se" but this seems to be appropriate only for North Sami. Exactly. That's one example which will need a wider range of codes. I was planning to use 639-2 for those that do not have a 2-letter code, but BCP47 it will be now (although the end result may be more or less the same). We also need to detect whether a language is part of a macrolanguage, and add both to the multivalued languages field, because it should be possible to filter on Norwegian (no) without specifying both nn and nb, and also on Sami (smi) without specifying all of the specific languages.
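The macrolanguage requirement in the comment above (store both the specific code and its parent so that a filter on "no" also matches nn and nb) could be sketched as follows. The class name MacroLanguages and the three-entry table are illustrative assumptions; a real table would be loaded from the ISO 639-3 macrolanguage and ISO 639-5 collection files linked in this thread.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the SOLR-1979 patch.
final class MacroLanguages {
    // Illustrative subset only: specific language code -> macrolanguage or collection code.
    private static final Map<String, String> PARENT = Map.of(
            "nb", "no",    // Norwegian Bokmal -> Norwegian
            "nn", "no",    // Norwegian Nynorsk -> Norwegian
            "se", "smi");  // Northern Sami -> Sami languages (collection code)

    /** Returns the detected code followed by its parent code, if it has one. */
    static Set<String> expand(String detected) {
        Set<String> values = new LinkedHashSet<>();
        values.add(detected);
        String parent = PARENT.get(detected);
        if (parent != null) {
            values.add(parent);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(expand("nb"));
        System.out.println(expand("en"));
    }
}
```

The returned values would all go into the multivalued languages field, so a filter on the parent code matches documents in any of its member languages.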
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968813#action_12968813 ] Robert Muir commented on SOLR-1979: --- bq. I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491 TIKA-492 Did you plan to also upgrade tika from 639-1 for the Sami languages? the only 639-1 code i see is "se" but this seems to be appropriate only for North Sami.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968806#action_12968806 ] Jan Høydahl commented on SOLR-1979: --- Discussion on the process for adding language profiles to TIKA should be continued in TIKA-546. I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491 TIKA-492
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968786#action_12968786 ] Robert Muir commented on SOLR-1979: --- bq. Kind of random that Thai is thrown in there! I agree, I tend to detect Thai by the characters being between U+0E00 and U+0E7F. Anyway, if we add more languages it would be good if one of us could document the process, because many important ones are missing.
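The script-range check described in the comment above can be sketched in a few lines. The class name ThaiScriptCheck and the idea of comparing a letter ratio against a caller-supplied threshold are assumptions for illustration; the comment only states the code point range U+0E00 to U+0E7F.

```java
// Hypothetical helper, not part of Tika or Solr.
final class ThaiScriptCheck {
    /** Fraction of letters in the text whose code point lies in U+0E00..U+0E7F. */
    static double thaiLetterRatio(String text) {
        int letters = 0;
        int thai = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, whitespace, punctuation and combining marks
            }
            letters++;
            if (cp >= 0x0E00 && cp <= 0x0E7F) {
                thai++;
            }
        }
        return letters == 0 ? 0.0 : (double) thai / letters;
    }

    /** The threshold is an assumed tuning knob, e.g. 0.5 for "mostly Thai". */
    static boolean looksThai(String text, double threshold) {
        return thaiLetterRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(looksThai("\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35", 0.5));
        System.out.println(looksThai("hello world", 0.5));
    }
}
```

A range check like this only works for text in Thai script; it says nothing about romanized Thai, which would need a statistical profile instead.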
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968777#action_12968777 ] Grant Ingersoll commented on SOLR-1979: --- Sorry, you are right. See http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties {quote} name.da=Danish name.de=German name.et=Estonian name.el=Greek name.en=English name.es=Spanish name.fi=Finnish name.fr=French name.hu=Hungarian name.is=Icelandic name.it=Italian name.nl=Dutch name.no=Norwegian name.pl=Polish name.pt=Portuguese name.ru=Russian name.sv=Swedish name.th=Thai {quote} Kind of random that Thai is thrown in there!
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968760#action_12968760 ] Robert Muir commented on SOLR-1979: --- bq. Have a look at http://tika.apache.org/0.8/detection.html That page does not have a list of languages.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968757#action_12968757 ] Grant Ingersoll commented on SOLR-1979: --- Have a look at http://tika.apache.org/0.8/detection.html Really, though, you need to dig into the Tika class: LanguageIdentifier. Adding languages, AFAICT, involves building the model accordingly and then letting Tika know about it via a properties file.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968753#action_12968753 ] Robert Muir commented on SOLR-1979: --- bq. I also think we need to get together and add a bunch more languages to Tika b/c it is pretty unacceptable to not have, at a minimum, support for the big Asian languages of CJK. What languages does Tika support in its identifier? I couldn't find an actual list, only a reference to Europarl (http://www.statmt.org/europarl/). Is it just those languages? Also, are there docs on what's necessary (legally and technically) to contribute a new profile? Is just recording ngrams from Creative Commons text acceptable?
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968748#action_12968748 ] Grant Ingersoll commented on SOLR-1979: --- I'm going to be out of pocket for the next week. If someone can put the field mapping stuff up, then I think we will have the basis for a good first pass at this, which we can then iterate on. I also think we need to get together and add a bunch more languages to Tika b/c it is pretty unacceptable to not have, at a minimum, support for the big Asian languages of CJK.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968633#action_12968633 ] Tommaso Teofili commented on SOLR-1979: --- bq. However, have you considered extending the document model to allow metadata per field? Then @language would be a valid field metadata, mostly as a means for later processing to pick up and act on. This can be a valuable mechanism for other inter processor communication as well as to pass info between document centric processing and Analysis. I've also thought about this option and it sounds reasonable, but I think it would be a very large change to the API; so from one point of view I like the idea, but from another I think it could lead to a proliferation of @metadata. So in the end I don't have a strong opinion on that, but I also have to say that I've seen such customizations in a production environment to leverage per-field metadata. Regarding per-field and per-document language fields, I think that a document language field could be handled with two fixed strategies/policies (which can also be extended): # restrictive strategy - if different languages end up mapped into the document language field, then say that the document language is, for example, "x-unspecified" # simple strategy - map all the retrieved languages (per field) into the document language field as different values (so multivalued="true")
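The two document-level strategies proposed in the comment above could be sketched as follows. The class and method names are hypothetical, and treating an empty input as "x-unspecified" under the restrictive strategy is an assumption the comment does not address.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the SOLR-1979 patch.
final class DocumentLanguagePolicy {
    static final String UNSPECIFIED = "x-unspecified";

    /** Restrictive: a single value, which is the shared language or "x-unspecified". */
    static List<String> restrictive(Collection<String> perFieldLanguages) {
        Set<String> distinct = new LinkedHashSet<>(perFieldLanguages);
        if (distinct.size() == 1) {
            return List.copyOf(distinct);
        }
        // Fields disagree, or nothing was detected at all (assumed behaviour).
        return List.of(UNSPECIFIED);
    }

    /** Simple: every distinct per-field language, in first-seen order (multivalued field). */
    static List<String> simple(Collection<String> perFieldLanguages) {
        return List.copyOf(new LinkedHashSet<>(perFieldLanguages));
    }

    public static void main(String[] args) {
        List<String> detected = List.of("en", "fr", "en");
        System.out.println(restrictive(detected));
        System.out.println(simple(detected));
    }
}
```

The restrictive strategy keeps the document language field single-valued, while the simple strategy requires multivalued="true" on that field.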
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968627#action_12968627 ] Jan Høydahl commented on SOLR-1979: --- Allow for both a "language" field and a "languages" (multivalued) field. If fields are mapped, the new name reflects the language, so I don't know if we need a field->lang mapping. However, have you considered extending the document model to allow metadata per field? Then @language would be a valid field metadata, mostly as a means for later processing to pick up and act on. This can be a valuable mechanism for other inter processor communication as well as to pass info between document centric processing and Analysis.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968582#action_12968582 ] Erik Hatcher commented on SOLR-1979: Oh, and don't get me wrong, I get the multivalued language per document need too, here. Anyway, it'll be easy enough to add support for this to be controlled through configuration. In single language per doc mode, basically concatenate all of the fields specified and detect on that and map into a single-valued language field. Language-per-field I get too, of course... just depends on the domain being modeled and in my experience I've seen apps designed both ways. Neither way is the one true way, it just depends. And of course Muir is smirking and saying "heck, you have multiple languages within a field often too, so we need to account for that somehow too". But probably not here, yet.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968576#action_12968576 ] Erik Hatcher commented on SOLR-1979: If a list of fields (by name) is mapped into a corresponding parallel identified language code field, do we leave it up to search clients to also know the list of field names to jive a field (say title) with its identified language? A language field shouldn't have to be multivalued - it just doesn't match the domain model of many search applications where there will only ever be one and only one language per document.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968528#action_12968528 ] Grant Ingersoll commented on SOLR-1979: --- bq. So for all unmapped languages, you may want to map to a single generic field, or not map at all (leave field as is). It currently leaves it in the original field. bq. Also, if there are multiple input fields, the current patch would create multiple language field values requiring that field to be multi-valued. Is the goal here to identify a single language for a document? Or a separate language value for each of the input fields (which seems odd to me)? Current patch requires multivalued language field. I figure the main thing you want the lang. field for is faceting and filtering, but it can be changed. As for the broader goal, I think it makes sense to detect languages per field and not per document. In other words, you can have multiple languages in a single document.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968445#action_12968445 ] Yonik Seeley commented on SOLR-1979: bq. In skimming the current patch, it looks like fields get mapped no matter what. What if I just want the language detected and added as another field, but no field mapping desired? Yeah, that's sort of in line with my: bq. And just because you can detect a language doesn't mean you know how to handle it differently... so also have an optional catchall that handles all languages not specifically mapped. So for all unmapped languages, you may want to map to a single generic field, or not map at all (leave field as is). I guess it also depends on the general strategy... if you are detecting language on the "body" field, are we using a copyField type approach and only storing the body field while indexing as body_enText, or are we moving the field from "body" to "body_enText"? bq. Also, if there are multiple input fields, the current patch would create multiple language field values requiring that field to be multi-valued. Is the goal here to identify a single language for a document? I could see both making sense.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967214#action_12967214 ]

Grant Ingersoll commented on SOLR-1979:

bq. There should be a way to output the language for the whole document to some field as some applications need to filter on language.

There is. It's the langField.

bq. Can't we validate the output mapping (and log it!) at initialization time?

To some extent, but users can also pass it in.

bq. We should not be using 639-1 codes in any APIs!!!

I'll update the patch.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967211#action_12967211 ]

Jan Høydahl commented on SOLR-1979:

@Grant: "I dropped the outputField setting and a number of other settings"

There should be a way to output the language for the whole document to some field, as some applications need to filter on language. I like making most things configurable, but with good defaults that fit most needs. The default could be to detect a document-wide language from all input fields and output this to a "language_s" field, unless you specify params docLangInputFields=f1,f2.. and docLangOutputField=nn. Likewise, make it easy to disable field renaming.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967204#action_12967204 ]

Yonik Seeley commented on SOLR-1979:

bq. Yonik, I wasn't planning on relying on dynamic fields necessarily. It may make sense to have users either predeclare the variations.

Sure, but the problem was the ease with which a generated field of originalname_${langcode} could clash with existing fields (regardless of whether they are dynamic fields), because there are so many different language codes. If we use regex naming as Jan suggests (or another configurable mechanism), then the issue comes down to what we configure by default or by example.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967201#action_12967201 ]

Robert Muir commented on SOLR-1979:

bq. Both also rely on those fields existing.

I don't think this check should be at "runtime" either. What if you are indexing lots of documents and suddenly you encounter a Thai document (or one mis-detected as Thai!) and the whole thing fails? Can't we validate the output mapping (and log it!) at initialization time?
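A minimal sketch of the init-time check asked for above, assuming the processor holds a language-to-field mapping and can obtain the set of field names declared in the schema. The class and method names are hypothetical and do not come from any patch attached to this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not taken from any SOLR-1979 patch.
final class LangFieldMappingValidator {

    /**
     * Returns one "lang -> field" entry for every mapping whose target field
     * is not declared in the schema. An empty list means the mapping is safe.
     */
    static List<String> findMissingFields(Map<String, String> langToField,
                                          Set<String> schemaFields) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> e : langToField.entrySet()) {
            if (!schemaFields.contains(e.getValue())) {
                missing.add(e.getKey() + " -> " + e.getValue());
            }
        }
        return missing;
    }

    /** Meant to run once at initialization so a bad mapping fails fast. */
    static void validateOrThrow(Map<String, String> langToField,
                                Set<String> schemaFields) {
        List<String> missing = findMissingFields(langToField, schemaFields);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Language mapping targets undeclared fields: " + missing);
        }
    }
}
```

Failing at startup rather than on the first unexpected document keeps a long indexing run from aborting halfway through. As noted elsewhere in the thread, mappings passed in per request would still need a runtime check.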
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967191#action_12967191 ]

Robert Muir commented on SOLR-1979:

bq. Agreed. The only thing we are doing now is using the language that the language detector returns as part of the field name. Both of these steps are easily overridable. Both also rely on those fields existing.

"Easily overridable" does not solve the problem! Please don't commit this; it's so easy to just change the code, variable names, and documentation here to say these interfaces are BCP47 language ids. We should not be using 639-1 codes in any APIs!!!
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967186#action_12967186 ]

Grant Ingersoll commented on SOLR-1979:

bq. but in solr, when designing up front, i was just saying we shouldn't limit any abstract portion to 639-1 when another implementation might support 3066 or BCP47... we should make sure we allow that.

Agreed. The only thing we are doing now is using the language that the language detector returns as part of the field name. Both of these steps are easily overridable. Both also rely on those fields existing.

bq. This could be problematic given a large set of language codes since they could collide with existing dynamic field definitions.

Yonik, I wasn't planning on relying on dynamic fields necessarily. It may make sense to have users either predeclare the variations.

All in all, I would like to see Solr have better support for languages in both the schema and the config. In my experience, in apps that have to support a lot of languages, there is a lot of redundancy in both the schema and the config.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967076#action_12967076 ]

Robert Muir commented on SOLR-1979:

{quote}
It makes sense to allow for detecting languages outside 639-1, and I believe RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 2-letter code for a language it will be used. 639-1 is what "everyone" already knows. In general, improvements should be done in Tika space, then use those in Solr, thus building one strong language detection library.
{quote}

Yes they do: the 639-1 codes that Tika outputs are also valid BCP47 codes :)

But in Solr, when designing up front, I was just saying we shouldn't limit any abstract portion to 639-1 when another implementation might support 3066 or BCP47... we should make sure we allow that.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967048#action_12967048 ]

Grant Ingersoll commented on SOLR-1979:

Note, the patch still needs more tests and needs to check headers, etc. as well as the better field mapping and the proper language support that Robert is talking about.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967046#action_12967046 ]

Grant Ingersoll commented on SOLR-1979:

bq. @Grant: I actually planned to do the regEx based field name mapping in a separate UpdateProcessor, to make things more flexible

I don't really see that it makes it any more flexible. If it was a general purpose mapper, maybe, but since it is tied to the language field, why not just put it in the language processor? I've already made the method that chooses the output field protected. With that, one would merely need to extend the class to provide an alternative to what you have proposed.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967032#action_12967032 ]

Jan Høydahl commented on SOLR-1979:

@Robert: Yes, there must be a way to tell whether or not the language even has a profile, through some well defined method. It's not important HOW we improve detection certainty, but comparing the top n distances could help. I'm also a fan of including other metrics than profile similarity if that can help; however, unique scripts will automatically be covered by profile similarity. Detailed solution discussions should continue in TIKA-369.

Macro languages: See TIKA-493.

It makes sense to allow for detecting languages outside 639-1, and I believe RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 2-letter code for a language it will be used. 639-1 is what "everyone" already knows. In general, improvements should be done in Tika space, then use those in Solr, thus building one strong language detection library.

@Grant: I actually planned to do the regEx based field name mapping in a separate UpdateProcessor, to make things more flexible. Example:

{code:xml}
language
(.*?)_lang
$1_$lang
$1_t
de,en,fr,it,es,nl
{code}

Your thought of allowing language detection for individual fields in one go is also interesting. I'd love to see metadata support in SolrInputDocument, so that one processor could annotate a @language on the fields analyzed. Then the next processor could act on the metadata to rename the field...

@Yonik: By allowing regex naming of field names, we give users a generic tool to avoid field name clashes, by picking the pattern. Mapping multiple languages to the same suffix also makes sense.
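The regex-based renaming proposed above could look roughly like the sketch below. It assumes the five values in the example mean, in order: the field holding the detected language, a source field pattern, a target template, a fallback template, and a whitelist of languages. That reading is a guess, since the parameter names are not visible in the archived message, and the class is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from any SOLR-1979 patch.
final class RegexLangFieldMapper {
    private final Pattern source;        // e.g. "(.*?)_lang"
    private final String mapped;         // e.g. "$1_$lang"
    private final String fallback;       // e.g. "$1_t"
    private final Set<String> whitelist; // languages that get their own field

    RegexLangFieldMapper(String sourceRegex, String mapped, String fallback,
                         String... langs) {
        this.source = Pattern.compile(sourceRegex);
        this.mapped = mapped;
        this.fallback = fallback;
        this.whitelist = new HashSet<>(Arrays.asList(langs));
    }

    /** Returns the renamed field, or the input unchanged if it does not match. */
    String map(String fieldName, String lang) {
        Matcher m = source.matcher(fieldName);
        if (!m.matches()) {
            return fieldName;
        }
        // Substitute $lang literally first, so only $1-style group
        // references remain for the regex engine to expand.
        String template = whitelist.contains(lang)
                ? mapped.replace("$lang", Matcher.quoteReplacement(lang))
                : fallback;
        StringBuffer out = new StringBuffer();
        m.appendReplacement(out, template);
        return out.toString();
    }
}
```

With the values from the example, `title_lang` would become `title_en` for English and `title_t` for a language outside the whitelist; a field such as `id` that does not match the pattern is left alone.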
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967019#action_12967019 ]

Robert Muir commented on SOLR-1979:

bq. Yeah, that makes sense, however, I believe Tika returns 639.

Right, but 639 is just a subset of 3066 etc. So, ignore what Tika does. Its 639 identifiers are also valid 3066. Our API should at least be 3066; Java7/ICU already support BCP47 locale identifiers etc., so you get the normalization there for free.

{quote}
It would probably also be nice to be able to map a number of languages to a single field: say you have a single analyzer that can handle CJK, then you may want that whole collection of languages mapped to a single _cjk field. And just because you can detect a language doesn't mean you know how to handle it differently... so also have an optional catchall that handles all languages not specifically mapped.
{quote}

Both of these are good reasons why we must avoid 639-1. We should be able to use things like macrolanguages and undetermined language.
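As a small illustration of the "normalization for free" point, the sketch below round-trips a tag through `java.util.Locale` (available since Java 7). The wrapper class and the underscore handling are assumptions added for the example; only `Locale.forLanguageTag` and `Locale.toLanguageTag` are standard API.

```java
import java.util.Locale;

// Hypothetical wrapper for illustration; only the Locale calls are standard API.
final class LanguageTags {

    /**
     * Normalizes a language tag to canonical BCP 47 casing.
     * Empty, null, or malformed input yields "und" (undetermined).
     */
    static String normalize(String tag) {
        if (tag == null) {
            return "und";
        }
        // Accept "en_US"-style input as a convenience (an assumption of this sketch).
        return Locale.forLanguageTag(tag.replace('_', '-')).toLanguageTag();
    }

    /** Returns the script subtag, e.g. "Hans" for "zh-Hans", or "" if absent. */
    static String script(String tag) {
        return Locale.forLanguageTag(normalize(tag)).getScript();
    }
}
```

A plain two-letter code such as `en` passes through unchanged, which is why detector output in 639-1 remains usable, while tags such as `zh-Hans` and `zh-Hant` can carry the Simplified/Traditional distinction that a bare `zh` cannot.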
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967016#action_12967016 ]

Yonik Seeley commented on SOLR-1979:

bq. The new field is made by concatenating the original field name with "_" + the ISO 639 code.

This could be problematic given a large set of language codes since they could collide with existing dynamic field definitions. Perhaps something with "text" in the name also? Perhaps fieldName_${langCode}Text

Examples:
name_enText
name_frText

It would probably also be nice to be able to map a number of languages to a single field: say you have a single analyzer that can handle CJK, then you may want that whole collection of languages mapped to a single _cjk field. And just because you can detect a language doesn't mean you know how to handle it differently... so also have an optional catchall that handles all languages not specifically mapped.
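The grouping and catchall ideas above can be sketched as a small lookup, assuming a configured language-to-suffix table plus an optional default suffix. The class name and the suffix values used below are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not taken from any SOLR-1979 patch.
final class LangSuffixMapper {
    private final Map<String, String> langToSuffix = new HashMap<>();
    private final String catchAllSuffix; // null means "leave the field as is"

    LangSuffixMapper(String catchAllSuffix) {
        this.catchAllSuffix = catchAllSuffix;
    }

    /** Maps every given language to the same suffix; returns this for chaining. */
    LangSuffixMapper map(String suffix, String... langs) {
        for (String lang : langs) {
            langToSuffix.put(lang, suffix);
        }
        return this;
    }

    /** Chooses the target field for a detected language. */
    String targetField(String fieldName, String lang) {
        String suffix = langToSuffix.get(lang);
        if (suffix == null) {
            suffix = catchAllSuffix; // language detected but not specifically mapped
        }
        return suffix == null ? fieldName : fieldName + "_" + suffix;
    }
}
```

For example, `new LangSuffixMapper("generic").map("cjk", "zh", "ja", "ko").map("enText", "en")` would send a Japanese `name` to `name_cjk`, an English one to `name_enText`, and anything else to `name_generic`. Passing `null` as the catchall leaves unmapped languages in the original field.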
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967011#action_12967011 ]

Grant Ingersoll commented on SOLR-1979:

Another thought here is that, over time, this class becomes a base class and it becomes easy to replace the language detection piece. That way one gets all the infrastructure of this class, but can plug in their own detection. In fact, I'm going to do that right now.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967010#action_12967010 ]

Grant Ingersoll commented on SOLR-1979:

bq. I would like to see RFC 3066 instead

Yeah, that makes sense, however, I believe Tika returns 639. (Tika doesn't recognize Chinese yet at all). One approach is we could normalize, I suppose. Another is to fix Tika. I'd really like to see Tika support more languages, too.

Longer term, I'd like to not do the fieldName_LangCode thing at all and instead let the user supply a string that could have variable substitution if they want, something like fieldName_${langCode}, or it could be ${langCode}_fieldName or it could just be another literal.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966978#action_12966978 ]

Robert Muir commented on SOLR-1979:

We really need to not be using ISO 639-1 here. For example, it's not expressive enough: it doesn't differentiate between Simplified and Traditional Chinese, yet SmartChineseAnalyzer only works on Simplified.

I would like to see RFC 3066 instead.
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966972#action_12966972 ] Robert Muir commented on SOLR-1979: --- bq. cause that distance measure is kind of an internal value, not very normalized and is bound to change in future versions of TIKA. bq. we can make a new isReasonablyCertain() implementation taking into account the relative distance between first and second candidate languages... I don't follow the logic: if it's not very normalized, then it seems like this approach doesn't tell you anything... language 1 could be uncertain, and language 2 is just completely uncertain, but that tells you nothing: isn't it like trying to determine if a good Lucene search result score is "certainly a hit", and not really the right way to go? For example: consider the case where the language isn't supported at all by Tika (I don't see a list of supported languages anywhere, by the way!). It would be good for us to know that the detection is uncertain at all... how relatively uncertain it is with regard to the next language is not very important. I think it's also important that we be able to get this uncertainty (or whatever we call it) in a way that is agnostic of the implementation. For example, we should be able to somehow think of chaining detectors... It's really important to "cheat" and not use heuristics for languages that don't need them. 
For example, disregarding some strange theoretical/historical cases, you can simply look at the Unicode properties in the document to determine that it's in the Greek language, as it's basically the only modern language using the Greek alphabet. > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Assignee: Grant Ingersoll >Priority: Minor > Attachments: SOLR-1979.patch > > > We need the ability to detect language of some random text in order to act > upon it, such as indexing the content into language aware fields. Another > usecase is to be able to filter/facet on language on random unstructured > content. > To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The > processor is configurable like this: > {code:xml} >class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> > name,subject > language_s > id > en > > {code} > It will then read the text from inputFields name and subject, perform > language identification and output the ISO code for the detected language in > the outputField. If no language was detected, fallback language is used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
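The "cheat" Robert Muir describes in the comment above, deciding Greek from Unicode properties alone, might look roughly like the sketch below. It relies on Character.UnicodeScript (Java 7 and later). The class name, the threshold parameter, and the convention of returning null so that a caller can fall through to the next detector in a chain are all assumptions made for this example, not something proposed in the thread.

```java
/** Illustrative script-based shortcut; names and threshold are assumptions. */
final class ScriptHeuristic {
    /** Fraction of letters whose Unicode script is GREEK (0.0 when there are no letters). */
    static double greekLetterRatio(String text) {
        int letters = 0;
        int greek = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // advance by one full code point
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation and whitespace
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.GREEK) {
                greek++;
            }
        }
        return letters == 0 ? 0.0 : (double) greek / letters;
    }

    /** Returns "el" when Greek letters dominate, or null so a caller can try another detector. */
    static String detect(String text, double threshold) {
        return greekLetterRatio(text) >= threshold ? "el" : null;
    }
}
```

A statistical detector would then only be consulted when this check returns null, which is one simple way to chain detectors.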
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966970#action_12966970 ] Jan Høydahl commented on SOLR-1979: --- The idField input parameter is just used for decent logging if detection fails. It would be more elegant to get the id field name automatically through SolrCore... > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Assignee: Grant Ingersoll >Priority: Minor > Attachments: SOLR-1979.patch > > > We need the ability to detect language of some random text in order to act > upon it, such as indexing the content into language aware fields. Another > usecase is to be able to filter/facet on language on random unstructured > content. > To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The > processor is configurable like this: > {code:xml} >class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> > name,subject > language_s > id > en > > {code} > It will then read the text from inputFields name and subject, perform > language identification and output the ISO code for the detected language in > the outputField. If no language was detected, fallback language is used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966964#action_12966964 ] Jan Høydahl commented on SOLR-1979: --- Simply allowing the threshold for isReasonablyCertain() to be set is probably not enough to get robust detection. This is because the distance measure is very sensitive to the length of the profiles in use. Thus, it is a bit dangerous to expose getDistance() as in TIKA-568, because that distance measure is kind of an internal value, not very normalized, and is bound to change in future versions of Tika. See TIKA-369 and TIKA-496. I think the right way to go is solving these two issues first. By fixing getDistance() so that it is not biased towards profile length, we can make a new isReasonablyCertain() implementation that takes into account the relative distance between the first and second candidate languages... > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Assignee: Grant Ingersoll >Priority: Minor > Attachments: SOLR-1979.patch > > > We need the ability to detect language of some random text in order to act > upon it, such as indexing the content into language aware fields. Another > usecase is to be able to filter/facet on language on random unstructured > content. > To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The > processor is configurable like this: > {code:xml} >class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> > name,subject > language_s > id > en > > {code} > It will then read the text from inputFields name and subject, perform > language identification and output the ISO code for the detected language in > the outputField. If no language was detected, fallback language is used. -- This message is automatically generated by JIRA. 
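Jan Høydahl's idea in the comment above, judging certainty by the relative distance between the first and second candidate languages, could be sketched as below. This is not Tika's isReasonablyCertain(); the class, the parameter names, and the minGap threshold are hypothetical, and the sketch assumes distances where a smaller value means a closer match.

```java
/** Hypothetical relative-gap check; not Tika's implementation. */
final class RelativeCertainty {
    /**
     * @param best   distance of the closest language profile (smaller means more similar)
     * @param second distance of the runner-up profile
     * @param minGap required relative gap, e.g. 0.2 means the runner-up
     *               must be at least 20% farther away than the best match
     */
    static boolean isReasonablyCertain(double best, double second, double minGap) {
        if (best < 0 || second < best) {
            throw new IllegalArgumentException("expected 0 <= best <= second");
        }
        if (second == 0.0) {
            return false; // both candidates match equally well; nothing separates them
        }
        return (second - best) / second >= minGap;
    }
}
```

As Robert Muir's reply in this thread notes, a relative gap says nothing about whether the best match is itself good, for instance when the language is not supported at all, so a check like this would presumably need to be paired with some absolute measure.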
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966955#action_12966955 ] Grant Ingersoll commented on SOLR-1979: --- See http://wiki.apache.org/solr/LanguageDetection for the start of documentation. bq. isReasonablyCertain() always returns false See TIKA-568. > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Assignee: Grant Ingersoll >Priority: Minor > Attachments: SOLR-1979.patch > > > We need the ability to detect language of some random text in order to act > upon it, such as indexing the content into language aware fields. Another > usecase is to be able to filter/facet on language on random unstructured > content. > To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The > processor is configurable like this: > {code:xml} >class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> > name,subject > language_s > id > en > > {code} > It will then read the text from inputFields name and subject, perform > language identification and output the ISO code for the detected language in > the outputField. If no language was detected, fallback language is used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899568#action_12899568 ] Jan Høydahl commented on SOLR-1979: --- I have implemented a first-shot patch using the Tika LanguageIdentifier. It is unfortunately quite limited in features, and for short text segments, isReasonablyCertain() always returns false :( Also, the number of languages supported is still quite low. But it works as a start, and then we can focus on improving the Tika code in future releases. I plan on putting the patch in contrib/extraction, since it depends on Tika. If I put it relative to main, Solr will not compile unless you put the Tika jar in lib. Agree? > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Priority: Minor > > We need the ability to detect language of some random text in order to act > upon it, such as indexing the content into language aware fields. Another > usecase is to be able to filter/facet on language on random unstructured > content. > To do this, we should wrap the [Nutch > LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html] > in an UpdateProcessor. The processor should be configured like this: > {code:xml} >class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> > title,teaser,body > language > language_display > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884070#action_12884070 ] Chris A. Mattmann commented on SOLR-1979: - I would look at the Language Identifier in Tika (which is based on the Nutch work) as it is likely to be the one that is more maintained going forward IMHO... > Create LanguageIdentifierUpdateProcessor > > > Key: SOLR-1979 > URL: https://issues.apache.org/jira/browse/SOLR-1979 > Project: Solr > Issue Type: New Feature > Components: update >Reporter: Jan Høydahl >Priority: Minor > > We need the ability to detect language of some random text in order to act > upon it, such as indexing the content into language aware fields. Another > usecase is to be able to filter/facet on language on random unstructured > content. > To do this, we should wrap the [Nutch > LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html] > in an UpdateProcessor. The processor should be configured like this: > {code:xml} >class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> > title,teaser,body > language > language_display > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org