[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch
                SOLR-1979-branch_3x.patch

Added final patches which will be committed now.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979-branch_3x.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

New patch:
* Added contrib folders to eclipse dot.classpath
* Added javadoc entries to build.xml
* Fixed Javadoc errors
* Upgraded test case to use schema v1.4

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Fixed java.lang.IndexOutOfBoundsException bug in resolveLanguage() when no languages detected. Added more corner case tests.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Added link to Wiki in example update chain in solrconfig

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Some further improvements:
* Default fallback language if none set is now "" to avoid a NullPointerException
* All individually detected languages are now added to the "langsField" array
* More tests

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection
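The fallback behaviour described in this update can be sketched roughly as follows. This is a Python illustration, not the patch's Java code: `resolve_language` and `collect_languages` are hypothetical names, picking the first detected language and de-duplicating the collected list are assumptions, and the only behaviour taken from the message is that an unset fallback becomes "" and that every individually detected language ends up in the multi-valued languages field.

```python
from typing import Iterable, List, Optional


def resolve_language(detected: List[str], fallback: Optional[str] = None) -> str:
    """Return the first detected language, or the fallback when none was detected.

    Choosing the first entry is an assumption made for this sketch.
    """
    if detected:
        return detected[0]
    # Per the message, an unset fallback is treated as "" instead of None,
    # so callers never have to handle a missing value.
    return fallback if fallback is not None else ""


def collect_languages(per_field: Iterable[str]) -> List[str]:
    """Collect non-empty detected languages in first-seen order.

    Dropping duplicates is an assumption; the message only says that all
    individually detected languages are added to the languages field.
    """
    seen: List[str] = []
    for lang in per_field:
        if lang and lang not in seen:
            seen.append(lang)
    return seen
```

Guarding the empty-list case up front is also what avoids an index error when no language is detected at all.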
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Component/s: contrib - LangId

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Description:
Language identification from document fields, and mapping of field names to language-specific fields based on detected language.
Wrap the Tika LanguageIdentifier in an UpdateProcessor.
See user documentation at http://wiki.apache.org/solr/LanguageDetection

  was:
Language identification from document fields, and mapping of field names to language-specific fields based on detected language.
Wrap the Tika LanguageIdentifier in an UpdateProcessor.

    Fix Version/s: 4.0

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

New patch with these improvements:
* Now also allows config at first level, without
* Added langid to example schema (commented out), so it is really easy to demonstrate

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Patch updated to fit new directory structure, updated comments to point to Wiki doc. Also optimized regex, now pre-compiling patterns instead of using String.replace directly.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Fix Version/s: (was: 3.4)
                   3.5

Moving to 3.5

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Updated to latest trunk, simplified build file, added clean target

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Fixed threshold so that Tika distance 0.1 gives certainty 0.5 and distance 0.02 gives certainty 0.9. The default threshold of 0.5 now works pretty well, at least for the tests...

*New parameters:*

Field name mapping is now configurable with a user-defined pattern. To map ABC_title to title_<lang>, set:
{code}
&langid.map.pattern=ABC_(.*)
&langid.map.replace=$1_{lang}
{code}

A parameter to map multiple detected languages to the same field regex. For example, to map Japanese, Korean and Chinese texts to a field *_cjk, do:
{code}langid.map.lcmap=jp:cjk zh:cjk ko:cjk{code}

Turn off validation of field names against the schema (useful if you want to rename or delete fields later in the UpdateChain):
{code}&langid.enforceSchema=false{code}

*Other changes*

Removed the default on langField, i.e. if langField is not specified, the detected language will not be written anywhere. A typical minimal config for only detecting language and writing to a field is now:
{code}
title,subject,text,keywords language_s
{code}

Also added multiple other languages to the tests.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
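As a rough illustration of the threshold and mapping parameters in this message, the Python sketch below models them outside Solr. It is not the patch's Java code. The function names are hypothetical, the linear formula is simply one mapping that passes through the two stated points (distance 0.1 gives 0.5, distance 0.02 gives 0.9), and the translation of the Java-style `$1` reference into Python's `\g<1>` is a convenience of this sketch.

```python
import re
from typing import Dict


def certainty_from_distance(distance: float) -> float:
    """Linear map through (0.1, 0.5) and (0.02, 0.9), clamped to [0, 1].

    Assumption: the message gives only the two points, not the formula.
    """
    return max(0.0, min(1.0, 1.0 - 5.0 * distance))


def parse_lcmap(spec: str) -> Dict[str, str]:
    """Parse a space-separated 'from:to' list such as 'jp:cjk zh:cjk ko:cjk'."""
    return dict(pair.split(":", 1) for pair in spec.split())


def map_field_name(field: str, lang: str, pattern: str, replace: str,
                   lcmap: Dict[str, str]) -> str:
    """Rename a field using a pattern/replace pair and a language-code map."""
    # Languages listed in the map share one suffix; others keep their own code.
    suffix = lcmap.get(lang, lang)
    # Convert Java-style "$1" group references to Python's "\g<1>" form,
    # then substitute the {lang} placeholder.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replace).replace("{lang}", suffix)
    # Fields that do not match the pattern are returned unchanged.
    return re.sub(pattern, template, field)
```

With the values from the message, `map_field_name("ABC_title", "zh", "ABC_(.*)", "$1_{lang}", parse_lcmap("jp:cjk zh:cjk ko:cjk"))` yields `title_cjk`, while a language outside the map, such as `en`, yields `title_en`.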
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Fix Version/s: 3.4
           Labels: UpdateProcessor (was: )

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Description:
Language identification from document fields, and mapping of field names to language-specific fields based on detected language.
Wrap the Tika LanguageIdentifier in an UpdateProcessor.

  was:
We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
{code:xml}
name,subject language_s id en
{code}
It will then read the text from inputFields name and subject, perform language identification and output the ISO code for the detected language in the outputField. If no language was detected, fallback language is used.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

New version. Example of accepted params:
{code}
true title,subject,text,keywords language_s languages false 0.5 no,en,es,dk true title,text false false false meta_content_language,lang en
{code}

The only mandatory parameter is langid.fl

To enable field name mapping, set langid.map=true. It will then map field names for all fields in langid.fl. If the set of fields to map is different from langid.fl, supply langid.map.fl. Those fields will then be renamed with a language suffix equal to the language detected from the langid.fl fields.

If you require detecting languages separately for each field, supply langid.map.individual=true. The supplied fields will then be renamed according to detected language on an individual basis. If the set of fields to detect individually is different from the already supplied langid.fl or langid.map.fl, supply langid.map.individual.fl. The fields listed in langid.map.individual.fl will then be detected individually, while the rest of the mapping fields will be mapped according to global document language.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>
> {code}
> It will then read the text from inputFields name and subject, perform
> language identification and output the ISO code for the detected language in
> the outputField. If no language was detected, fallback language is used.
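The way the field-list parameters in this message fall back to one another can be sketched as below. This is a Python illustration under stated assumptions, not the processor's implementation: `resolve_field_sets` and its helpers are hypothetical names, boolean parameters are assumed to be the strings "true" or "false", and individual detection is assumed to apply only when mapping is enabled.

```python
from typing import Dict, List, Tuple


def split_fields(value: str) -> List[str]:
    """Split a comma-separated field list, ignoring blanks."""
    return [f.strip() for f in value.split(",") if f.strip()]


def is_true(params: Dict[str, str], key: str) -> bool:
    """Treat a missing parameter as false."""
    return params.get(key, "false").lower() == "true"


def resolve_field_sets(params: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (detection fields, mapped fields, individually detected fields)."""
    if "langid.fl" not in params:
        raise ValueError("langid.fl is the only mandatory parameter")
    detect = split_fields(params["langid.fl"])
    map_fields: List[str] = []
    individual: List[str] = []
    if is_true(params, "langid.map"):
        # langid.map.fl overrides the field set to rename; default is langid.fl.
        map_fields = split_fields(params.get("langid.map.fl", "")) or detect
        if is_true(params, "langid.map.individual"):
            # langid.map.individual.fl overrides which fields are detected one
            # by one; default is every mapped field.
            individual = (split_fields(params.get("langid.map.individual.fl", ""))
                          or map_fields)
    return detect, map_fields, individual
```

Mapped fields that are not in the individual list would then take the suffix of the document-level language, as the message describes.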
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Removes mentions of ISO 639.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>
> {code}
> It will then read the text from inputFields name and subject, perform
> language identification and output the ISO code for the detected language in
> the outputField. If no language was detected, fallback language is used.
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

Here's a patch that passes the tests. Note, I modified the Solr base test case to have some new methods to properly call update handlers and then validate the results.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>
> {code}
> It will then read the text from inputFields name and subject, perform
> language identification and output the ISO code for the detected language in
> the outputField. If no language was detected, fallback language is used.
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

I took Jan's and Tommaso's patches and reworked them a bit. It seems to me that there isn't much point in merely identifying the language if you aren't going to do something about it. So, this patch builds on what Jan and Tommaso did and then remaps the input fields to new per-language fields (note, we could make this optional).

I also tried to standardize the input parameters a bit. I dropped the outputField setting and a number of other settings, and I made language detection per input field.

The basic gist of it is that if you input two fields, name and subject, it will detect the language of each field and then attempt to map them to a new field. The new field is made by concatenating the original field name with "_" + the ISO 639 code. For example, if en is the detected language, then the new field for name would be name_en. If that field doesn't exist, it will fall back to the original field (i.e. name).

Left to do:
# Fix the tests. I don't like how we currently test UpdateProcessorChains. It should not require writing your own little piece of update mechanism. You should be able to simply set up the appropriate configuration, hook it into an update handler and then hit that update handler.
# Need to check the license headers, builds, etc.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>
> {code}
> It will then read the text from inputFields name and subject, perform
> language identification and output the ISO code for the detected language in
> the outputField. If no language was detected, fallback language is used.
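The per-field remapping described in this message can be sketched as follows. This is a Python illustration, not the patch's Java code: `remap_fields` is a hypothetical name, the language detector is passed in as a plain function so that nothing here pretends to be Tika's API, and the schema is modelled as a set of known field names.

```python
from typing import Callable, Dict, Iterable, Set


def remap_fields(doc: Dict[str, str],
                 input_fields: Iterable[str],
                 detect: Callable[[str], str],
                 schema_fields: Set[str]) -> Dict[str, str]:
    """Move each input field to '<name>_<lang>' when that field exists in the schema."""
    out = dict(doc)
    for name in input_fields:
        if name not in doc:
            continue
        # Language is detected separately for every input field.
        lang = detect(doc[name])
        target = f"{name}_{lang}"
        if lang and target in schema_fields:
            out[target] = out.pop(name)
        # Otherwise fall back to the original field name, leaving it untouched.
    return out
```

For example, with a schema that defines `name_en` but no `subject_no`, an English `name` value moves to `name_en` while a Norwegian `subject` value stays in `subject`.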
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Description:
We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
{code:xml}
name,subject language_s id en
{code}
It will then read the text from inputFields name and subject, perform language identification and output the ISO code for the detected language in the outputField. If no language was detected, fallback language is used.

  was:
We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";] in an UpdateProcessor. The processor should be configured like this:
{code:xml}
title,teaser,body language language_display
{code}

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The
> processor is configurable like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>
> {code}
> It will then read the text from inputFields name and subject, perform
> language identification and output the ISO code for the detected language in
> the outputField. If no language was detected, fallback language is used.
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Attachment: SOLR-1979.patch

First raw patch implementing language identification.

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we should wrap the [Nutch
> LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
> in an UpdateProcessor. The processor should be configured like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> title,teaser,body
> language
> language_display
>
> {code}
[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
[ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
--
    Description:
We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";] in an UpdateProcessor. The processor should be configured like this:
{code:xml}
title,teaser,body language language_display
{code}

  was:
We need the ability to detect language of some random text in order to act upon it, such as indexing the content into language aware fields. Another usecase is to be able to filter/facet on language on random unstructured content.

To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";] in an UpdateProcessor. The processor should be configured like this:
{{monospaced}}
title,teaser,body language language_display
{{monospaced}}

> Create LanguageIdentifierUpdateProcessor
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Priority: Minor
>
> We need the ability to detect language of some random text in order to act
> upon it, such as indexing the content into language aware fields. Another
> usecase is to be able to filter/facet on language on random unstructured
> content.
> To do this, we should wrap the [Nutch
> LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
> in an UpdateProcessor. The processor should be configured like this:
> {code:xml}
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> title,teaser,body
> language
> language_display
>
> {code}