[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Description:
Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents. The purpose of this is to get unstructured information inside a document and create structured metadata (as fields) to enrich each document. Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents. The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and an hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended adding or selecting different UIMA analysis engines, both from UIMA repositories on the web or creating new ones from scratch. More information can be found on dedicated the wiki page: http://wiki.apache.org/solr/SolrUIMA

was:
Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents. The purpose of this is to get unstructured information inside a document and create structured metadata (as fields) to enrich each document. Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents. The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and an hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended adding or selecting different UIMA analysis engines, both from UIMA repositories on the web or creating new ones from scratch.
Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch

Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents. The purpose of this is to get unstructured information inside a document and create structured metadata (as fields) to enrich each document. Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents. The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and an hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended adding or selecting different UIMA analysis engines, both from UIMA repositories on the web or creating new ones from scratch. More information can be found on dedicated the wiki page: http://wiki.apache.org/solr/SolrUIMA

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
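The description's core idea, a processor that enriches each document with extracted fields before passing it on, can be sketched roughly as below. This is only an illustration under stated assumptions: `Processor`, `ToyEntityProcessor` and the field names are hypothetical stand-ins, not Solr or UIMA classes, and the "analysis" is a toy rule where the real module would invoke a UIMA analysis engine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins: none of these types are Solr or UIMA classes.
class EnrichingProcessorSketch {

    /** One link in a processing chain; the last link has next == null. */
    static abstract class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        /** Default behaviour: pass the document on unchanged. */
        void processAdd(Map<String, List<String>> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** Toy "analysis": capitalised tokens count as entities. A real module would run UIMA here. */
    static class ToyEntityProcessor extends Processor {
        private final String sourceField;
        private final String targetField;

        ToyEntityProcessor(String sourceField, String targetField, Processor next) {
            super(next);
            this.sourceField = sourceField;
            this.targetField = targetField;
        }

        @Override
        void processAdd(Map<String, List<String>> doc) {
            List<String> entities = new ArrayList<>();
            List<String> texts = doc.getOrDefault(sourceField, Collections.<String>emptyList());
            for (String text : texts) {
                for (String token : text.split("\\s+")) {
                    if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                        entities.add(token);
                    }
                }
            }
            if (!entities.isEmpty()) {
                doc.put(targetField, entities); // structured metadata added as a field
            }
            super.processAdd(doc); // hand the enriched document to the rest of the chain
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("text", Collections.singletonList("indexing with Apache UIMA"));
        new ToyEntityProcessor("text", "entity", null).processAdd(doc);
        System.out.println(doc.get("entity"));
    }
}
```

Swapping the toy rule for a different analysis step corresponds to the description's point about plugging in different analysis engines.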
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Description:
Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents. The purpose of this is to get unstructured information inside a document and create structured metadata (as fields) to enrich each document. Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents. The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and an hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended adding or selecting different UIMA analysis engines, both from UIMA repositories on the web or creating new ones from scratch. More information can be found on the dedicated wiki page: http://wiki.apache.org/solr/SolrUIMA

was:
Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents. The purpose of this is to get unstructured information inside a document and create structured metadata (as fields) to enrich each document. Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents. The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and an hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended adding or selecting different UIMA analysis engines, both from UIMA repositories on the web or creating new ones from scratch. More information can be found on dedicated the wiki page: http://wiki.apache.org/solr/SolrUIMA

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2129:
--
Fix Version/s: 4.0, 3.1

Tommaso, thanks for resolving all the items brought up in comments.

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Fix For: 3.1, 4.0
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Attachment: SOLR-2129-version-6.patch

Each UIMAException (which covers both ResourceInitializationException and AnalysisEngineProcessException) is now rethrown wrapped in a RuntimeException. The processAdd method signature has to match the one in the superclass, so UIMAException cannot be declared in the UIMAUpdateRequestProcessor method signature.

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
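The wrapping described for the version-6 patch follows a common Java pattern: an overriding method cannot add checked exceptions to the inherited signature, so the checked exception is rethrown inside an unchecked one with the cause preserved. A minimal sketch, assuming hypothetical stand-in types (`AnalysisFailedException` plays the role of the checked analysis exception and `BaseProcessor` the role of the superclass; neither is the real UIMA or Solr class):

```java
import java.io.IOException;

class WrapCheckedExceptionSketch {

    /** Stand-in for a checked analysis exception (hypothetical, not the UIMA class). */
    static class AnalysisFailedException extends Exception {
        AnalysisFailedException(String message) {
            super(message);
        }
    }

    /** Stand-in base class whose processAdd signature only allows IOException. */
    static abstract class BaseProcessor {
        abstract void processAdd(String text) throws IOException;
    }

    static class AnalysingProcessor extends BaseProcessor {
        @Override
        void processAdd(String text) throws IOException {
            try {
                analyse(text);
            } catch (AnalysisFailedException e) {
                // The override cannot declare this checked exception,
                // so rethrow it unchecked and keep the original as the cause.
                throw new RuntimeException(e);
            }
        }

        private void analyse(String text) throws AnalysisFailedException {
            if (text == null || text.isEmpty()) {
                throw new AnalysisFailedException("nothing to analyse");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            new AnalysingProcessor().processAdd("");
        } catch (RuntimeException e) {
            System.out.println("cause: " + e.getCause().getMessage());
        }
    }
}
```

Callers that need the original failure can still reach it through `getCause()`.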
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Attachment: SOLR-2129-version-5.patch

Changes are:
1. drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor
2. make the getAE method in OverridingParamAEProvider synchronized to support concurrent requests to the provider
3. make the getAEProvider method in AEProviderFactory synchronized and make the cache core aware: each core now has an AEProvider for each analysis engine's path
4. the UIMAUpdateRequestProcessor constructor accepts SolrCore as a parameter instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version-5.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
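The synchronized, core-aware cache mentioned in the version-5 changes could look roughly like the sketch below. It is an illustration only: `Provider` and `getProvider` are hypothetical names, cores are identified here by a plain string, and the actual AEProviderFactory in the patch may be organised differently.

```java
import java.util.HashMap;
import java.util.Map;

class CoreAwareProviderCacheSketch {

    /** Hypothetical stand-in for an analysis-engine provider. */
    static class Provider {
        final String coreName;
        final String aePath;

        Provider(String coreName, String aePath) {
            this.coreName = coreName;
            this.aePath = aePath;
        }
    }

    // Outer key: core name. Inner key: analysis engine descriptor path.
    private static final Map<String, Map<String, Provider>> CACHE = new HashMap<>();

    /**
     * synchronized so that concurrent callers never create two providers
     * for the same (core, path) pair.
     */
    static synchronized Provider getProvider(String coreName, String aePath) {
        Map<String, Provider> perCore = CACHE.computeIfAbsent(coreName, k -> new HashMap<>());
        return perCore.computeIfAbsent(aePath, p -> new Provider(coreName, p));
    }

    public static void main(String[] args) {
        Provider first = getProvider("core0", "/desc/ae.xml");
        Provider again = getProvider("core0", "/desc/ae.xml");
        Provider other = getProvider("core1", "/desc/ae.xml");
        System.out.println("same core reuses provider: " + (first == again));
        System.out.println("other core gets its own: " + (first != other));
    }
}
```

A method-level lock keeps the sketch simple. Whether finer-grained locking is worth the extra complexity depends on how often providers are requested.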
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2129:
--
Attachment: SOLR-2129.patch

Patch synced to trunk. I also adjusted some minor things: the tests no longer rely on the CWD, I added an assume in the tests (with a set timeout) in case you have no internet connection, removed the troublesome XML includes as these depend on the CWD, etc.

I reviewed the code and have no problem committing this to contrib so future iterations can be done from svn. Any objections?

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch
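A connectivity check with a timeout, of the kind a test assumption could rely on, might be sketched as follows. This is an assumption about the general approach, not the code in the patch: `canConnect` is a hypothetical helper, and the demo connects to a local listener so that it runs without internet access.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ConnectivityCheckSketch {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Unknown host, refused connection and timeout all mean "not reachable".
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Use a local listener so the demo itself needs no internet access.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            String host = server.getInetAddress().getHostAddress();
            boolean reachable = canConnect(host, server.getLocalPort(), 1000);
            System.out.println("reachable: " + reachable);
        }
    }
}
```

In a JUnit 4 test the result could be passed to `Assume.assumeTrue(...)`, so that tests depending on external services are skipped rather than failed when the network is unavailable.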
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Attachment: SOLR-2129-version3.patch

Here is a new patch with an updated contrib/uima/build.xml to include resources in the generated package. There is also a small README inside to guide configuration.

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Attachment: SOLR-2129-version2.patch

Huge Solr-UIMA refactoring, including injecting the following information from the uimaConfig tag inside solrconfig:

1. added dynamic field mapping with the following syntax:

    <fieldMapping>
      <type name="org.apache.uima.jcas.tcas.Annotation">
        <map feature="coveredText" field="tag"/>
      </type>
      <type name="org.apache.uima.jcas.tcas.AnotherAnnotationType">
        <map feature="featureName" field="anotherField"/>
      </type>
    </fieldMapping>

2. added the AnalysisEngine descriptor path (must be inside the classpath):

    <analysisEngine>/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</analysisEngine>

3. added the fields whose values are to be analyzed, optionally merging their values so that UIMA runs only once:

    <analyzeFields merge="false">text,title</analyzeFields>

Runtime parameters for defining overriding parameters for delegate AEs remain the same:

    <runtimeParameters>
      <keyword_apikey>VALID_ALCHEMYAPI_KEY</keyword_apikey>
      <concept_apikey>VALID_ALCHEMYAPI_KEY</concept_apikey>
      <lang_apikey>VALID_ALCHEMYAPI_KEY</lang_apikey>
      <cat_apikey>VALID_ALCHEMYAPI_KEY</cat_apikey>
      <oc_licenseID>VALID_OPENCALAIS_KEY</oc_licenseID>
    </runtimeParameters>

These changes should make the module much easier to use and more flexible.
Looking forward to your feedback.
Tommaso

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129-version2.patch, SOLR-2129.patch
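The effect of the analyzeFields merge flag can be pictured with a small sketch. The semantics here are an assumption drawn from the comment's wording (merge the field values so the analysis runs once, otherwise analyze each field on its own); `textsToAnalyse` is a hypothetical helper and the single-space separator is an arbitrary choice, not something the patch is known to use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AnalyzeFieldsSketch {

    /**
     * Returns the texts an analysis step would receive: one per configured
     * field when merge is false, or a single joined text when merge is true.
     */
    static List<String> textsToAnalyse(Map<String, String> doc, String fieldList, boolean merge) {
        List<String> values = new ArrayList<>();
        for (String name : fieldList.split(",")) {
            String value = doc.get(name.trim());
            if (value != null) {
                values.add(value); // fields missing from the document are skipped
            }
        }
        if (!merge) {
            return values;
        }
        List<String> merged = new ArrayList<>();
        if (!values.isEmpty()) {
            merged.add(String.join(" ", values)); // assumed separator: one space
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("text", "Body of the document.");
        doc.put("title", "A title");
        System.out.println(textsToAnalyse(doc, "text,title", false).size() + " analysis run(s) without merge");
        System.out.println(textsToAnalyse(doc, "text,title", true).size() + " analysis run(s) with merge");
    }
}
```

With merging enabled the analysis is invoked once per document instead of once per field, which matters when the analysis calls external services.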
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Attachment: SOLR-2129.patch

Patch to port solr-uima GC project as a solr/contrib module

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Attachments: SOLR-2129.patch
[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
[ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2129:
--
Attachment: SOLR-2129-asf-headers.patch

Same patch plus required ASF headers on code and xml

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
---
Key: SOLR-2129
URL: https://issues.apache.org/jira/browse/SOLR-2129
Project: Solr
Issue Type: New Feature
Reporter: Tommaso Teofili
Attachments: SOLR-2129-asf-headers.patch, SOLR-2129.patch