[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-23 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Description: 
Provide components to enable Apache UIMA automatic metadata extraction to be 
exploited when indexing documents.
The purpose of this is to get unstructured information inside a document and 
create structured metadata (as fields) to enrich each document.

Basically this can be done with a custom UpdateRequestProcessor which triggers 
UIMA while indexing documents.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
(with a tokenizer and an hidden Markov model tagger), named entities, language, 
suggested category, keywords and concepts (exploiting external services from 
OpenCalais and AlchemyAPI). Such an implementation can be easily extended 
adding or selecting different UIMA analysis engines, both from UIMA 
repositories on the web or creating new ones from scratch.

More information can be found on dedicated the wiki page: 
http://wiki.apache.org/solr/SolrUIMA

  was:
Provide components to enable Apache UIMA automatic metadata extraction to be 
exploited when indexing documents.
The purpose of this is to get unstructured information inside a document and 
create structured metadata (as fields) to enrich each document.

Basically this can be done with a custom UpdateRequestProcessor which triggers 
UIMA while indexing documents.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
(with a tokenizer and an hidden Markov model tagger), named entities, language, 
suggested category, keywords and concepts (exploiting external services from 
OpenCalais and AlchemyAPI). Such an implementation can be easily extended 
adding or selecting different UIMA analysis engines, both from UIMA 
repositories on the web or creating new ones from scratch.


 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, 
 SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and an hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended adding or selecting different UIMA analysis engines, both 
 from UIMA repositories on the web or creating new ones from scratch.
 More information can be found on dedicated the wiki page: 
 http://wiki.apache.org/solr/SolrUIMA
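
As a rough illustration of the idea described above (an update processor that enriches each document with extracted fields before passing it on), here is a minimal sketch. DocProcessor, EnrichingProcessor and the toy extractor are hypothetical stand-ins invented for this example; they are not the Solr or UIMA API, and the real module delegates extraction to UIMA analysis engines.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical minimal processor chain; real Solr processors have a different API.
abstract class DocProcessor {
    private final DocProcessor next;

    DocProcessor(DocProcessor next) {
        this.next = next;
    }

    // Default behaviour: hand the document to the next processor, if any.
    void processAdd(Map<String, Object> doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }
}

// Adds extracted metadata to the document as extra fields, then continues the chain.
class EnrichingProcessor extends DocProcessor {
    private final String sourceField;
    private final Function<String, Map<String, Object>> extractor;

    EnrichingProcessor(String sourceField,
                       Function<String, Map<String, Object>> extractor,
                       DocProcessor next) {
        super(next);
        this.sourceField = sourceField;
        this.extractor = extractor;
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        Object text = doc.get(sourceField);
        if (text != null) {
            // Structured metadata derived from unstructured text becomes new fields.
            doc.putAll(extractor.apply(text.toString()));
        }
        super.processAdd(doc);
    }
}

class EnrichingProcessorDemo {
    public static void main(String[] args) {
        // Toy extractor (assumption): counts whitespace-separated tokens.
        Function<String, Map<String, Object>> toyExtractor = text -> {
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("token_count", text.trim().split("\\s+").length);
            return fields;
        };
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", "Apache UIMA meets Solr");
        new EnrichingProcessor("text", toyExtractor, null).processAdd(doc);
        System.out.println(doc);
    }
}
```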

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-23 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Description: 
Provide components to enable Apache UIMA automatic metadata extraction to be 
exploited when indexing documents.
The purpose of this is to get unstructured information inside a document and 
create structured metadata (as fields) to enrich each document.

Basically this can be done with a custom UpdateRequestProcessor which triggers 
UIMA while indexing documents.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
(with a tokenizer and an hidden Markov model tagger), named entities, language, 
suggested category, keywords and concepts (exploiting external services from 
OpenCalais and AlchemyAPI). Such an implementation can be easily extended 
adding or selecting different UIMA analysis engines, both from UIMA 
repositories on the web or creating new ones from scratch.

More information can be found on the dedicated wiki page: 
http://wiki.apache.org/solr/SolrUIMA

  was:
Provide components to enable Apache UIMA automatic metadata extraction to be 
exploited when indexing documents.
The purpose of this is to get unstructured information inside a document and 
create structured metadata (as fields) to enrich each document.

Basically this can be done with a custom UpdateRequestProcessor which triggers 
UIMA while indexing documents.
The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
(with a tokenizer and an hidden Markov model tagger), named entities, language, 
suggested category, keywords and concepts (exploiting external services from 
OpenCalais and AlchemyAPI). Such an implementation can be easily extended 
adding or selecting different UIMA analysis engines, both from UIMA 
repositories on the web or creating new ones from scratch.

More information can be found on dedicated the wiki page: 
http://wiki.apache.org/solr/SolrUIMA


 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, 
 SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and an hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended adding or selecting different UIMA analysis engines, both 
 from UIMA repositories on the web or creating new ones from scratch.
 More information can be found on the dedicated wiki page: 
 http://wiki.apache.org/solr/SolrUIMA




[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-23 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-2129:
--

Fix Version/s: 4.0
   3.1

Tommaso, thanks for resolving all the items brought up in comments.


 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, 
 SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-11 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version-6.patch

Each UIMAException (which covers both ResourceInitializationException and 
AnalysisEngineProcessException) is now rethrown wrapped in a RuntimeException. 
The processAdd method signature has to match the one in the superclass, so the 
UIMAException cannot be declared in the UIMAUpdateRequestProcessor method 
signature.
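
The wrapping described above might look roughly like the following sketch. AnalysisException, BaseProcessor and AnalyzingProcessor are hypothetical stand-ins for UIMAException, UpdateRequestProcessor and UIMAUpdateRequestProcessor; the code in the patch may differ.

```java
// Hypothetical stand-ins: these are NOT the real Solr or UIMA classes.
class AnalysisException extends Exception {      // plays the role of UIMAException
    AnalysisException(String message) {
        super(message);
    }
}

abstract class BaseProcessor {                   // plays the role of UpdateRequestProcessor
    // The superclass signature does not declare AnalysisException.
    abstract void processAdd(String document);
}

class AnalyzingProcessor extends BaseProcessor {
    // An override may not add new checked exceptions, so the checked
    // AnalysisException is wrapped in an unchecked RuntimeException.
    @Override
    void processAdd(String document) {
        try {
            analyze(document);
        } catch (AnalysisException e) {
            throw new RuntimeException(e);       // original exception kept as the cause
        }
    }

    // Toy analysis step (assumption): fails on empty input.
    private void analyze(String document) throws AnalysisException {
        if (document.isEmpty()) {
            throw new AnalysisException("nothing to analyze");
        }
    }
}

class ExceptionWrappingDemo {
    public static void main(String[] args) {
        BaseProcessor processor = new AnalyzingProcessor();
        processor.processAdd("some text");       // completes normally
        try {
            processor.processAdd("");
        } catch (RuntimeException e) {
            System.out.println("cause: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Callers that need the original failure can still reach it through getCause().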

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version-6.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, 
 SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-08 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version-5.patch

Changes are:
# replace StringBuffer with StringBuilder in UIMAUpdateRequestProcessor
# make the getAE method in OverridingParamAEProvider synchronized to support 
concurrent requests to the provider
# make the getAEProvider method in AEProviderFactory synchronized and make the 
cache core-aware: each core now has an AEProvider for each analysis engine's 
path
# the UIMAUpdateRequestProcessor constructor now accepts a SolrCore as a 
parameter instead of a SolrConfig object

I tested it with multiple cores and concurrent updates for each core.
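
A minimal sketch of what a synchronized, core-aware provider cache could look like. EngineProvider and EngineProviderCache are hypothetical names used only for illustration, and keying the cache by core name is an assumption; the patch's AEProviderFactory may be organized differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an analysis engine provider; not the patch's class.
class EngineProvider {
    final String coreName;
    final String descriptorPath;

    EngineProvider(String coreName, String descriptorPath) {
        this.coreName = coreName;
        this.descriptorPath = descriptorPath;
    }
}

class EngineProviderCache {
    // Outer key: core name. Inner key: analysis engine descriptor path.
    private final Map<String, Map<String, EngineProvider>> byCore = new HashMap<>();

    // synchronized so that concurrent update requests cannot create two providers
    // for the same (core, path) pair or corrupt the plain HashMaps.
    synchronized EngineProvider getProvider(String coreName, String descriptorPath) {
        return byCore
                .computeIfAbsent(coreName, name -> new HashMap<>())
                .computeIfAbsent(descriptorPath, path -> new EngineProvider(coreName, path));
    }
}

class EngineProviderCacheDemo {
    public static void main(String[] args) {
        EngineProviderCache cache = new EngineProviderCache();
        EngineProvider a = cache.getProvider("core0", "/desc/AE.xml");
        EngineProvider b = cache.getProvider("core0", "/desc/AE.xml");
        EngineProvider c = cache.getProvider("core1", "/desc/AE.xml");
        System.out.println(a == b);  // same core and path: cached instance is reused
        System.out.println(a == c);  // different core: a separate instance
    }
}
```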

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version-5.patch, SOLR-2129-version2.patch, 
 SOLR-2129-version3.patch, SOLR-2129.patch, SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2011-01-03 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-2129:
--

Attachment: SOLR-2129.patch

Patch synced to trunk.

I also adjusted some minor things: the tests no longer rely on CWD, I added 
an assume to the tests (with a set timeout) in case you have no internet 
connection, removed troublesome XML includes as these depend on CWD, etc.

I reviewed the code and have no problem committing this to contrib so future 
iterations can be from svn. Any objections?


 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch, 
 SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2010-12-08 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version3.patch

Here is a new patch with an updated contrib/uima/build.xml that includes 
resources in the generated package.
There is also a small README inside to guide configuration.

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2010-11-14 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-version2.patch

Huge Solr-UIMA refactoring, including injecting the following information from 
the uimaConfig tag inside solrconfig:

1. added dynamic field mapping with the following syntax:
<fieldMapping>
  <type name="org.apache.uima.jcas.tcas.Annotation">
    <map feature="coveredText" field="tag"/>
  </type>
  <type name="org.apache.uima.jcas.tcas.AnotherAnnotationType">
    <map feature="featureName" field="anotherField"/>
  </type>
</fieldMapping>

2. added AnalysisEngine descriptor path (must be inside the classpath):
<analysisEngine>/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</analysisEngine>

3. added the fields whose values are to be analyzed, optionally merging their 
values so that UIMA runs only once:
<analyzeFields merge="false">text,title</analyzeFields>

Runtime parameters for overriding the parameters of delegate AEs remain 
the same:
<runtimeParameters>
  <keyword_apikey>VALID_ALCHEMYAPI_KEY</keyword_apikey>
  <concept_apikey>VALID_ALCHEMYAPI_KEY</concept_apikey>
  <lang_apikey>VALID_ALCHEMYAPI_KEY</lang_apikey>
  <cat_apikey>VALID_ALCHEMYAPI_KEY</cat_apikey>
  <oc_licenseID>VALID_OPENCALAIS_KEY</oc_licenseID>
</runtimeParameters>
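
To illustrate the merge attribute of analyzeFields, here is a hedged sketch of how the texts handed to the analysis could be selected. The class and method names are hypothetical, and joining merged values with a single space is an assumption; the module itself may combine the values differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AnalyzeFieldsSketch {
    // Returns the texts that would each be handed to one analysis run.
    static List<String> textsToAnalyze(Map<String, String> document,
                                       String analyzeFields, boolean merge) {
        List<String> values = new ArrayList<>();
        for (String field : analyzeFields.split(",")) {
            String value = document.get(field.trim());
            if (value != null) {
                values.add(value);
            }
        }
        if (!merge) {
            return values;                          // one run per field value
        }
        List<String> merged = new ArrayList<>();
        if (!values.isEmpty()) {
            merged.add(String.join(" ", values));   // a single run over the joined text
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Solr and UIMA");
        doc.put("text", "Metadata extraction while indexing.");
        // merge=false: each listed field is analyzed on its own.
        System.out.println(textsToAnalyze(doc, "text,title", false));
        // merge=true: the listed fields are joined and analyzed once.
        System.out.println(textsToAnalyze(doc, "text,title", true));
    }
}
```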

These changes should make the module much easier to use and more flexible.
Looking forward to your feedback.
Tommaso

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version2.patch, SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2010-09-24 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129.patch

Patch to port the solr-uima GC project to a solr/contrib module

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
 Attachments: SOLR-2129.patch






[jira] Updated: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2010-09-24 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2129:
--

Attachment: SOLR-2129-asf-headers.patch

Same patch plus required ASF headers on code and xml

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
 Attachments: SOLR-2129-asf-headers.patch, SOLR-2129.patch


