Solr UIMA Notes

Eric Pugh Fri, 10 Aug 2012 13:41:20 -0700

Hi all,

I've been working through the SolrUIMA demo, and based on that experience I 
have some changes to propose to make the UIMA stuff more accessible to a new 
user.  Since JIRA is down, I thought I would email my notes to the list and see 
if anyone can clarify my questions.


Eric


1) The class org.apache.lucene.analysis.uima.ae.OverridingParamsAEProvider 
specifically mentions that it is used to take params supplied by Solr's 
solrconfig.xml and feed them into the AnalysisEngine.  The class has no Solr 
imports, so it could be used with anything, which makes it odd that the 
phrasing for a Lucene class refers to Solr.  Changing the phrasing from 
"injecting runtime parameters defined in the solrconfig.xml Solr configuration 
file" to "injecting runtime parameters such as those defined in the Solr 
solrconfig.xml configuration file" might make the intent clearer and explain 
why it isn't in a Solr package, even though we have a Solr contrib module for 
UIMA.

2) The tests org.apache.solr.uima.analysis.UIMAAnnotationsTokenizerFactoryTest 
and UIMATypeAwareAnnotationsTokenizerFactoryTest test code that is in the 
o.a.lucene structure, but with all the overhead of using Solr.  There is no 
corresponding test in the o.a.lucene path for those factory classes.  

3) When going through the http://wiki.apache.org/solr/SolrUIMA/ tutorial, it's 
very odd that you flip from the wiki page to content that is stored in SVN and 
back as you follow the directions, especially since the bits of sample config 
in SVN aren't used by tests or anything else.  I'd like to move them to just 
the wiki, so they are easier to edit and keep up to date.

4) When looking at the test files, we have annotation engines with names like 
"org.apache.solr.uima.ts.SentimentAnnotation".  However, they don't exist as 
classes in the main source tree!  And when you go down the rabbit hole, you 
eventually end up at a Java class called 
org.apache.solr.uima.processor.an.DummySentimentAnnotator that actually is the 
aforementioned annotator!  I'd like to change the test code so that we are at 
least using something called 
"org.apache.solr.uima.ts.DummySentimentAnnotation" or even 
"org.apache.solr.uima.processor.an.DummySentimentAnnotator"!  I got very 
excited that the out-of-the-box demo had sentiment analysis, and it really 
didn't, just some mock code.

5) It appears that when you pass a multivalued field through to UIMA, only the 
first value is actually submitted for analysis.  If my XML (solr.xml from the 
example docs) looks like:

  <field name="features">Advanced Full-Text Search Capabilities using Lucene</field>
  <field name="features">Optimized for High Volume Web Traffic</field>

then what gets processed is only the text "Advanced Full-Text Search 
Capabilities using Lucene"!  I have a separate patch I will submit that uses 
the getFieldValues() method instead of getFieldValue() on a SolrInputDocument.

6) You need to bump the JVM memory allocation to -Xmx1024m -Xms512m, or it 
WILL run out of heap space when running tests.

7) I'd like to move the UIMA xml files etc into the /conf directory, instead of 
accessing the files that are inside the JAR file.  Much easier to hack on.  I 
copied solr/contrib/uima/src/resources/*.xml into 
solr/example/solr/collection1/conf/uima, and access it via:
        <!--str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str-->

        <str name="analysisEngine">solr/${solr.core.instanceDir}/conf/uima/OverridingParamsExtServicesAE.xml</str>

8) It appears that for each annotation type, only the last "feature" defined 
is used.  This doesn't work:
          <lst name="type">
            <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
            <lst name="mapping">
              <str name="feature">language</str>
              <str name="field">language</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
            <lst name="mapping">
              <str name="feature">wikipedia</str>
              <str name="field">language_wikipedia</str>
            </lst>
          </lst>


Okay, I finally figured it out: all the mappings have to go inside a single 
type definition, like this:
            <lst name="mapping">
              <str name="feature">wikipedia</str>
              <str name="field">language_wikipedia</str>
            </lst>
            <lst name="mapping">
              <str name="feature">language</str>
              <str name="field">language</str>
            </lst>
            <lst name="mapping">
              <str name="feature">ethnologue</str>
              <str name="fieldNameFeature">language</str>
              <str name="dynamicField">*_sm</str>
            </lst>
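
Put together, a complete type entry would presumably look like the sketch 
below.  The enclosing <lst name="type"> element and the LanguageFS type name 
are taken from the non-working example above; it assumes the three mappings 
simply nest under that one type, unchanged.

```xml
<lst name="type">
  <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
  <!-- one "mapping" list per feature, all under the same type -->
  <lst name="mapping">
    <str name="feature">wikipedia</str>
    <str name="field">language_wikipedia</str>
  </lst>
  <lst name="mapping">
    <str name="feature">language</str>
    <str name="field">language</str>
  </lst>
  <lst name="mapping">
    <str name="feature">ethnologue</str>
    <str name="fieldNameFeature">language</str>
    <str name="dynamicField">*_sm</str>
  </lst>
</lst>
```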
                


9) I'd like to patch the default solrconfig.xml to include the UIMA jars, and 
move the config files over to /conf/uima, and then just comment out the 
example.  Do we think that this is a good thing?  Since you have to have an 
AlchemyAPI key, we could just have the code do the sentence parsing as the 
example, and comment out the AlchemyAPI keys in solrconfig.xml.  Or, just leave 
them in the source tree, and document the steps?
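
As a rough sketch of what "including the UIMA jars" could look like in 
solrconfig.xml: the <lib> directive is standard solrconfig.xml syntax, but the 
relative paths and the jar-name regex below are assumptions about the example 
layout, not something confirmed here, and would need adjusting to the actual 
build.

```xml
<!-- Assumed paths, relative to the core's instance dir; adjust as needed -->
<lib dir="../../../contrib/uima/lib" />
<!-- Assumed name pattern for the Solr UIMA contrib jar -->
<lib dir="../../../dist/" regex="apache-solr-uima-\d.*\.jar" />
```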





-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Apache Solr 3 Enterprise Search Server available from 
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book    