Re: Can Solr analyze content and find dates and places

SUJIT PAL Mon, 11 Feb 2013 13:34:16 -0800

Cool! Thanks for the update, this will help if I ever go all the way with UIMA 
and Solr.


-sujit

On Feb 11, 2013, at 12:13 PM, jazz wrote:

> Hi Sujit,
> 
> Thanks for your help! I moved the RoomNumberAnnotator.xml to the top level of 
> the jar and used the same solrconfig.xml (with the /). Now it works perfectly.
> 
> Best regards, Bart
> 
> 
> On 11 Feb 2013, at 20:13, SUJIT PAL wrote:
> 
>> Hi Bart,
>> 
>> Like I said, I didn't actually hook my UIMA stuff into Solr; content and 
>> queries are annotated before they reach Solr. What you describe sounds like 
>> a classpath problem (but of course you already knew that :-)). Since I 
>> haven't actually done what you are trying to do, here are some suggestions, 
>> they may or may not work...
>> 
>> 1) package up the XML files into your custom JAR at the top level, that way 
>> you don't need to specify it as /RoomNumberAnnotator.xml.
>> 2) if you are using solr4, then you should drop your custom JAR into 
>> $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib.
>> 
>> -sujit
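
A minimal sketch of why the leading slash and the top level of the JAR go together. It assumes the descriptor is loaded as a classpath resource, which the thread suggests but does not confirm; the class name ClasspathProbe is made up for illustration and is not part of Solr or UIMA.

```java
import java.net.URL;

final class ClasspathProbe {
    /** Returns the URL of the resource, or null if it is not visible on the classpath. */
    static URL locate(String absoluteName) {
        // With Class#getResource, a leading '/' means "relative to the classpath root",
        // i.e. the top level of a JAR or classes directory.
        return ClasspathProbe.class.getResource(absoluteName);
    }

    public static void main(String[] args) {
        // The default name is the one from the thread; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "/RoomNumberAnnotator.xml";
        URL url = locate(name);
        System.out.println(url == null
                ? name + " not found on classpath"
                : name + " -> " + url);
    }
}
```

Running this with the custom JAR on the classpath shows whether the descriptor is visible under that name. It only checks the JVM classpath, not Solr's own lib-directory loading.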
>> 
>> On Feb 11, 2013, at 9:40 AM, jazz wrote:
>> 
>>> Hi Sujit and others who answered my question,
>>> 
>>> I have been working on the UIMA path which seems great with the available 
>>> Eclipse tooling and this:
>>> 
>>> http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html
>>> 
>>> Now I worked through the UIMA tutorial of the RoomNumberAnnotator: 
>>> http://uima.apache.org/doc-uima-annotator.html
>>> And I am able to test it using the UIMA CAS Visual Debugger. So far so 
>>> good.
>>> 
>>> But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot 
>>> find the xml file and the Java class (they are in the correct lib 
>>> directories, because the WhitespaceTokenizer works fine).
>>> 
>>> <updateRequestProcessorChain name="uima">
>>>    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
>>>      <lst name="uimaConfig">
>>>        <lst name="runtimeParameters">
>>>        </lst>
>>>        <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
>>>        <bool name="ignoreErrors">false</bool>
>>>        <lst name="analyzeFields">
>>>          <bool name="merge">false</bool>
>>>          <arr name="fields">
>>>            <str>content</str>
>>>          </arr>
>>>        </lst>
>>>        <lst name="fieldMappings">
>>>          <lst name="type">
>>>            <str name="name">org.apache.uima.tutorial.RoomNumber</str>
>>>            <lst name="mapping">
>>>              <str name="feature">building</str>
>>>              <str name="field">UIMAname</str>
>>>            </lst>
>>>          </lst>
>>>        </lst>
>>>      </lst>
>>>    </processor>
>>>    <processor class="solr.LogUpdateProcessorFactory" />
>>>    <processor class="solr.RunUpdateProcessorFactory" />
>>> </updateRequestProcessorChain>
>>> 
>>> On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it 
>>> fails:
>>> Deploy new jars inside one of the lib directories
>>> 
>>> Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima 
>>> path.
>>> 
>>> Do I need to deploy the new jar (RoomAnnotator.jar)? If yes, which 
>>> branch should I check out? This is the stable release I am running:
>>> 
>>> Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36
>>> 
>>> Regards, Bart
>>> 
>>> 
>>> On 8 Feb 2013, at 22:11, SUJIT PAL wrote:
>>> 
>>>> Hi Bart,
>>>> 
>>>> I did some work with UIMA but this was to annotate the data before it goes 
>>>> to Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked 
>>>> through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and 
>>>> I believe you will have to set up your own aggregate analysis chain in 
>>>> place of the one currently configured.
>>>> 
>>>> Writing UIMA annotators is very simple (there is a tutorial here:  
>>>> [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]).
>>>>  You provide the XML description for the annotation and let UIMA generate 
>>>> the annotation bean. You write Java code for the annotator and also the 
>>>> annotator XML descriptor. UIMA uses the annotator XML descriptor to 
>>>> instantiate and run your annotator. Overall, it sounds really complicated, but 
>>>> it's actually quite simple.
>>>> 
>>>> The tutorial has quite a few examples that you will find useful, but in 
>>>> case you need more, I have some on this github repository:
>>>> [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima]
>>>> 
>>>> The dictionary and pattern annotators may be similar to what you are 
>>>> looking for (date and city annotators).
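
To make the idea of a pattern annotator concrete without pulling in UIMA, here is a minimal plain-Java sketch. It is not code from the linked repository. A UIMA annotation is essentially a typed span with begin and end offsets into the document text; the classes Span and PatternSpanFinder and the day-month regex are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A span of text, loosely modelled on the begin/end offsets of an annotation. */
final class Span {
    final int begin;      // offset of the first character
    final int end;        // offset one past the last character
    final String covered; // the text between begin and end

    Span(int begin, int end, String covered) {
        this.begin = begin;
        this.end = end;
        this.covered = covered;
    }
}

final class PatternSpanFinder {
    // Assumed input format: day-month such as "16-Jan" (English month abbreviations only).
    private static final Pattern DAY_MONTH = Pattern.compile(
            "\\b(\\d{1,2})-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\b");

    /** Returns one Span per match, in document order. */
    static List<Span> find(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher m = DAY_MONTH.matcher(text);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end(), m.group()));
        }
        return spans;
    }
}
```

In a real annotator the loop body would create an annotation object and add it to the CAS instead of collecting Span instances.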
>>>> 
>>>> Best regards,
>>>> Sujit
>>>> 
>>>> On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote:
>>>> 
>>>>> Hi Alex,
>>>>> 
>>>>> Indeed that is exactly what I am trying to achieve using worldcities. Date 
>>>>> will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But 
>>>>> how do I integrate the Java library as UIMA? The documentation about 
>>>>> changing schema.xml and solr.xml is not very detailed. 
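
The date step described above (16-Jan becomes 16-Jan-2013) can be sketched with java.time alone. This is a minimal sketch under assumptions: the year is supplied by the caller (for example from the document's own date), the class name DateCompleter is hypothetical, and writing the result into a dynamic Solr field is left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.ResolverStyle;
import java.util.Locale;

final class DateCompleter {
    // STRICT resolution makes impossible dates fail instead of being adjusted silently.
    private static final DateTimeFormatter FULL =
            DateTimeFormatter.ofPattern("d-MMM-uuuu", Locale.ENGLISH)
                    .withResolverStyle(ResolverStyle.STRICT);

    /**
     * Appends the given year to a "day-Mon" string and validates the result.
     * Throws java.time.format.DateTimeParseException for input such as "31-Feb".
     */
    static String complete(String dayMonth, int year) {
        LocalDate date = LocalDate.parse(dayMonth + "-" + year, FULL);
        return date.format(FULL);
    }
}
```

For example, DateCompleter.complete("16-Jan", 2013) yields "16-Jan-2013".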
>>>>> 
>>>>> Regards, Bart
>>>>> 
>>>>> On 8 Feb 2013, at 16:57, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Bart,
>>>>>> 
>>>>>> I haven't done any UIMA work (I used other stuff for my NLP phase), so not
>>>>>> sure I can help much further. But in general, you are venturing into pure
>>>>>> research territory here.
>>>>>> 
>>>>>> Even for dates, what do you actually mean? Just fixed expressions? Relative
>>>>>> dates (e.g. last Tuesday)? What about times (7pm)?
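
Relative dates differ from fixed expressions in that they need a reference date. A minimal sketch, assuming the reference date is known (for example the document's sent date) and that "last Tuesday" means the most recent Tuesday strictly before it, which is itself one interpretation among several. The class name RelativeDates is hypothetical, and times such as 7pm are deliberately not handled.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;
import java.util.Locale;

final class RelativeDates {
    /** Resolves "last <weekday>" against a reference date; returns null for anything else. */
    static LocalDate resolve(String expression, LocalDate reference) {
        String[] parts = expression.trim().toUpperCase(Locale.ENGLISH).split("\\s+");
        if (parts.length != 2 || !parts[0].equals("LAST")) {
            return null; // not a "last <weekday>" expression
        }
        try {
            DayOfWeek day = DayOfWeek.valueOf(parts[1]);
            // previous() always moves strictly before the reference date.
            return reference.with(TemporalAdjusters.previous(day));
        } catch (IllegalArgumentException notAWeekday) {
            return null;
        }
    }
}
```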
>>>>>> 
>>>>>> Same with cities. If you want it offline, you need the gazetteer and
>>>>>> disambiguation modules. A gazetteer for cities (worldwide) is huge and has a
>>>>>> lot of duplicate names (Paris, Ontario is apparently a short drive from
>>>>>> London, Ontario, eh?). Something like
>>>>>> http://www.maxmind.com/en/worldcities? And disambiguation usually
>>>>>> requires a training corpus that is similar to what your text will look like.
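
The duplicate-name problem is easy to see even in a toy gazetteer. A minimal sketch with four hand-written entries, not data from any real gazetteer file; the class name TinyGazetteer is hypothetical. Lookup returns every candidate, and choosing between them is the disambiguation step this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TinyGazetteer {
    // Lower-cased city name -> every known place with that name.
    private final Map<String, List<String>> candidatesByName = new HashMap<>();

    void add(String name, String qualifiedPlace) {
        candidatesByName
                .computeIfAbsent(name.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                .add(qualifiedPlace);
    }

    /** All places sharing this name; more than one entry means disambiguation is needed. */
    List<String> lookup(String name) {
        List<String> hits = candidatesByName.get(name.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(hits);
    }

    /** Hand-written sample entries for illustration only. */
    static TinyGazetteer sample() {
        TinyGazetteer g = new TinyGazetteer();
        g.add("Paris", "Paris, France");
        g.add("Paris", "Paris, Ontario, Canada");
        g.add("London", "London, United Kingdom");
        g.add("London", "London, Ontario, Canada");
        return g;
    }
}
```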
>>>>>> 
>>>>>> Online services like OpenCalais are backed by gigantic databases and some
>>>>>> serious corpus-trained machine-learning disambiguation algorithms.
>>>>>> 
>>>>>> So, no plug-and-play solution here. If you really need to get this done, I
>>>>>> would recommend narrowing down the specification of exactly what you will
>>>>>> settle for and looking for software that can do it. Once you have that,
>>>>>> integration with Solr is your next - and smaller - concern.
>>>>>> 
>>>>>> Regards,
>>>>>> Alex.
>>>>>> 
>>>>>> Personal blog: http://blog.outerthoughts.com/
>>>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>>>> - Time is the quality of nature that keeps events from happening all at
>>>>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 8, 2013 at 10:41 AM, jazz <jazzsa...@me.com> wrote:
>>>>>> 
>>>>>>> Thanks Alex,
>>>>>>> 
>>>>>>> I checked the documentation but it seems there is only a webservice
>>>>>>> (OpenCalais) available to extract dates and places.
>>>>>>> 
>>>>>>> http://uima.apache.org/sandbox.html
>>>>>>> 
>>>>>>> Do you know if there is a Solr-compatible UIMA add-on which detects dates
>>>>>>> and places (cities) without a webservice? If not, how do you write one?
>>>>>>> 
>>>>>>> Regards, Bart
>>>>>>> 
>>>>>>> On 8 Feb 2013, at 15:29, Alexandre Rafalovitch wrote:
>>>>>>> 
>>>>>>>> Yes, it is possible. You are looking at UIMA or OpenNLP integration, most
>>>>>>>> probably in an Update Request Processor pipeline.
>>>>>>>> 
>>>>>>>> Have a look here as a start: https://wiki.apache.org/solr/SolrUIMA
>>>>>>>> 
>>>>>>>> You will have to put some serious work into this; it is not all tied
>>>>>>>> together and packaged. Mostly because Natural Language Processing (the
>>>>>>>> field you are getting into) is kind of messy all of its own.
>>>>>>>> 
>>>>>>>> Good luck,
>>>>>>>> Alex.
>>>>>>>> 
>>>>>>>> Personal blog: http://blog.outerthoughts.com/
>>>>>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>>>>>> - Time is the quality of nature that keeps events from happening all at
>>>>>>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Feb 8, 2013 at 9:24 AM, jazz <jazzsa...@me.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I want to know if Solr can analyze text and recognize dates and places. If
>>>>>>>>> yes, is it then possible to create new dynamic fields with these dates and
>>>>>>>>> places (e.g. city)?
>>>>>>>>> 
>>>>>>>>> Thanks, Bart
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>> 
>>> 
>> 
>
