Bart,

For Apache Nutch we built a date extractor that uses regular expressions to 
find sequences that resemble dates, then passes the extracted candidates 
through a list of Java date formats together with the identified language 
(DateFormat is locale-aware). With it we can extract many exotic dates from 
arbitrary text in many languages.
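
To make the idea concrete, here is a minimal sketch of that approach. It is 
not the code from the Nutch patch: the class name, the single regular 
expression and the two date formats are assumptions chosen for illustration, 
and a real extractor would need many more of each.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Solr.
public class DateCandidateExtractor {

    // Assumed candidate pattern: day, month name, four-digit year, separated
    // by spaces or hyphens, e.g. "9 February 2013" or "08-Feb-2013".
    private static final Pattern CANDIDATE =
        Pattern.compile("\\b\\d{1,2}[ -]\\p{L}+[ -]\\d{4}\\b");

    // Assumed format list; each candidate is tried against every entry.
    private static final String[] FORMATS = { "d MMMM yyyy", "d-MMM-yyyy" };

    // Returns every candidate in the text that parses as a date for the locale.
    public static List<Date> extract(String text, Locale locale) {
        List<Date> dates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            Date date = parse(matcher.group(), locale);
            if (date != null) {
                dates.add(date);
            }
        }
        return dates;
    }

    // Tries each format in turn; returns null if none consumes the whole candidate.
    static Date parse(String candidate, Locale locale) {
        for (String format : FORMATS) {
            // The locale decides which month names the format recognises.
            SimpleDateFormat sdf = new SimpleDateFormat(format, locale);
            sdf.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date date = sdf.parse(candidate, pos);
            if (date != null && pos.getIndex() == candidate.length()) {
                return date;
            }
        }
        return null;
    }
}
```

Calling DateCandidateExtractor.extract(text, Locale.ENGLISH) returns the 
parsed dates in the order they appear. The regular expression is deliberately 
loose, so it also picks up non-dates such as "12 monkeys 2013"; those are 
dropped because no format can parse them.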

An older but working patch with example date formats and regular expressions 
exists for Apache Nutch. The relevant parts of the code should be easy to 
implement in your application if you're using Java.

https://issues.apache.org/jira/browse/NUTCH-1414

If you're handling multiple languages, locale information is very important. 
The same goes for a UIMA annotator.

Cheers,
Markus
 
 
-----Original message-----
> From:Bart Rijpers <jazzsa...@me.com>
> Sent: Fri 08-Feb-2013 17:51
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr analyze content and find dates and places
> 
> Hi Alex,
> 
> Indeed that is exactly what I am trying to achieve using worldcities. Date 
> will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do 
> I integrate the Java library as UIMA? The documentation about changing 
> schema.xml and solr.xml is not very detailed. 
> 
> Regards, Bart
> 
> On 8 Feb 2013, at 16:57, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> 
> > Hi Bart,
> > 
> > I haven't done any UIMA work (I used other stuff for my NLP phase), so not
> > sure I can help much further. But in general, you are venturing into pure
> > research territory here.
> > 
> > Even for dates, what do you actually mean? Just fixed expressions? Relative
> > dates (e.g. last Tuesday)? What about times (7pm)?
> > 
> > Same with cities. If you want it offline, you need the gazetteer and
> > disambiguation modules. Gazetteer for cities (worldwide) is huge and has a
> > lot of duplicate names (Paris, Ontario is apparently a short drive from
> > London, Ontario eh?). Something like
> > http://www.maxmind.com/en/worldcities? And disambiguation usually
> > requires training corpus that is similar to
> > what your text will look like.
> > 
> > Online services like OpenCalais are backed by gigantic databases and some
> > serious corpus-trained machine-learning disambiguation algorithms.
> > 
> > So, no plug-and-play solution here. If you really need to get this done, I
> > would recommend narrowing down the specification of exactly what you will
> > settle for and looking for software that can do it. Once you have that,
> > integration with Solr is your next - and smaller - concern.
> > 
> > Regards,
> >   Alex.
> > 
> > Personal blog: http://blog.outerthoughts.com/
> > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > - Time is the quality of nature that keeps events from happening all at
> > once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> > 
> > 
> > On Fri, Feb 8, 2013 at 10:41 AM, jazz <jazzsa...@me.com> wrote:
> > 
> >> Thanks Alex,
> >> 
> >> I checked the documentation but it seems there is only a webservice
> >> (OpenCalais) available to extract dates and places.
> >> 
> >> http://uima.apache.org/sandbox.html
> >> 
> >> Do you know if there is a Solr-compatible UIMA add-on which detects dates
> >> and places (cities) without a webservice? If not, how do you write one?
> >> 
> >> Regards, Bart
> >> 
> >> On 8 Feb 2013, at 15:29, Alexandre Rafalovitch wrote:
> >> 
> >>> Yes, it is possible. You are looking at UIMA or OpenNLP integration, most
> >>> probably in Update Request Processor pipeline.
> >>> 
> >>> Have a look here as a start: https://wiki.apache.org/solr/SolrUIMA
> >>> 
> >>> You will have to put some serious work into this, it is not all tied
> >>> together and packaged. Mostly because the Natural Language Processing
> >>> (the field you are getting into) is kind of messy all of its own.
> >>> 
> >>> Good luck,
> >>>   Alex.
> >>> 
> >>> Personal blog: http://blog.outerthoughts.com/
> >>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >>> - Time is the quality of nature that keeps events from happening all at
> >>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> >>> 
> >>> 
> >>> On Fri, Feb 8, 2013 at 9:24 AM, jazz <jazzsa...@me.com> wrote:
> >>> 
> >>>> Hi,
> >>>> 
> >>>> I want to know if Solr can analyze text and recognize dates and places.
> >>>> If yes, is it then possible to create new dynamic fields with these
> >>>> dates and places (e.g. city)?
> >>>> 
> >>>> Thanks, Bart
> >>>> 
> >> 
> >> 
> 
