Hello all, I would like to propose the development of a Temporal Extraction
addon. In the industry I work in, there is a need to support search of
documents/entities by location and date mentions within the document text.
I feel pretty good about the GeoEntityLinker addon for providing geocoding,
but now I need to do date extraction.

The addon I propose would take text and return a real java.util.Date with
a precision, likely stored in an extended Span object. Initially, I would
like it to handle year-, season-, month-, and day-level references. I don't
care so much about day-of-week mentions and the like; this is geared more
towards supporting search and other datetime-related analytics.
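
To make the "Date plus precision" idea concrete, here is a rough sketch of
what the result object might look like. All names here (TemporalSpan,
Precision, utcDate) are hypothetical, and it is a standalone class rather
than a real extension of the OpenNLP Span so that it compiles on its own.
It also assumes the Date marks the earliest instant the mention covers.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

/** Hypothetical result type: character offsets plus a resolved date and its precision. */
final class TemporalSpan {

    /** Hypothetical granularity levels, coarsest to finest. */
    enum Precision { YEAR, SEASON, MONTH, DAY }

    final int start;           // inclusive character offset of the mention
    final int end;             // exclusive character offset of the mention
    final Date date;           // assumed: earliest instant covered by the mention
    final Precision precision; // how much of the date was actually stated

    TemporalSpan(int start, int end, Date date, Precision precision) {
        this.start = start;
        this.end = end;
        this.date = date;
        this.precision = precision;
    }

    /** Builds a Date at midnight UTC; month is 1-based here, unlike Calendar. */
    static Date utcDate(int year, int month, int day) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                   // zero out the time-of-day fields
        cal.set(year, month - 1, day); // Calendar months are 0-based
        return cal.getTime();
    }

    public static void main(String[] args) {
        // "March 2014" found at offsets [10, 20): only the month is known,
        // so the date is pinned to the first day of that month.
        TemporalSpan span = new TemporalSpan(10, 20, utcDate(2014, 3, 1), Precision.MONTH);
        System.out.println(span.precision + " " + span.date.getTime());
    }
}
```

A search layer could then expand the Date into a range based on the
precision (for example, a MONTH mention covers the whole month).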

I have done this before to some degree, and my research points to three
approaches:
1. Purely regex-based extraction, followed by a series of rules for cleaning
the results.
pros: no training, simple configuration, predictable output
cons: regexes become confusing as they mature, regexes are not context
specific
2. Machine learning (the current opennlp model/NER can do pretty well)
pros: based on user data (if trained on it), adaptive, etc.
cons: the resulting strings are unpredictable and hard to deal with
3. A combination of regex extraction and ML, in which the regex results are
highly specific and are used to annotate sentences for building a model.
pros: model based on regex results on user data, adaptive, more recall than
option 1, more predictable results than option 2
cons: laborious processing (run regex extraction, produce annotations,
build a model, etc.), still have to deal with unpredictable results
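
To give a feel for option 1, here is a minimal sketch of a regex-based
extractor. The class name, the patterns, and the precision labels are all
placeholders I made up for illustration; a real version would need many
more date formats plus the cleanup rules mentioned above. Patterns are
tried from most to least specific, and a match is dropped if it overlaps
one already found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTemporalExtractor {

    /** One matched mention: offsets, matched text, and a precision label. */
    static final class Match {
        final int start;        // inclusive character offset
        final int end;          // exclusive character offset
        final String text;      // the matched substring
        final String precision; // "DAY", "MONTH", "SEASON", or "YEAR"

        Match(int start, int end, String text, String precision) {
            this.start = start;
            this.end = end;
            this.text = text;
            this.precision = precision;
        }
    }

    private static final String MONTH =
        "(?:January|February|March|April|May|June|July|August"
        + "|September|October|November|December)";
    private static final String YEAR = "(?:1[0-9]{3}|20[0-9]{2})";

    // Ordered from most to least specific so that longer mentions win.
    private static final String[] LABELS = {"DAY", "MONTH", "SEASON", "YEAR"};
    private static final Pattern[] PATTERNS = {
        Pattern.compile("\\b" + MONTH + " [0-9]{1,2}, " + YEAR + "\\b"),
        Pattern.compile("\\b" + MONTH + " " + YEAR + "\\b"),
        Pattern.compile("\\b(?:spring|summer|fall|autumn|winter) (?:of )?" + YEAR + "\\b",
                        Pattern.CASE_INSENSITIVE),
        Pattern.compile("\\b" + YEAR + "\\b")
    };

    /** Returns non-overlapping matches, sorted by start offset. */
    static List<Match> extract(String text) {
        List<Match> found = new ArrayList<>();
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(text);
            while (m.find()) {
                if (!overlaps(found, m.start(), m.end())) {
                    found.add(new Match(m.start(), m.end(), m.group(), LABELS[i]));
                }
            }
        }
        found.sort((a, b) -> Integer.compare(a.start, b.start));
        return found;
    }

    /** True if [start, end) intersects any match already collected. */
    private static boolean overlaps(List<Match> found, int start, int end) {
        for (Match f : found) {
            if (start < f.end && f.start < end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Floods hit in the summer of 2010, and again on March 5, 2014.";
        for (Match m : extract(text)) {
            System.out.println(m.precision + ": " + m.text);
        }
        // prints:
        // SEASON: summer of 2010
        // DAY: March 5, 2014
    }
}
```

Note that the bare-year pattern would also match the number in something
like "1500 people", which is exactly the lack of context listed under the
cons of option 1.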

My recommendation is option 3. I would like to write a regex-based
extractor that stands alone, and also write an impl for the
modelbuilder-addon that uses the regex-based extractor to create
annotations for the addon's model-building process (the modelbuilder-addon
automates annotation and model building based on user-defined "known
entities" and sentences). Option 3 would also provide "simple" and
"advanced" versions of temporal extraction.
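
As a rough illustration of the annotation step, the sketch below wraps
regex hits in the <START:type> ... <END> markup that the OpenNLP name
finder training format uses. It does not touch the modelbuilder-addon API
at all; the class name, the pattern, and the "date" type label are
assumptions of mine, and the input is assumed to be one whitespace-tokenized
sentence per call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSentenceAnnotator {

    // Placeholder pattern: an optional month name followed by a 4-digit year.
    static final Pattern MONTH_OR_YEAR = Pattern.compile(
        "\\b(?:(?:January|February|March|April|May|June|July|August"
        + "|September|October|November|December) )?(?:1[0-9]{3}|20[0-9]{2})\\b");

    /**
     * Wraps every match of the pattern in name-finder training markup.
     * The sentence is assumed to be whitespace-tokenized already.
     */
    static String annotate(String sentence, Pattern pattern, String type) {
        Matcher m = pattern.matcher(sentence);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(sentence, last, m.start()); // text before the match
            out.append("<START:").append(type).append("> ")
               .append(m.group())
               .append(" <END>");
            last = m.end();
        }
        out.append(sentence, last, sentence.length()); // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String sentence = "The treaty was signed in March 1999 .";
        System.out.println(annotate(sentence, MONTH_OR_YEAR, "date"));
        // prints: The treaty was signed in <START:date> March 1999 <END> .
    }
}
```

Lines produced this way from user data could then serve as training
sentences for the model-building step.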

This is a complex process, so please let me know if you see utility in it,
and share any insights you have.

Sorry for the long email.

Thanks,
Mark G
