Hello all, I would like to propose the development of a Temporal Extraction addon. In the industry I work in, there is a need to support search of documents/entities by location and date mentions within the document text. I feel pretty good about the GeoEntityLinker addon for providing geocoding, but now I need to do date extraction.
The addon I propose would take text and return a real java.util.Date together with a precision, likely stored in an extended Span object. Initially, I would like it to handle year, season, month, and day level references. I don't care so much about day-of-week mentions and the like; this is geared more towards supporting search and other datetime-related analytics.

I have done this before to some degree, and my research points to a few different approaches:

1. Purely regex-based extraction, followed by a series of rules for cleaning the results.
   Pros: no training, simple configuration, predictable output.
   Cons: the regexes get confusing as they mature, and regexes are not context specific.

2. Machine learning (the current OpenNLP model/NER can do pretty well).
   Pros: based on user data (if trained on it), adaptive, etc.
   Cons: the resulting strings are unpredictable and hard to deal with.

3. A combination of regex extraction and ML, in which the regex results are highly specific and are used to annotate sentences for building a model.
   Pros: the model is based on regex results over user data, adaptive, more recall than option 1, more predictable results than option 2.
   Cons: laborious processing (run regex extraction, produce annotations, build a model, etc.), and you still have to deal with unpredictable results.

My recommendation is option 3. I would like to write a regex-based extractor that stands alone, and also write an impl for the modelbuilder-addon that uses the regex-based extractor to create annotations for the model building process in the modelbuilder-addon (which automates annotation and model building based on user-defined "known entities" and sentences). Option 3 would also provide "simple" and "advanced" versions of temporal extraction.

This is a complex process, so let us know if you see utility in this, and please provide any insights.

Sorry for the long email.

Thanks,
Mark G
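
P.S. To make option 1 a bit more concrete, here is a rough sketch of what the standalone regex extractor could look like. Everything in it is hypothetical: TemporalSketch, TemporalSpan and Precision are placeholder names (TemporalSpan only stands in for the extended Span, it is not the OpenNLP class), the single pattern covers just a few English forms, and mapping seasons to Northern Hemisphere start months is an assumption I made for illustration.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a regex-only temporal extractor; not an existing OpenNLP API. */
class TemporalSketch {

    enum Precision { YEAR, SEASON, MONTH, DAY }

    /** Stand-in for an extended Span: character offsets plus a Date and its precision. */
    static final class TemporalSpan {
        final int start;
        final int end;
        final Date date;
        final Precision precision;

        TemporalSpan(int start, int end, Date date, Precision precision) {
            this.start = start;
            this.end = end;
            this.date = date;
            this.precision = precision;
        }
    }

    private static final List<String> MONTHS = Arrays.asList(
        "january", "february", "march", "april", "may", "june", "july",
        "august", "september", "october", "november", "december");

    // Optional "Month [day,]" or "season [of]" prefix, then a required four-digit year.
    private static final Pattern DATE = Pattern.compile(
        "\\b(?:(?<month>" + String.join("|", MONTHS) + ")\\s+(?:(?<day>\\d{1,2}),\\s*)?"
        + "|(?<season>spring|summer|fall|autumn|winter)\\s+(?:of\\s+)?)?"
        + "(?<year>(?:19|20)\\d{2})\\b",
        Pattern.CASE_INSENSITIVE);

    /** Returns one span per date mention found; dates are midnight UTC at the start of the period. */
    static List<TemporalSpan> extract(String text) {
        List<TemporalSpan> spans = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            int year = Integer.parseInt(m.group("year"));
            int month = 1;
            int day = 1;
            Precision precision = Precision.YEAR;
            if (m.group("month") != null) {
                month = MONTHS.indexOf(m.group("month").toLowerCase(Locale.ROOT)) + 1;
                precision = Precision.MONTH;
                if (m.group("day") != null) {
                    day = Integer.parseInt(m.group("day"));
                    precision = Precision.DAY;
                }
            } else if (m.group("season") != null) {
                month = seasonStartMonth(m.group("season"));
                precision = Precision.SEASON;
            }
            try {
                LocalDate local = LocalDate.of(year, month, day);
                Date date = Date.from(local.atStartOfDay(ZoneOffset.UTC).toInstant());
                spans.add(new TemporalSpan(m.start(), m.end(), date, precision));
            } catch (DateTimeException e) {
                // Cleanup rule: drop impossible dates such as "February 31, 2013".
            }
        }
        return spans;
    }

    // Assumption: Northern Hemisphere meteorological seasons.
    private static int seasonStartMonth(String season) {
        switch (season.toLowerCase(Locale.ROOT)) {
            case "spring": return 3;
            case "summer": return 6;
            case "winter": return 12;
            default: return 9; // "fall" or "autumn"
        }
    }
}
```

Calling TemporalSketch.extract(...) on a sentence gives back one TemporalSpan per mention, each with a Date and a precision. For option 3, the same start/end offsets could presumably be used to mark up sentences as training annotations for the modelbuilder-addon, though I have not worked out that part yet.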
