Hello Mark,

That issue will go into opennlp-tools in the next release.

To make it show up in our automatically generated issue list, the fix version has
to be set to 1.6.0.

Can you please reopen the issue, set the "Fix Version" to 1.6.0 and close it again?

Thanks,
Jörn

On 03/11/2014 11:53 AM, Mark Giaconia (JIRA) wrote:
      [ https://issues.apache.org/jira/browse/OPENNLP-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Giaconia closed OPENNLP-643.
---------------------------------

     Resolution: Fixed

Initial impl committed, continuous improvement will occur over time.

Provide default rule based (regex) name finders (phone num, url, email, coords)
-------------------------------------------------------------------------------

                 Key: OPENNLP-643
                 URL: https://issues.apache.org/jira/browse/OPENNLP-643
             Project: OpenNLP
          Issue Type: New Feature
          Components: Name Finder
    Affects Versions: 1.6.0
            Reporter: Mark Giaconia
            Assignee: Mark Giaconia
            Priority: Minor

It would be nice if OpenNLP came with some basic rule-based name finders
(RegexNameFinders) for basic types. Initially I would like to create an engine
that runs phone number, email, URL, MGRS, and DD lat/lon patterns.
We also need a framework for loading additional regexes beyond the
defaults mentioned above.
Here is my initial thought: a class that holds a set of default types and
patterns in a map and runs the RegexNameFinder, with optional constructors to
override the map or read from a config file.
Let me know what you think...
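import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;
import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

/**
  * Constructs a set of RegexNameFinders from configuration or from a provided Map.
  */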
public class RuleBasedEntityFinderEngine {
   private static final String PHONE_REGEX = "";
   private static final String EMAIL_REGEX = "";
   private static final String URL_REGEX = "";
   private static final String MGRS_REGEX = "";
   private static final String DDLATLON_REGEX = "";
   private static final String PHONE_REGEX_TYPE = "phone number";
   private static final String EMAIL_REGEX_TYPE = "email";
   private static final String URL_REGEX_TYPE = "url";
   private static final String MGRS_REGEX_TYPE = "MGRS coord";
   private static final String DDLATLON_REGEX_TYPE = "DD coord";
   private Map<String, Pattern[]> typePatternMap = new HashMap<>();
   Properties properties;
   /**
    * Loads a set of patterns via configuration. The file should have the entity
    * type with no spaces, followed by the regex. For types that have multiple
    * regexes, repeat the type on each line, for example:
    *
    *   phone_num <phonenum regex1>
    *   phone_num <phonenum regex2>
    *   email <regex1>
    *
    * Entries are loaded in order from top to bottom of the file, so if order
    * matters, list the regexes accordingly.
    *
    * @param properties      the InputStream of properties from which to load the
    *                        regexes
    * @param includeDefaults when true, adds the defaults to the map. If there is
    *                        a key collision in the map, the default will override.
    * @throws IOException if the stream cannot be read
    */
   public RuleBasedEntityFinderEngine(InputStream properties, boolean includeDefaults)
       throws IOException {
     this.properties = new Properties();
     this.properties.load(properties);
     init();
   }
   /**
    * @param typePatternMap  a map of name types (e.g. phone number, email...) to
    *                        an array of regex Patterns. This map is the basis
    *                        for instantiating RegexNameFinders.
    * @param includeDefaults when true, adds the defaults to the map. If there is
    *                        a key collision in the map, the default will override.
    */
   public RuleBasedEntityFinderEngine(Map<String, Pattern[]> typePatternMap,
       boolean includeDefaults) {
     this.typePatternMap = typePatternMap;
     if (includeDefaults) {
       init();
     }
   }
   /**
    * Loads default regexes and types into the map.
    */
   private void init() {
     if (properties != null) {
       // TODO: get the regexes from config somewhere
     } else {
       // load the default regexes
       typePatternMap.put(PHONE_REGEX_TYPE, new Pattern[]{Pattern.compile(PHONE_REGEX)});
       typePatternMap.put(EMAIL_REGEX_TYPE, new Pattern[]{Pattern.compile(EMAIL_REGEX)});
       typePatternMap.put(URL_REGEX_TYPE, new Pattern[]{Pattern.compile(URL_REGEX)});
       typePatternMap.put(MGRS_REGEX_TYPE, new Pattern[]{Pattern.compile(MGRS_REGEX)});
       typePatternMap.put(DDLATLON_REGEX_TYPE, new Pattern[]{Pattern.compile(DDLATLON_REGEX)});
     }
   }
   public Map<String, Span[]> find(String[] tokens) {
     Map<String, Span[]> outSpans = new HashMap<>();
     if (typePatternMap != null) {
       for (Map.Entry<String, Pattern[]> finder : typePatternMap.entrySet()) {
         RegexNameFinder nf = new RegexNameFinder(finder.getValue(), finder.getKey());
         Span[] spans = nf.find(tokens);
         outSpans.put(finder.getKey(), spans);
       }
     }
     return outSpans;
   }
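   public Map<String, Pattern[]> getTypePatternMap() {
     // Plain accessor: calling init() here would re-add the defaults on every
     // call, even when includeDefaults was false.
     return typePatternMap;
   }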
}
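
A note on the config-loading TODO in init(): java.util.Properties keeps only one value per key, so a file that repeats the type on several lines (as described in the constructor Javadoc) would lose all but the last regex for that type. One possible way around this, offered only as a rough sketch, is to read the file line by line. The class name RegexConfigSketch and the method load below are hypothetical and not part of OpenNLP; skipping blank lines and '#' comment lines is likewise an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: reads "<type> <regex>" lines.
public class RegexConfigSketch {

  /**
   * Parses lines of the form "type regex" into a map from type to patterns.
   * Types and their patterns keep the order in which they appear in the input.
   */
  public static Map<String, Pattern[]> load(Reader in) throws IOException {
    Map<String, List<Pattern>> grouped = new LinkedHashMap<>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      // Assumption: blank lines and lines starting with '#' are ignored.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      // The type has no spaces, so the first whitespace run separates it
      // from the regex; the regex itself may contain spaces.
      String[] parts = line.split("\\s+", 2);
      if (parts.length < 2) {
        throw new IOException("Expected '<type> <regex>' but got: " + line);
      }
      List<Pattern> patterns = grouped.get(parts[0]);
      if (patterns == null) {
        patterns = new ArrayList<>();
        grouped.put(parts[0], patterns);
      }
      patterns.add(Pattern.compile(parts[1]));
    }
    // Convert each list to the Pattern[] shape used by typePatternMap.
    Map<String, Pattern[]> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Pattern>> entry : grouped.entrySet()) {
      List<Pattern> list = entry.getValue();
      result.put(entry.getKey(), list.toArray(new Pattern[list.size()]));
    }
    return result;
  }
}
```

The resulting map has the same shape as typePatternMap, so it could be passed to the Map-based constructor above, for example with a java.io.FileReader over the config file.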


--
This message was sent by Atlassian JIRA
(v6.2#6252)
