[ 
https://issues.apache.org/jira/browse/OPENNLP-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885550#comment-13885550
 ] 

Joern Kottmann commented on OPENNLP-643:
----------------------------------------

I think it is a really good idea to offer default regexes for the mentioned 
types.

The existing RegexNameFinder could be extended to support multiple types. Or a 
user could use an ensemble of them to detect multiple types. In my opinion we 
should support both.

You are right, we should add support for instantiating the RegexNameFinder 
from some kind of file that contains the patterns, instead of forcing the 
user to build it in code (again, both should be supported).
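
A rough sketch of what file-based loading might look like, assuming the
one-entry-per-line "type regex" format from the issue description.
PatternFileLoader is a made-up name, not an existing OpenNLP class, and
skipping blank lines and '#' comment lines is my own assumption. Note that
java.util.Properties keeps only one value per key and does not preserve file
order, so a plain line reader is used here instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of OpenNLP): parses "type regex" lines. */
final class PatternFileLoader {

  private PatternFileLoader() {}

  /** Reads one "type regex" entry per line and groups the regexes by type. */
  static Map<String, Pattern[]> load(Reader in) throws IOException {
    // LinkedHashMap keeps the types in file order; each list keeps regex order.
    Map<String, List<Pattern>> byType = new LinkedHashMap<>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // assumption: blank lines and '#' comments are ignored
      }
      // Split on the first run of whitespace only: the type has no spaces,
      // but the regex itself may contain them.
      String[] parts = line.split("\\s+", 2);
      if (parts.length < 2) {
        throw new IOException("Expected '<type> <regex>' but got: " + line);
      }
      byType.computeIfAbsent(parts[0], k -> new ArrayList<>())
          .add(Pattern.compile(parts[1]));
    }
    Map<String, Pattern[]> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Pattern>> e : byType.entrySet()) {
      result.put(e.getKey(), e.getValue().toArray(new Pattern[0]));
    }
    return result;
  }
}
```

The resulting map could then be handed to one RegexNameFinder per type, the
way the find method in the draft below does.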

To offer defaults we could create a factory which configures the desired 
RegexNameFinder.

For example:
RegexNameFinder.createDefaultNameFinder(DefaultPatterns.EMAIL, 
DefaultPatterns.URL, DefaultPatterns.PHONE)
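
To make the idea a bit more concrete, here is a self-contained sketch of such
a factory. Everything in it is hypothetical: the DefaultPatterns enum, the
MultiTypeRegexFinder stand-in (used so the example runs without OpenNLP on the
classpath), and above all the regexes, which are simplified placeholders and
not a proposal for the real defaults. It also reports character offsets in a
string, whereas the RegexNameFinder works on tokens and returns Spans.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical enum; the regexes are simplified placeholders only. */
enum DefaultPatterns {
  EMAIL("email", "[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+"),
  URL("url", "https?://\\S+"),
  PHONE("phone number", "\\d{3}-\\d{3}-\\d{4}");

  final String type;
  final Pattern pattern;

  DefaultPatterns(String type, String regex) {
    this.type = type;
    this.pattern = Pattern.compile(regex);
  }
}

/** Stand-in for a multi-type regex finder; not the OpenNLP RegexNameFinder. */
final class MultiTypeRegexFinder {

  private final Map<String, Pattern[]> typeToPatterns;

  MultiTypeRegexFinder(Map<String, Pattern[]> typeToPatterns) {
    this.typeToPatterns = new LinkedHashMap<>(typeToPatterns);
  }

  /** Factory in the spirit of createDefaultNameFinder(...). */
  static MultiTypeRegexFinder createDefault(DefaultPatterns... defaults) {
    Map<String, Pattern[]> map = new LinkedHashMap<>();
    for (DefaultPatterns d : defaults) {
      map.put(d.type, new Pattern[] {d.pattern});
    }
    return new MultiTypeRegexFinder(map);
  }

  /** Returns one "type:start-end" entry per match, using character offsets. */
  List<String> find(String text) {
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, Pattern[]> entry : typeToPatterns.entrySet()) {
      for (Pattern pattern : entry.getValue()) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
          hits.add(entry.getKey() + ":" + matcher.start() + "-" + matcher.end());
        }
      }
    }
    return hits;
  }
}
```

A caller would then write something like
MultiTypeRegexFinder.createDefault(DefaultPatterns.EMAIL, DefaultPatterns.PHONE)
and get results for both types from a single find call. Users with their own
patterns could still use the map-based constructor directly.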

What do you think about that?

> Provide default rule based (regex) name finders (phone num, url, email, 
> coords)
> -------------------------------------------------------------------------------
>
>                 Key: OPENNLP-643
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-643
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Name Finder
>    Affects Versions: 1.6.0
>            Reporter: Mark Giaconia
>            Assignee: Mark Giaconia
>            Priority: Minor
>
> It would be nice if OpenNLP came with some basic rule based namefinders 
> (RegexNameFinders) for basic types. Initially I would like to create an 
> engine that runs phonenum, email, url, MGRS, and DD Lat Lon.
> Also, we need a framework for loading additional regexes other than the 
> defaults mentioned above.
> Here is my initial thought... a class that has a set of default types and 
> patterns in a map that runs the RegexNameFinder, with optional constructors 
> to override the map, or read from a config file.
> Let me know what you think...
> /**
>  *
>  * Constructs a set of RegexNameFinders from configuration or from a provided Map
>  */
> public class RuleBasedEntityFinderEngine {
>   private static final String PHONE_REGEX = "";
>   private static final String EMAIL_REGEX = "";
>   private static final String URL_REGEX = "";
>   private static final String MGRS_REGEX = "";
>   private static final String DDLATLON_REGEX = "";
>   private static final String PHONE_REGEX_TYPE = "phone number";
>   private static final String EMAIL_REGEX_TYPE = "email";
>   private static final String URL_REGEX_TYPE = "url";
>   private static final String MGRS_REGEX_TYPE = "MGRS coord";
>   private static final String DDLATLON_REGEX_TYPE = "DD coord";
>   private Map<String, Pattern[]> typePatternMap = new HashMap<>();
>   Properties properties;
>   /**
>    * Loads a set of patterns via configuration. The file should have the entity
>    * type with no spaces, followed by the regex. For types that have multiple
>    * regexes, duplicate the type on each line. For example:
>    *   phone_num <phonenum regex1>
>    *   phone_num <phonenum regex2>
>    *   email <regex1>
>    * Each entry will be loaded in order from top to bottom of the file, so if
>    * order matters, list regexes accordingly from top to bottom.
>    *
>    * @param properties      the InputStream of props from which to load the
>    *                        regexes
>    * @param includeDefaults when true, adds the defaults to the map. If there is
>    *                        a key collision in the map, the default will override.
>    * @throws IOException
>    */
>   public RuleBasedEntityFinderEngine(InputStream properties, boolean includeDefaults) throws IOException {
>     this.properties = new Properties();
>     this.properties.load(properties);
>     init();
>   }
>   /**
>    *
>    * @param typePatternMap  a map of name types (e.g. phone number, email...) to
>    *                        an array of regex Patterns. This map is the basis
>    *                        for instantiating RegexNameFinders
>    * @param includeDefaults when true, adds the defaults to the map. If there is
>    *                        a key collision in the map, the default will override.
>    */
>   public RuleBasedEntityFinderEngine(Map<String, Pattern[]> typePatternMap, boolean includeDefaults) {
>     this.typePatternMap = typePatternMap;
>     if (includeDefaults) {
>       init();
>     }
>   }
>   /**
>    * Loads default regexes and types into the map
>    */
>   private void init() {
>     if (properties != null) {
>       //get the regexes from config somewhere
>       /**
>        *TODO
>        */
>     } else {
>       typePatternMap.put(PHONE_REGEX_TYPE, new Pattern[]{Pattern.compile(PHONE_REGEX)});
>       typePatternMap.put(EMAIL_REGEX_TYPE, new Pattern[]{Pattern.compile(EMAIL_REGEX)});
>       typePatternMap.put(URL_REGEX_TYPE, new Pattern[]{Pattern.compile(URL_REGEX)});
>       typePatternMap.put(MGRS_REGEX_TYPE, new Pattern[]{Pattern.compile(MGRS_REGEX)});
>       typePatternMap.put(DDLATLON_REGEX_TYPE, new Pattern[]{Pattern.compile(DDLATLON_REGEX)});
>       //load the default regexes
>     }
>   }
>   public Map<String, Span[]> find(String[] tokens) {
>     Map<String, Span[]> outSpans = new HashMap<>();
>     if (typePatternMap != null) {
>       for (Map.Entry<String, Pattern[]> finder : typePatternMap.entrySet()) {
>         RegexNameFinder nf = new RegexNameFinder(finder.getValue(), finder.getKey());
>         Span[] spans = nf.find(tokens);
>         outSpans.put(finder.getKey(), spans);
>       }
>     }
>     return outSpans;
>   }
>   public Map<String, Pattern[]> getTypePatternMap() {
>     // Plain getter: defaults are added at construction time, not on every call
>     return typePatternMap;
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
