[ https://issues.apache.org/jira/browse/OPENNLP-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653128#comment-17653128 ]

ASF GitHub Bot commented on OPENNLP-1428:
-----------------------------------------

jzonthemtn commented on code in PR #473:
URL: https://github.com/apache/opennlp/pull/473#discussion_r1059381083


##########
opennlp-tools/src/main/java/opennlp/tools/util/DownloadUtil.java:
##########
@@ -174,4 +143,82 @@ public static <T extends BaseModel> T downloadModel(URL url, Class<T> type) thro
     }
   }
 
+  @Internal
+  static class DownloadParser {
+
+    private static final Pattern LINK_PATTERN = Pattern.compile("<a href=\\\"(.*?)\\\">(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
+    private final URL indexUrl;
+
+    DownloadParser(URL indexUrl) {
+      Objects.requireNonNull(indexUrl);
+      this.indexUrl = indexUrl;
+    }
+
+    Map<String, Map<ModelType, String>> getAvailableModels() {
+
+      final Matcher matcher = LINK_PATTERN.matcher(fetchPageIndex());
+
+      final List<String> links = new ArrayList<>();
+      while (matcher.find()) {
+        links.add(matcher.group(1));
+      }
+
+      return toMap(links);
+    }
+
+    private Map<String, Map<ModelType, String>> toMap(List<String> links) {
+
+      final Map<String, Map<ModelType, String>> result = new HashMap<>();
+
+      for (String link : links) {

Review Comment:
   Do you have any suggestions on where we should document the structure of the model file names?



> Enhance DownloadUtil to avoid the use of hard-coded model urls
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1428
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1428
>             Project: OpenNLP
>          Issue Type: Improvement
>            Reporter: Richard Zowalla
>            Assignee: Richard Zowalla
>            Priority: Major
>
> As pointed out in https://github.com/apache/opennlp/pull/472, we should not
> rely on hard-coded URLs in DownloadUtil.
> Instead we can parse the content of
> https://dlcdn.apache.org/opennlp/models/ud-models-1.0/ and automatically
> derive the related model files from it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
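
For readers following the issue description, here is a minimal, self-contained sketch of the general idea: scan an index page for anchor tags and resolve the matching hrefs against the index URL. It is not the code from PR #473. The class name `IndexLinkSketch`, the `.bin` suffix filter, and the sample file name `example-model.bin` are assumptions made for illustration only; the actual structure of the model file names is not described in this message.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class IndexLinkSketch {

  // Same idea as LINK_PATTERN in the quoted diff: group 1 is the href value,
  // group 2 is the link text.
  private static final Pattern LINK_PATTERN = Pattern.compile(
      "<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  // Collects every href in the page that ends with the given suffix.
  // Filtering by suffix is an assumption of this sketch.
  static List<String> extractLinks(String html, String suffix) {
    final List<String> links = new ArrayList<>();
    final Matcher matcher = LINK_PATTERN.matcher(html);
    while (matcher.find()) {
      final String href = matcher.group(1);
      if (href.endsWith(suffix)) {
        links.add(href);
      }
    }
    return links;
  }

  // Turns a relative href from the index page into an absolute URI.
  static URI resolve(URI indexUri, String href) {
    return indexUri.resolve(href);
  }

  public static void main(String[] args) {
    // Inline sample page so the sketch runs without network access.
    final String html = "<a href=\"../\">Parent Directory</a>\n"
        + "<a href=\"example-model.bin\">example-model.bin</a>";
    final URI index = URI.create("https://dlcdn.apache.org/opennlp/models/ud-models-1.0/");
    for (String href : extractLinks(html, ".bin")) {
      System.out.println(resolve(index, href));
    }
  }
}
```

Keeping the extraction step separate from the step that interprets file names means the naming convention, wherever it ends up being documented, only has to be encoded in one place.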