[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ]
Stephan Strittmatter updated NUTCH-20:
--------------------------------------

    Attachment: OutlinkExtractor.java

A null anchor caused an NPE. The anchor is now set to an empty String instead.

> Extract urls from plain texts
> ------------------------------
>
>          Key: NUTCH-20
>          URL: http://issues.apache.org/jira/browse/NUTCH-20
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Grroschupf
>     Priority: Trivial
>  Attachments: OutlinkExtractor.java, OutlinkExtractor.java,
>               OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt
>
> Some parsers return no Outlinks, e.g. the Word parser.
> This class extracts (absolute) hyperlinks from a plain String
> (content) and generates Outlinks from them.
> This would be very useful for parsers that have no explicit extraction of
> hyperlinks.
> Example:
> Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");
> This returns an array of Outlinks containing the single element
> "http://www.apache.org".
> ----
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
> submitted by: Stephan Strittmatter
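
For readers without the attachment at hand, the idea described in the issue can be sketched roughly as follows. This is not the attached OutlinkExtractor.java: the class PlainTextLinkSketch, its nested Link type, and the URL pattern are all hypothetical stand-ins chosen for illustration, and the real class may match URLs quite differently. The sketch only shows the two points raised above: pulling absolute http(s) links out of a plain String, and using an empty String rather than null for the anchor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the attached OutlinkExtractor.java.
public class PlainTextLinkSketch {

    // Minimal stand-in for an outlink: a target URL plus anchor text.
    public static final class Link {
        public final String url;
        public final String anchor;

        public Link(String url, String anchor) {
            this.url = url;
            // Never store null, so callers can use the anchor without a null check.
            this.anchor = (anchor == null) ? "" : anchor;
        }
    }

    // Assumed pattern: http(s) scheme followed by characters that are not
    // whitespace, angle brackets, or quotes. A real extractor may be stricter.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s<>\"']+");

    public static List<Link> extractLinks(String plainText) {
        List<Link> links = new ArrayList<>();
        if (plainText == null) {
            return links;
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            String url = stripTrailingPunctuation(matcher.group());
            // Plain text carries no anchor text; Link normalizes null to "".
            links.add(new Link(url, null));
        }
        return links;
    }

    // Drops sentence punctuation that directly follows a URL, e.g. "http://a.org."
    private static String stripTrailingPunctuation(String url) {
        int end = url.length();
        while (end > 0 && ".,;:!?)".indexOf(url.charAt(end - 1)) >= 0) {
            end--;
        }
        return url.substring(0, end);
    }

    public static void main(String[] args) {
        List<Link> links =
                extractLinks("Nutch is located at http://www.apache.org and ...");
        for (Link link : links) {
            System.out.println(link.url + " [anchor=\"" + link.anchor + "\"]");
        }
    }
}
```

Run against the sentence from the issue, the sketch yields a single link for http://www.apache.org with an empty anchor. Normalizing the anchor inside the constructor is one possible way to avoid the NPE mentioned in the update; the attached class may handle it elsewhere.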