During fetching, OutlinkExtractor.getOutlinks() finds lots of junk, such as the following:

  rdf:about=
  xmlns:pdf=
  http://ns.adobe.com/pdf/1.3/
  pdf:Producer
  pdf:Producer
  rdf:Description
  rdf:Description
  rdf:about=
  xmlns:xap=
  http://ns.adobe.com/xap/1.0/
  xap:CreatorTool
  xap:CreatorTool
  xap:ModifyDate
  T14:43:23-07:00
This happens because URL_PATTERN, as currently defined, matches strings that are not web links. Is there a fix for this? Is there a way to restrict outlinks to specific protocols (e.g. http, https), so that only links using one of those protocols are treated as outlinks? I'm using the 0.9-dev code.

Thanks,
--
AJ Chen, PhD
http://web2express.org
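
P.S. To make the idea concrete, here is a minimal sketch of the kind of protocol filter I have in mind. It is not Nutch code: the class ProtocolOutlinkFilter and its methods are hypothetical names, and I am assuming it would be applied to the candidate strings that getOutlinks() extracts, rather than changing URL_PATTERN itself. It only uses java.net.URI from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Nutch): keeps only candidate links
 * whose scheme is in a configured whitelist and that carry a host.
 */
class ProtocolOutlinkFilter {

    private final Set<String> allowedSchemes = new HashSet<String>();

    ProtocolOutlinkFilter(String... schemes) {
        for (String s : schemes) {
            allowedSchemes.add(s.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true only for parseable URIs with a whitelisted scheme and a host. */
    boolean accept(String candidate) {
        try {
            URI uri = new URI(candidate.trim());
            String scheme = uri.getScheme();
            // Opaque strings such as "rdf:about=" parse as URIs but have no host.
            return scheme != null
                    && allowedSchemes.contains(scheme.toLowerCase(Locale.ROOT))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    /** Returns the accepted candidates in their original order. */
    List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<String>();
        for (String c : candidates) {
            if (accept(c)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

With new ProtocolOutlinkFilter("http", "https"), strings like rdf:about=, pdf:Producer and T14:43:23-07:00 would be dropped. Note that the XMP namespace URIs (http://ns.adobe.com/pdf/1.3/ and similar) would still pass, since they really are http URLs, so a protocol whitelist alone would reduce the junk but not remove all of it.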