Hi,
take a look at org.apache.droids.robot.crawler.CrawlingWorker.
All new tasks retrieved by your parser are checked by the
getFilteredOutlinks method.
If the filters accept a URL, it is added as a new task.
If you just want to accept URLs coming from the specified host, you can
also use the HostFilter.
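
To illustrate the idea, here is a minimal standalone sketch. It does not use the Droids classes and is not the actual CrawlingWorker or HostFilter code. The names UrlFilter, SameHostFilter, PrefixFilter and filterOutlinks are made up for this example. The only thing taken from the thread is the filter(String) contract: return the URL to accept it, null to reject it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkFilterSketch {

    /** Hypothetical stand-in that mirrors the filter(String) contract from the quoted code. */
    public interface UrlFilter {
        String filter(String urlString); // return the URL to accept it, null to reject it
    }

    /** Accepts only URLs on one host (illustrative, not the Droids HostFilter). */
    public static final class SameHostFilter implements UrlFilter {
        private final String host;

        public SameHostFilter(String host) {
            this.host = host;
        }

        @Override
        public String filter(String urlString) {
            try {
                // getHost() is null for opaque or host-less URIs; those are rejected.
                String candidate = new URI(urlString).getHost();
                return host.equalsIgnoreCase(candidate) ? urlString : null;
            } catch (URISyntaxException e) {
                return null; // unparsable URLs are rejected
            }
        }
    }

    /** Accepts only URLs that start with a fixed prefix, as in the quoted filter. */
    public static final class PrefixFilter implements UrlFilter {
        private final String prefix;

        public PrefixFilter(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String filter(String urlString) {
            return urlString.startsWith(prefix) ? urlString : null;
        }
    }

    /** Keeps an outlink only if every filter in the list accepts it. */
    public static List<String> filterOutlinks(List<String> outlinks, List<UrlFilter> filters) {
        List<String> accepted = new ArrayList<>();
        for (String link : outlinks) {
            String current = link;
            for (UrlFilter filter : filters) {
                current = filter.filter(current);
                if (current == null) {
                    break; // first rejection wins
                }
            }
            if (current != null) {
                accepted.add(current);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<UrlFilter> filters = Arrays.asList(
                new SameHostFilter("www.dmoz.org"),
                new PrefixFilter("http://www.dmoz.org/Arts/"));
        List<String> outlinks = Arrays.asList(
                "http://www.dmoz.org/Arts/Music/",
                "http://www.dmoz.org/Science/",
                "http://example.org/Arts/");
        System.out.println(filterOutlinks(outlinks, filters));
    }
}
```

Only links that survive every filter would become new tasks. If your filter never seems to run, a breakpoint in the equivalent of filterOutlinks is the place to check that the parser returns outlinks at all.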
Tobias
On 05/16/2012 09:47 PM, Mansour Al Akeel wrote:
Assuming I need to accept only URLs under http://www.dmoz.org/Arts/ and not go
anywhere else, in DroidsFactory I would do this:
    public static URLFiltersFactory createDefaultURLFiltersFactory() {
        URLFiltersFactory filtersFactory = new URLFiltersFactory();
        URLFilter defaultURLFilter = new URLFilter() {
            private final String prefix = "http://www.dmoz.org/Arts/";

            public String filter(String urlString) {
                if (urlString.startsWith(prefix)) {
                    return urlString;
                }
                return null;
            }
        };
        filtersFactory.getMap().put("default", defaultURLFilter);
        return filtersFactory;
    }
Then I would add it to the droid in the unit test:
    private final CrawlingDroid createDroid(final Queue<Link> queue) {
        final CrawlingDroid droid = new SysoutCrawlingDroid(queue, null);
        final ProtocolFactory protocolFactory = DroidsFactory
                .createDefaultProtocolFactory();
        droid.setProtocolFactory(protocolFactory);
        URLFiltersFactory filtersFactory = DroidsFactory
                .createDefaultURLFiltersFactory();
        droid.setFiltersFactory(filtersFactory);
        final ParserFactory parserFactory = parserSetup();
        droid.setParserFactory(parserFactory);
        return droid;
    }

    @Test
    public void execute_linkIsParsed() throws DroidsException, IOException,
            URISyntaxException {
        final Link link = new LinkTask(null, new URI(searchUrl), 1);
        this.instance.execute(link);
        Mockito.verify(htmlParser).parse(Matchers.any(ContentEntity.class),
                Matchers.any(Link.class));
    }
However, stepping through the code does not show the filter being invoked. Is
there anything else I need to do to make sure it is invoked
properly?