Nutch 0.7.2
I have 2 scenarios (both using the exact same configurations):
1) Running the crawl tool from the command line:
./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
2) Running the crawl tool from a web app, with code along these lines:

// Nutch 0.7.x: CrawlTool lives in net.nutch.tools, if I have the package right
import net.nutch.tools.CrawlTool;

final String[] args = new String[] {
    "-local", "/tmp/urlfile.txt",
    "-dir", "/tmp/somedir",
    "-depth", "5" };
CrawlTool.main(args);
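In the webapp that call is actually wrapped so a failure can't vanish inside the
container. A minimal sketch (the method name is just mine, and catching Throwable
is deliberate, purely for diagnosis):

// Diagnostic wrapper around the call above; args as shown earlier.
public static void runCrawl(final String[] args) {
    try {
        CrawlTool.main(args);
    } catch (Throwable t) {
        // Surface anything the servlet container might otherwise swallow.
        t.printStackTrace();
    }
}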
When I run the first scenario, I typically get thousands of pages, but when I
run the second the results vary wildly: sometimes 0 pages, sometimes 1,
sometimes 10+ or 100+. I rarely get a good crawl from within the web
application. So, there are several things that could be going wrong here:
1) Is there some sort of parsing issue? An XML parser, a regex,
timeouts... something? I'm not sure, but it just won't crawl as well as it
does in 'standalone mode'. (One check I've been running for this is
sketched below.)
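To rule out a configuration mismatch here: bin/nutch puts the conf/ directory
on the classpath, while the webapp's classloader may resolve different copies
of the config files (or none at all). This is roughly what I run from inside
the webapp to see what it actually picks up; the NutchConf calls are from
memory of the 0.7 API, so treat this as a sketch:

import net.nutch.util.NutchConf;

// Which copies of the config files does the webapp classloader see?
ClassLoader cl = Thread.currentThread().getContextClassLoader();
System.out.println("nutch-default.xml   -> " + cl.getResource("nutch-default.xml"));
System.out.println("nutch-site.xml      -> " + cl.getResource("nutch-site.xml"));
System.out.println("crawl-urlfilter.txt -> " + cl.getResource("crawl-urlfilter.txt"));

// Effective values of a few settings that directly affect fetch results.
NutchConf conf = NutchConf.get();
System.out.println("http.agent.name       = " + conf.get("http.agent.name"));
System.out.println("http.timeout          = " + conf.get("http.timeout"));
System.out.println("fetcher.threads.fetch = " + conf.get("fetcher.threads.fetch"));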
2) Is it a bad idea to run many concurrent CrawlTools, or even to reuse a
crawl tool more than once within a single JVM? It seems to have problems
doing this. I am thinking there are some static references that don't
really handle such use well, but this is just a wild guess that I haven't
verified.
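If it does turn out to be static state, the fallback I'm considering is
isolating each crawl in its own JVM instead of calling CrawlTool in-process.
A sketch of that (paths are examples; assumes Java 5 for ProcessBuilder):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExternalCrawl {
    /** Run one crawl in a child JVM via bin/nutch, mirroring the CLI scenario. */
    public static int crawl(String urlFile, String dir, int depth)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
            "/opt/nutch-0.7.2/bin/nutch", "crawl",
            "-local", urlFile, "-dir", dir, "-depth", String.valueOf(depth));
        pb.redirectErrorStream(true);  // merge stderr into stdout
        Process p = pb.start();
        BufferedReader r = new BufferedReader(
            new InputStreamReader(p.getInputStream()));
        for (String line; (line = r.readLine()) != null; ) {
            System.out.println(line);  // drain output so the child can't block
        }
        return p.waitFor();
    }
}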
--
"Conscious decisions by conscious minds are what make reality real"