Thanks, I'll look into it. I have never really tried that level of
granularity, though, so I'll have to figure out what you just told
me to do!  Hah.
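
On the classloading angle, here is a minimal sketch of a check that
could be run both from the command line and from inside the web app, to
compare what each environment actually sees. It uses only JDK calls. The
class name ClasspathCheck and the two resource names (nutch-default.xml
and nutch-site.xml, which I am assuming are the config files picked up
from the classpath) are illustrative and not confirmed anywhere in this
thread.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: reports what a class loader can see.
class ClasspathCheck {

    /**
     * Maps each resource name to the URL it resolves to through the given
     * loader. A null value means that loader cannot see the resource.
     */
    static Map<String, URL> locate(ClassLoader loader, String... names) {
        Map<String, URL> found = new LinkedHashMap<>();
        for (String name : names) {
            found.put(name, loader.getResource(name));
        }
        return found;
    }

    /** Lists the class names of the loader and its parents, one per line. */
    static String describeChain(ClassLoader loader) {
        StringBuilder sb = new StringBuilder();
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            sb.append(cl.getClass().getName()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In a servlet container the context loader is usually the web app's
        // own loader; fall back to the system loader if none is set.
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        if (ctx == null) {
            ctx = ClassLoader.getSystemClassLoader();
        }
        System.out.print(describeChain(ctx));

        // Assumed resource names; substitute whatever the crawl really loads.
        locate(ctx, "nutch-default.xml", "nutch-site.xml")
            .forEach((name, url) -> System.out.println(name + " -> " + url));
    }
}
```

If the web app reports null for a file that the command-line run
resolves, or resolves it from a different location, that would point at
the container's class loader rather than at the crawl itself.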



On 4/2/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> Briggs wrote:
> > nutch 0.7.2
> >
> > I have 2 scenarios (both using the exact same configurations):
> >
> > 1) Running the crawl tool from the command line:
> >
> >    ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
> >
> > 2) Running the crawl tool from a web app somewhere in code like:
> >
> >    final String[] args = new String[]{
> >        "-local", "/tmp/urlfile.txt",
> >        "-dir", "/tmp/somedir",
> >        "-depth", "5"};
> >
> >    CrawlTool.main(args);
> >
> >
> > When I run the first scenario, I may get thousands of pages, but when
> > I run the second scenario my results vary wildly: I might get 0, 1,
> > 10+, or 100+ pages.  But I rarely ever get a good crawl from
> > within a web application.  So, there are many things that could be
> > going wrong here....
> >
> > 1) Is there some sort of parsing issue?  An XML parser, regex,
> > timeouts... something?  Not sure.  But it just won't crawl as well as
> > the 'standalone mode'.
> >
> > 2) Is it a bad idea to use many concurrent CrawlTools, or even to reuse
> > a crawl tool (more than once) within an instance of a JVM?  It seems to
> > have problems doing this. I am thinking there are some static
> > references that don't really like handling such use. But this is just
> > a wild accusation that I am not sure of.
> >
> >
> >
> Checking out the logs might help in this case. From my experience, I can
> say that there can be classloading problems with the crawl running
> in a servlet container. I suggest you also try running the crawl
> stepwise, by first running inject, generate, fetch, etc.
>
>
>
>


-- 
"Conscious decisions by conscious minds are what make reality real"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
