I've been testing Tom's nice S3/EC2 patch on a couple of EC2/S3 machines. The
Injector fails to inject URLs, because fs.rename() in line 145 of
CrawlDb.java deletes the whole content and only renames the parent
folder from x to current. Basically, crawl_dir/crawldb/current will be an
empty folder after renaming.
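
For reference, here is a minimal sketch of the kind of install step the report
points at, assuming the Hadoop FileSystem API of that era; the class and method
names are illustrative, not the exact contents of CrawlDb.java around line 145:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrawlDbInstallSketch {
  // Promote a freshly written crawldb to crawl_dir/crawldb/current.
  public static void install(FileSystem fs, Path crawlDb, Path newDb)
      throws IOException {
    Path current = new Path(crawlDb, "current");
    Path old = new Path(crawlDb, "old");
    if (fs.exists(current)) {
      fs.rename(current, old);  // keep the previous version around as "old"
    }
    // This is the rename that reportedly misbehaves on S3: only the parent
    // directory is renamed, so "current" ends up without its child files.
    fs.rename(newDb, current);
    if (fs.exists(old)) {
      fs.delete(old);           // discard the previous version
    }
  }
}

If S3's rename really moves just the directory entry, a workaround would be to
check the boolean returned by rename() instead of assuming it succeeded, and to
copy the children explicitly when it fails.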
> The task trackers still fetch together, though I have only
> 3 sites in the fetchlist.
>
> The task trackers fetch the same pages...
>
> I have used the latest build from Hadoop trunk.
>
> Gal.
>
>
> On Fri, 2006-02-24 at 14:15 -0800, Doug Cutting wrote:
> > Mike Smith wrote:
>
Hi,
This problem is a killer! I've been struggling with it for about a month!
It doesn't happen all the time, but because of this problem the largest crawl
I have ever been able to complete is about 1 million pages. I have three
machines: 3 datanodes, 1 data replica, and 1 jobtracker. Here is what I get:
nameserver
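
(For reference, the setup described above corresponds roughly to the following
Hadoop settings; a hedged sketch, with placeholder host names that are not from
this thread. Normally these properties live in conf/hadoop-site.xml:)

import org.apache.hadoop.conf.Configuration;

public class ClusterConfSketch {
  public static Configuration create() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "namenode.example.com:9000");      // placeholder host:port
    conf.set("mapred.job.tracker", "jobtracker.example.com:9001"); // placeholder host:port
    conf.set("dfs.replication", "1");  // the "1 data replicate" from the report
    return conf;
  }
}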
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363587 ]

Mike Smith commented on NUTCH-136:
----------------------------------
I have had the same problem. Florent suggested using "protocol-http" instead
of "protocol-httpclient", and this fixed the problem.