Hi,
in the meantime I've run some more tests using Nutch 2.1 with
DefaultFetchSchedule, where I've put:
modifiedTime = fetchTime;
instead of:
if (modifiedTime <= 0) modifiedTime = fetchTime;
I don't know if this is correct (probably not), but at least 304 now seems
to be handled. In particular, in the protocol-file plugin
(File.getProtocolOutput) I've added a special case for 304:
if (code == 304) { // got a "not modified" response
  return new ProtocolOutput(response.toContent(),
      ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
}
I suppose this is NOT the right solution :-)
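
To make the difference between the two assignments concrete, here is a small
stand-alone sketch. It is not Nutch code: the class ModifiedTimeSketch and all
of its methods are hypothetical names I made up for illustration, and the rule
"answer 304 when the file is not newer than the stored modifiedTime" is my
assumption about what a file protocol could do, not what protocol-file
actually does.

```java
// Hypothetical stand-alone sketch; none of these names are Nutch APIs.
class ModifiedTimeSketch {

    /** Guarded assignment: only fills in modifiedTime when it is missing. */
    static long guarded(long modifiedTime, long fetchTime) {
        if (modifiedTime <= 0) modifiedTime = fetchTime;
        return modifiedTime;
    }

    /** Unconditional assignment from my test: always overwrite with fetchTime. */
    static long unconditional(long modifiedTime, long fetchTime) {
        return fetchTime;
    }

    /** Assumption: a conditional request only makes sense with a positive timestamp. */
    static boolean canSendIfModifiedSince(long modifiedTime) {
        return modifiedTime > 0;
    }

    /** Assumed status choice: 304 if the file is not newer than modifiedTime, else 200. */
    static int statusFor(long fileLastModified, long modifiedTime) {
        if (canSendIfModifiedSince(modifiedTime) && fileLastModified <= modifiedTime) {
            return 304;
        }
        return 200;
    }

    public static void main(String[] args) {
        long fetchTime = 1000L;
        // With modifiedTime stuck at zero no conditional check is possible.
        System.out.println(statusFor(700L, 0L));
        // Once either rule has stored a timestamp, an unchanged file can yield 304.
        System.out.println(statusFor(700L, guarded(0L, fetchTime)));
        // The two rules differ only when a previous modifiedTime already exists.
        System.out.println(guarded(500L, fetchTime) + " vs " + unconditional(500L, fetchTime));
    }
}
```

The two rules give the same result when modifiedTime is still zero; they only
diverge once a previous value has been stored, which is why I am not sure the
unconditional version is safe.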
Anyway, here is another problem I have with protocol-file. My seed is:
file://localhost/tmp/files/
This directory contains a couple of files, aa.txt and bbbbb.txt.
If a file is deleted, the directory is recrawled, and the file is then
re-added, the file is ignored on the next crawl. I mean:
./nutch crawl urls -depth 2 -topN 5
rm /tmp/files/bbbbb.txt
./nutch crawl urls -depth 2 -topN 5
echo "saaaszzz" >/tmp/files/bbbbb.txt
./nutch crawl urls -depth 2 -topN 5
...
Skipping file://localhost/tmp/files/bbbbb.txt; different batch id (null)
...
and the dump still shows:
...
baseUrl: file://localhost/tmp/files/bbbbb.txt
status: 1 (status_unfetched)
...
protocolStatus: EXCEPTION, args=[org.apache.nutch.protocol.file.FileError:
File Error: 404]
What am I doing wrong?
Thanks a lot!
On Thu, Nov 15, 2012 at 7:25 PM, Sebastian Nagel wrote:
> Hi Cesare,
>
> hmhh... Good catch!
>
> The modifiedTime is also set in CrawlDbReducer.reduce
> right after FetchSchedule.setFetchSchedule is called and the signature
> hasn't changed compared to the previous fetch, cf. NUTCH-1341.
>
> At first glance, it looks like the modifiedTime is indeed never set
> with DefaultFetchSchedule.
> I'll have a more detailed look at this and come back soon.
>
> Thanks,
> Sebastian
>
> On 11/15/2012 12:33 PM, Cesare Zavattari wrote:
> > Hi all,
> > the AdaptiveFetchSchedule has the following line:
> >
> > if (modifiedTime <= 0) modifiedTime = fetchTime;
> >
> > which DefaultFetchSchedule lacks. This seems to prevent
> > DefaultFetchSchedule from correctly handling possible 304 responses
> > (modifiedTime seems to always be zero and HttpRequest.java doesn't
> > set the If-Modified-Since request header).
> >
> > This is true for both nutch 1.x and 2.x.
> >
> > Is this the expected behaviour?
> >
> > Thanks
> > Bye
> >
>
>
--
Cesare