Jp Mutch wrote:
>
> My questions are regarding crawling and testing/searching:
> Due to my local requirements, initially I just need to run all of nutch
> on a single machine in its local filesystem, without really needing
> Hadoop or DFS [I don't mind if they are running "under the hood"].
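For a single-machine setup like this, the Hadoop layer under Nutch 0.8 already defaults to local execution, so no daemons need to be started. A minimal sketch of the relevant properties in `conf/nutch-site.xml` (these mirror the stock defaults of the Hadoop version bundled with Nutch at the time; shown here only for illustration):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- "local" keeps all data on the local filesystem instead of DFS -->
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
  <!-- "local" runs MapReduce jobs in-process, with no JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```

With these values, the crawl tools read and write ordinary local directories; Hadoop still runs "under the hood", but only as a library.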
>
> From looking at the code, it doesn't look like anyone is setting the
> modifiedTime in the CrawlDatum. Is this a bug? I guess we can roughly derive
> the modifiedTime by looking at the fetchTime and possibly fetchInterval based
> on status. But if the modifiedTime field is there in CrawlDatum I
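The fallback the poster describes can be sketched in a few lines: when the stored modified time was never set (zero), fall back to the last fetch time. This is a hypothetical illustration, not the actual CrawlDatum API; the class and method names are invented:

```java
// Hypothetical sketch of the derivation described above;
// not the actual org.apache.nutch.crawl.CrawlDatum API.
public class ModifiedTimeFallback {

    /**
     * Returns the stored modified time if it was ever set (non-zero),
     * otherwise falls back to the last fetch time.
     */
    public static long effectiveModifiedTime(long modifiedTime, long fetchTime) {
        return modifiedTime != 0L ? modifiedTime : fetchTime;
    }

    public static void main(String[] args) {
        // Never set: fall back to the fetch time.
        System.out.println(effectiveModifiedTime(0L, 1000L));
        // Set by a previous fetch: keep it.
        System.out.println(effectiveModifiedTime(500L, 1000L));
    }
}
```

This only approximates a real modification time; a page refetched with an unchanged signature would still report its fetch time unless status is also consulted, as the poster suggests.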
[ http://issues.apache.org/jira/browse/NUTCH-367?page=all ]
Sami Siren resolved NUTCH-367.
--
Fix Version/s: 0.9.0
Resolution: Fixed
Assignee: Sami Siren
I just committed a fix for this together with a testcase, thanks for reporting it.
[ http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ]
Doug Cook commented on NUTCH-364:
-
I've been looking into this a little bit. I see two problems:
(1) The current "two pass" heuristic URL-like string extractor has
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Sami Siren resolved NUTCH-105.
--
Resolution: Fixed
This is now committed, thanks!
> Network error during robots.txt fetch causes file to be ignored
> ---
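The failure mode in the title above, a network error making the crawler behave as if no robots.txt existed, comes down to distinguishing "fetched and absent" from "fetch failed". A hedged sketch of that decision (hypothetical names, not the actual committed fix):

```java
// Hypothetical sketch: distinguish a missing robots.txt from a failed fetch.
// Not the actual Nutch code; all names here are illustrative only.
public class RobotsFetchPolicy {

    public enum FetchResult { OK, NOT_FOUND, NETWORK_ERROR }

    /**
     * A 404 legitimately means "no rules, crawling is allowed".
     * A network error means the rules are unknown, so be conservative:
     * deny for now and retry the robots.txt fetch later.
     */
    public static boolean mayCrawl(FetchResult robotsFetch, boolean rulesAllow) {
        switch (robotsFetch) {
            case OK:            return rulesAllow; // obey the parsed rules
            case NOT_FOUND:     return true;       // no robots.txt => allowed
            case NETWORK_ERROR: return false;      // unknown rules => defer
            default:            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(mayCrawl(FetchResult.NETWORK_ERROR, true)); // false
        System.out.println(mayCrawl(FetchResult.NOT_FOUND, false));    // true
    }
}
```

Treating a transient error the same as a missing file is what made the rules get silently ignored; the conservative branch avoids that.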
[ http://issues.apache.org/jira/browse/NUTCH-368?page=comments#action_12435710 ]
Andrzej Bialecki commented on NUTCH-368:
-
> IMO a place for stuff like this is in hadoop more than nutch and i would like
> to see this implemented there.