Hi all,
The following issues need to be discussed and appropriate action taken
before the 0.9 release:
Blocker
=======
* NUTCH-400 (Update & add missing license headers) - I believe this is
fixed and should be closed
* NUTCH-353 (pages that serverside forwards will be refetched every
time) - this was partially fixed in NUTCH-273, but a more complete
solution would require significant changes to LinkDb. As there are no
patches implementing this, I left it open, but it's no longer as
critical as it was before. I propose to move it to "Major" and address
it in the next release.
* NUTCH-233 (wrong regular expression hang reduce process for ever) - I
propose to apply the fix provided by Sean Dean and close this issue for now.
Critical
========
* NUTCH-436 (Incorrect handling of relative paths when the embedded URL
path is empty). There is no patch available yet. If someone could
contribute a patch I'd like to see this fixed before the release.
* NUTCH-427 (protocol-smb). This relies on an LGPL library, and it's
certainly not critical (it is an optional new feature). I propose to
change it to Major, and then decide whether or not we want another
plugin like parse-mp3 or parse-rtf.
* NUTCH-381 (Ignore external link not work as expected) - I'll try to
reproduce it, and if I find an easy fix I'd like to apply it before the
release.
* NUTCH-277 (Fetcher dies because of "max. redirects") - I wasn't able
to reproduce it. If there is no updated information on this I propose to
close it with "Can't reproduce".
* NUTCH-167 (Observation of <META NAME="ROBOTS" CONTENT="NOARCHIVE">) -
there's a patch that I tested in a limited production environment. If
there are no objections I'd like to apply it before the release.
Major
=====
There are 84 major issues, but some of them are invalid, should be
"minor", or no longer apply and should be closed. Please review them if
you can, and add comments or recommendations if you have any new
information.
We also need to decide which version of Hadoop to include in the
release. Current trunk uses 0.10.1; I have a set of production-tested
patches that use 0.11.2; and today the Hadoop team released 0.12.0 (to
be followed shortly by 0.12.1, most likely before our release). The
most conservative option is to stay with 0.10.1, but by the time people
start using Nutch this will already be a fairly old version. I propose
to upgrade to 0.11.2. We could use 0.12.1 instead, but then we should
expect to ship a less-than-stable version of Nutch, soon to be followed
by a minor stable release ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com