Stefan Groschupf wrote:

Hi,
I counted the votes manually; I hope I didn't overlook anything. I didn't filter out issues that are 0.8 related, since it is good to know community wishes anyway. :-)


Shouldn't the period for voting be a bit longer? I didn't have time to vote yet... Anyway, my take on this:


NUTCH-140 Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping
1
NUTCH-139    Standard metadata property names in the ParseData metadata
2

+1

NUTCH-138    non-Latin-1 characters cannot be submitted for search
1
NUTCH-3 multi values of header discarded 1


+1


NUTCH-134 Summarizer doesn't select the best snippets 1


+1
I have some patches that use the Lucene Highlighter package instead.

NUTCH-98    RobotRulesParser interprets robots.txt incorrectly
1
NUTCH-120 one "bad" link on a page kills parsing 3
NUTCH-127    uncorrect values using -du, or ls does not return items
2


+1

NUTCH-126    Fetching via https does not work with a proxy (patch)
1
NUTCH-125 OpenOffice Parser plugin 2


+1. Ready to commit, I'll do it tomorrow.

NUTCH-110    OpenSearchServlet outputs illegal xml characters
1
NUTCH-36 Chinese in Nutch 1
NUTCH-123    Cache.jsp some times generate NullPointerException
1
NUTCH-121 SegmentReader for mapred 2


Nearly ready to commit; I can probably do it by the end of the week. However, this is valid only for the mapred branch, so it doesn't affect the release.

NUTCH-119 Regexp to extract outlinks incorrect 1
NUTCH-115 jobtracker.jsp shows too much information 1
NUTCH-108    tasktracker crashs when reconnecting to a new jobtracker.
1
NUTCH-113    Disable permanent DNS-to-IP caching for JVM 1.4
1
NUTCH-111 ndfs.replication is not documented within the nutch-default.xml configuration file.
1
NUTCH-100 New plugin urlfilter-db 1
NUTCH-106 Datanode corruption 1
NUTCH-95    DeleteDuplicates depends on the order of input segments
1


+1

NUTCH-92 DistributedSearch incorrectly scores results 2


+1. However, solving this correctly is _hard_ ... it's a very similar problem to the MultiSearcher in Lucene, and it took that group quite some time to reach an acceptable solution...

NUTCH-91 empty encoding causes exception 1
NUTCH-52 Parser plugin for MS Excel files 1
NUTCH-74 French Analyzer Plugin 1
NUTCH-64 no results after a restart of a search--server (without tomcat restart)
1
NUTCH-68 A tool to generate arbitrary fetchlists 1
NUTCH-62 Add html META tag information into metaData in index-more plugin
1
NUTCH-61    Adaptive re-fetch interval. Detecting umodified content
1

+1. I think this is an important feature. I have some patches, which need to be updated. However, I wouldn't be so bold as to commit them just before a release. There are quite a few subtle issues with the segment handling if you use this.

NUTCH-13    If dns points to 127.0.0.1, the url is also crawled
1
NUTCH-48    "Did you mean" query enhancement/refignment feature request
1
NUTCH-45 Log corrupt segments in SegmentMergeTool 1
NUTCH-24 Cannot handle incorrectly cased Content-Type 1


Isn't this solved already?

NUTCH-16 boost documents matching a url pattern 1





--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

