[jira] Created: (NUTCH-294) Topic-maps of related searchwords

2006-06-02 Thread Stefan Neufeind (JIRA)
Topic-maps of related searchwords - Key: NUTCH-294 URL: http://issues.apache.org/jira/browse/NUTCH-294 Project: Nutch Type: New Feature Components: searcher Reporter: Stefan Neufeind Would it be possible to offer a user

[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] Stefan Groschupf commented on NUTCH-282: Is that related to the host grouping we discussed? If so, can we close this bug? Showing too few results on a page (Paging

[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] Stefan Groschupf commented on NUTCH-286: This is difficult to realize since the HTTP error code is read from the response in the fetcher and set into the protocol

[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] Stefan Groschupf commented on NUTCH-292: +1, Can someone create a clean patch file? OpenSearchServlet: OutOfMemoryError: Java heap space

[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] Stefan Groschupf commented on NUTCH-291: lastModified will only be indexed if you switch on the index-more plugin. If you think you should change the way lastModified

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] Stefan Groschupf commented on NUTCH-290: If a parser throws an exception: Fetcher, 261: try { parse = this.parseUtil.parse(content); parseStatus =

[jira] Updated: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-292?page=all ] Stefan Neufeind updated NUTCH-292: -- Attachment: NUTCH-292-summarizer08.diff As requested, here is the patch. Please note that I have not thoroughly tested it myself. But the patch

[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ] Stefan Groschupf closed NUTCH-287: -- Resolution: Won't Fix http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html Exception when searching with sort

[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ] Stefan Groschupf closed NUTCH-284: -- Resolution: Won't Fix Yes, I was missing index-basic. NullPointerException during index - Key: NUTCH-284

[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] Stefan Groschupf commented on NUTCH-284: Please try to discuss such things on the user mailing list first, before opening an issue. Maintaining the issue tracking is very time

[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] Stefan Groschupf commented on NUTCH-281: Can you submit a patch file? cached.jsp: base-href needs to be outside comments

[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] Stefan Groschupf commented on NUTCH-274: Should we fix this in Hadoop's TextInputFormat so that it ignores empty lines, or in the Injector? Empty row in/at end of URL-list

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ] Stefan Neufeind commented on NUTCH-290: --- But if one plugin fails in 0.8-dev, isn't the next used? I understand that in the default-config the text-parser would be used

[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414466 ] Stefan Neufeind commented on NUTCH-291: --- Which way is most favorable? To always set lastModified although it was not returned from the webserver (maybe unclean) or

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] Stefan Groschupf commented on NUTCH-290: As far as I understand the code, the next parser is only used if the previous parser returns with an unsuccessful parsing status.

[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ] Stefan Groschupf closed NUTCH-286: -- Resolution: Won't Fix I hope everybody agrees with the statement: we cannot detect HTTP response codes based on the returned HTML content. Prune the

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ] Stefan Neufeind commented on NUTCH-290: --- But to my understanding of the plugin it still extracts as much as possible (meta-data) from the PDF. So if indexing is not

[jira] Created: (NUTCH-295) More description for fetcher.threads.fetch property

2006-06-02 Thread Dennis Kubes (JIRA)
More description for fetcher.threads.fetch property --- Key: NUTCH-295 URL: http://issues.apache.org/jira/browse/NUTCH-295 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev

[jira] Updated: (NUTCH-295) More description for fetcher.threads.fetch property

2006-06-02 Thread Dennis Kubes (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-295?page=all ] Dennis Kubes updated NUTCH-295: --- Attachment: fetcher_threads_desc.patch More description for fetcher.threads.fetch property as relating to running in distributed mode. More description for