[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413227 ] Marko Bauhardt commented on NUTCH-284:
---------------------------------------

I think the index-basic plugin is not included. Line 111 is

    doc.getField("url").stringValue()

and it is the BasicIndexingFilter that indexes the "url" field. Verify your logfile or nutch-default.xml (or nutch-site.xml).

Marko

NullPointerException during index
---------------------------------
        Key: NUTCH-284
        URL: http://issues.apache.org/jira/browse/NUTCH-284
    Project: Nutch
        Type: Bug
 Components: indexer
   Versions: 0.8-dev
   Reporter: Stefan Neufeind

For quite a while this "reduce > sort" had been going on. Then it fails. What could be wrong with this?

060524 212613 reduce > sort
060524 212614 reduce > sort
060524 212615 reduce > sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
        at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
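For context: which indexing filters run is controlled by the plugin.includes property that Marko points at. A minimal sketch of the relevant nutch-site.xml entry follows -- the exact plugin list is an assumption and will differ per installation; the point is only that index-basic has to appear in it:

    <!-- nutch-site.xml (illustrative value): without index-basic in
         plugin.includes, documents reach the Indexer without a "url"
         field, and Indexer.java:111 then dereferences null. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url)</value>
    </property>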
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413231 ] Gal Nitzan commented on NUTCH-284:
-----------------------------------

I just had something similar. Try the following: run ant on each of your tasktracker machines:

    % ant

then restart your Nutch and try again. I think there is a problem with the classpath.
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413240 ] Stefan Neufeind commented on NUTCH-284:
----------------------------------------

Yes, I was missing index-basic. My apologies. I needed the extra fields of index-more and thought it would provide the basic fields as well. The same thing occurred in NUTCH-51.

Would it be possible to demand that index-basic is loaded (just like a scoring plugin is required, etc.)? And if somebody writes his own index-basic2 plugin, he'd have to be able to put a "provides index-basic" declaration into his plugin to signal that he indexes the basic fields. Maybe something like this could save people like me some trouble and searching :-)
[jira] Created: (NUTCH-287) Exception when searching with sort
Exception when searching with sort
----------------------------------
        Key: NUTCH-287
        URL: http://issues.apache.org/jira/browse/NUTCH-287
    Project: Nutch
        Type: Bug
 Components: searcher
   Versions: 0.8-dev
   Reporter: Stefan Neufeind
   Priority: Critical

Running a search with sort=url works. But when using sort=title I get the following exception:

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
        at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
        at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
        at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
        at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
        at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
        at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)

What is in those lines is:

    WritableComparable sortValue;   // convert value to writable
    if (sortField == null) {
      sortValue = new FloatWritable(scoreDocs[i].score);
    } else {
      Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
      if (raw instanceof Integer) {
        sortValue = new IntWritable(((Integer)raw).intValue());
      } else if (raw instanceof Float) {
        sortValue = new FloatWritable(((Float)raw).floatValue());
      } else if (raw instanceof String) {
        sortValue = new UTF8((String)raw);
      } else {
        throw new RuntimeException("Unknown sort value type!");
      }
    }

So I thought that maybe raw is an instance of something strange and tried raw.getClass().getName() or raw.toString() to track the cause down -- but both always resulted in a NullPointerException. So it seems raw is null for some strange reason. When I try with title2 (or something else non-existing) I get a different error, namely that title2 is unknown / not indexed. So I suspect that title should be fine here ... If there is any information I can help out with, let me know.
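Since raw evidently comes back null when a document lacks the sort field, one defensive variant would handle null explicitly before the instanceof chain. This is only a sketch of a possible guard, reusing the variable names from the snippet above; it is not the fix that actually went into Nutch:

    Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
    if (raw == null) {
      // A document lacking the sort field yields a null FieldDoc value;
      // substitute an empty string so the hit still gets a sort key
      // instead of the instanceof chain falling through to the throw.
      sortValue = new UTF8("");
    } else if (raw instanceof Integer) {
      sortValue = new IntWritable(((Integer)raw).intValue());
    } else if (raw instanceof Float) {
      sortValue = new FloatWritable(((Float)raw).floatValue());
    } else if (raw instanceof String) {
      sortValue = new UTF8((String)raw);
    } else {
      // Report the runtime type to make future bug reports more informative.
      throw new RuntimeException("Unknown sort value type: " + raw.getClass().getName());
    }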
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] Stefan Neufeind updated NUTCH-110:
-----------------------------------

Attachment: fixIllegalXmlChars08.patch

Since the original patch didn't apply cleanly for me on 0.8-dev (nightly-2006-05-20), I re-did it for 0.8 ... With this patch the XML is fine. Without it I had big trouble parsing the RSS feed in another application.

OpenSearchServlet outputs illegal xml characters
------------------------------------------------
        Key: NUTCH-110
        URL: http://issues.apache.org/jira/browse/NUTCH-110
    Project: Nutch
        Type: Bug
 Components: searcher
   Versions: 0.7
Environment: linux, jdk 1.5
   Reporter: [EMAIL PROTECTED]
Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if the text contains the form-feed (FF) character -- i.e. the ascii character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
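For reference, the Char production cited above admits only #x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD] and [#x10000-#x10FFFF]. A minimal sketch of the kind of filter such a patch needs -- not the attached patch itself, and the method name is made up for illustration:

    /** Drop characters that are illegal in XML 1.0 per the Char production at
     *  http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char. Illustrative sketch:
     *  supplementary characters ([#x10000-#x10FFFF], i.e. Java surrogate
     *  pairs) would need extra handling and are simply dropped here. */
    public static String stripNonXmlChars(String s) {
      StringBuffer out = new StringBuffer(s.length());
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)) {
          out.append(c);   // legal XML character: keep it
        }                  // e.g. form feed (0x0C) falls through and is removed
      }
      return out.toString();
    }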
[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] Doug Cutting commented on NUTCH-288:
-------------------------------------

> Is there a performant way of doing deduplication and knowing for sure how many documents are available to view?

No. But we should probably handle this situation without throwing exceptions. For example, look at the following:

http://www.google.com/search?q=emacs+%22doug+cutting%22&start=90

Click on the "page 19" link at the bottom. It takes you to page 16, the last page after deduplication.

hitsPerSite-functionality flawed: problems writing a page-navigation
---------------------------------------------------------------------
        Key: NUTCH-288
        URL: http://issues.apache.org/jira/browse/NUTCH-288
    Project: Nutch
        Type: Bug
 Components: web gui
   Versions: 0.8-dev
   Reporter: Stefan Neufeind

The per-site deduplication functionality (hitsPerSite = 3) leads to problems when trying to offer a page navigation (e.g. allowing the user to jump to page 10), because dedup is done after fetching. RSS shows a maximum of 7763 documents (that is without dedup!), and I set it to display 10 items per page. My naive approach was to estimate that I have 7763/10 = 777 pages. But already when moving to page 3 I got no more search results (I guess because of dedup). And when moving to page 10 I got an exception (see below).

2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
java.lang.NegativeArraySizeException
        at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
        at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
        at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
        at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)

The only workaround I see for the moment: fetch the RSS without deduplication, dedup myself, and cache the RSS result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean deduping all search results first ...
[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413275 ] Stefan Neufeind commented on NUTCH-288:
----------------------------------------

How do they do that? Right, I'm transferred to page 16. But if I click on page 14, that also seems to be the last page in order? Something looks strange there, too ...

And using Nutch: how should I know (using the RSS feed) which page I am on? I'm getting the above exception -- no reply, and no new start value from which I could compute which page I am actually on.

Is there a quick fix possible somehow?
[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] Doug Cutting commented on NUTCH-288:
-------------------------------------

> Is there a quick fix possible somehow?

Someone needs to fix the OpenSearch servlet. It looks like it just takes a change at line 146 of OpenSearchServlet.java, replacing:

    Hit[] show = hits.getHits(start, end-start);

with:

    Hit[] show = hits.getHits(start, length > 0 ? length : 0);

Give this a try.
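To spell out what the one-liner guards against: a requested window can lie entirely beyond the last deduplicated hit, making the computed length negative, which is what triggered the NegativeArraySizeException in Hits.getHits above. A standalone illustration with made-up numbers and hypothetical variable names (the servlet's actual locals are not shown in this thread):

    int start = 90;                     // offset the client asked for
    int hitsPerPage = 10;
    int totalAfterDedup = 83;           // hits actually available (assumed)
    int end = Math.min(start + hitsPerPage, totalAfterDedup);
    int length = end - start;           // here: 83 - 90 = -7
    int safe = length > 0 ? length : 0; // clamp so Hits.getHits() never
                                        // allocates a negative-size array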
[jira] Updated: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=all ] Stefan Neufeind updated NUTCH-288:
-----------------------------------

Attachment: NUTCH-288-OpenSearch-fix.patch

This patch includes Doug's one-line fix to prevent the exception. It also goes back page by page until you reach the last result page, so the start value returned in the RSS feed is correct afterwards(!). This easily lets you check whether the requested result start and the one received are identical -- otherwise you are on the last page and were redirected, and you then know that you don't need to display any further pages in your page navigation :-)

Applies and works fine for me.
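On the consumer side, the check Stefan describes might look like the sketch below. The variable names are hypothetical; the premise, taken from the comment above, is that with the patch applied the feed reports the start offset of the page actually served:

    int requestedStart = 100;   // offset we put into the query string
    int returnedStart  = 60;    // start value parsed from the returned feed

    // If the servlet walked back to the real last page, the offsets differ,
    // so the page navigation should not offer any pages beyond this one.
    boolean onLastPage = (returnedStart != requestedStart);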