[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ]
Doug Cutting commented on NUTCH-288:
------------------------------------

> Is there a quickfix possible somehow?

Someone needs to fix the OpenSearch servlet. It looks like this only requires changing line 146 of OpenSearchServlet.java, replacing:

  Hit[] show = hits.getHits(start, end-start);

with:

  Hit[] show = hits.getHits(start, length > 0 ? length : 0);

Give this a try.

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug
>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind
>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads
> to problems when trying to offer a page-navigation (e.g. allow the user to
> jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set
> it to display 10 items per page. My "naive" approach was to estimate I have
> 7763/10 = 777 pages. But already when moving to page 3 I got no more
> searchresults (I guess because of dedup). And when moving to page 10 I got
> an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
>
> Only workaround I see for the moment: Fetching RSS without duplication, dedup
> myself and cache the RSS-result to improve performance. But a cleaner
> solution would imho be nice. Is there a performant way of doing deduplication
> and knowing for sure how many documents are available to view? For sure this
> would mean to dedup all search-results first ...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
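
For illustration only, here is a minimal sketch of the guard suggested in the comment above. It is not the actual Nutch code: the class PageWindow, the method window, and the parameters totalHits and hitsPerPage are hypothetical names chosen for this example. It assumes the NegativeArraySizeException arises because the requested start offset lies beyond the number of hits that remain after dedup, so that end - start becomes negative.

```java
import java.util.Arrays;

public class PageWindow {

    /**
     * Computes {start, length} for one result page.
     * Hypothetical helper: totalHits stands for the number of hits
     * actually available after dedup, not the raw estimate.
     */
    static int[] window(int totalHits, int start, int hitsPerPage) {
        // The page cannot extend past the last available hit.
        int end = Math.min(totalHits, start + hitsPerPage);
        // If start is already past the end, end - start is negative;
        // clamp to zero so no negative-sized array is ever requested.
        int length = Math.max(0, end - start);
        return new int[] { start, length };
    }

    public static void main(String[] args) {
        // 25 hits remain after dedup, 10 hits per page.
        System.out.println(Arrays.toString(window(25, 20, 10))); // [20, 5]
        // A page far beyond the available hits yields an empty window.
        System.out.println(Arrays.toString(window(25, 90, 10))); // [90, 0]
    }
}
```

With such a clamp, a request for a page past the end would return an empty result instead of throwing. It does not answer the reporter's second question of how many pages exist after dedup.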