[jira] Commented: (NUTCH-284) NullPointerException during index

2006-05-25 Thread Marko Bauhardt (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413227 ] 

Marko Bauhardt commented on NUTCH-284:
--

I think the index-basic plugin is not included. Line 111 is:

  doc.getField("url").stringValue()

The BasicIndexingFilter is what indexes the field "url".

Check your logfile and your nutch-default.xml (or nutch-site.xml).

Marko
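
For illustration, here is a minimal plain-Java sketch (a stand-in for the Lucene Document API, not actual Nutch code; the map-based "document" is hypothetical) of why a missing "url" field surfaces as an NPE at Indexer.java:111, and what a defensive lookup could look like:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a Lucene Document: getField() returns null when the
// field was never added - which is exactly what happens to "url" when
// index-basic is not among the enabled plugins.
public class MissingFieldSketch {
    static Map<String, String> doc = new HashMap<>();

    static String getField(String name) {
        return doc.get(name); // null if the field is absent
    }

    public static void main(String[] args) {
        // Without index-basic, "url" was never added to the document:
        String url = getField("url");
        // doc.getField("url").stringValue() dereferences this null value,
        // producing the NullPointerException. A defensive version:
        if (url == null) {
            System.out.println("field 'url' missing - is index-basic enabled?");
        } else {
            System.out.println(url);
        }
    }
}
```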



 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For quite a while this "reduce > sort" has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce > sort
 060524 212614 reduce > sort
 060524 212615 reduce > sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-284) NullPointerException during index

2006-05-25 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413231 ] 

Gal Nitzan commented on NUTCH-284:
--

I just had something similar.

Try the following: run ant on each of your tasktracker machines:

% ant

then restart your Nutch and try again.

I think there is a problem with the classpath.




[jira] Commented: (NUTCH-284) NullPointerException during index

2006-05-25 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413240 ] 

Stefan Neufeind commented on NUTCH-284:
---

Yes, I was missing index-basic. My apologies. I needed the extra fields of 
index-more and thought it would provide the basic fields as well.
The same thing occurred in NUTCH-51.

Would it be possible to require that index-basic is loaded (the same way a 
scoring plugin is required)? If somebody writes his own index-basic2 plugin, 
he would then need to be able to put a "provides index-basic" declaration into 
his plugin to announce that he indexes the basic fields. Maybe something like 
this could save people like me some trouble and searching :-)




[jira] Created: (NUTCH-287) Exception when searching with sort

2006-05-25 Thread Stefan Neufeind (JIRA)
Exception when searching with sort
--

 Key: NUTCH-287
 URL: http://issues.apache.org/jira/browse/NUTCH-287
 Project: Nutch
Type: Bug

  Components: searcher  
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical


Running a search with sort=url works.
But when using sort=title I get the following exception.

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet 
jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
at 
org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
at 
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
at 
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
at 
org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
at 
org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)

What is in those lines is:

  WritableComparable sortValue;   // convert value to writable
  if (sortField == null) {
    sortValue = new FloatWritable(scoreDocs[i].score);
  } else {
    Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
    if (raw instanceof Integer) {
      sortValue = new IntWritable(((Integer)raw).intValue());
    } else if (raw instanceof Float) {
      sortValue = new FloatWritable(((Float)raw).floatValue());
    } else if (raw instanceof String) {
      sortValue = new UTF8((String)raw);
    } else {
      throw new RuntimeException("Unknown sort value type!");
    }
  }


So I thought that maybe raw is an instance of something strange and tried 
raw.getClass().getName() and raw.toString() to track the cause down - but both 
always resulted in a NullPointerException. So it seems raw is null for some 
strange reason.

When I try with title2 (or some other non-existing field) I get a different 
error saying that title2 is unknown / not indexed. So I suspect that title 
itself should be fine here ...
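
The type dispatch above can be sketched in isolation with a null guard added. This is a simplified illustration in plain Java (hypothetical class and method names, plain String keys instead of the Hadoop Writable types), showing how a document with no value for the sort field could be tolerated rather than surfacing as an exception:

```java
// Simplified sketch of the translateHits type dispatch with a null guard.
// Names here are illustrative, not Nutch's actual IndexSearcher code.
public class SortValueSketch {

    /** Convert a raw Lucene sort value into a comparable string key. */
    static String toSortKey(Object raw) {
        if (raw == null) {
            // A document with no value for the sort field arrives as null;
            // treat it as an empty key instead of throwing.
            return "";
        } else if (raw instanceof Integer || raw instanceof Float) {
            return raw.toString();
        } else if (raw instanceof String) {
            return (String) raw;
        } else {
            throw new RuntimeException("Unknown sort value type!");
        }
    }

    public static void main(String[] args) {
        System.out.println(toSortKey(42));      // prints "42"
        System.out.println(toSortKey("title")); // prints "title"
        System.out.println(toSortKey(null));    // prints "" - no exception
    }
}
```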

If there is any information I can help out with, let me know.


[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-05-25 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

Stefan Neufeind updated NUTCH-110:
--

Attachment: fixIllegalXmlChars08.patch

Since the original patch didn't apply cleanly for me on 0.8-dev 
(nightly-2006-05-20), I re-did it for 0.8 ...

With this patch the XML is fine. Without it, I had big trouble parsing the 
RSS feed in another application.

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.7
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, 
 fixIllegalXmlChars08.patch

 OpenSearchServlet does not check text-to-output for illegal XML characters; 
 depending on the search result, it's possible for OSS to output XML that is 
 not well-formed. For example, if the text has the FF (form feed) character in 
 it -- i.e. the ASCII character at position (decimal) 12 -- the produced XML 
 will show the FF character as '&#12;'. The character/entity '&#12;' is not 
 legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.




[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] 

Doug Cutting commented on NUTCH-288:


 Is there a performant way of doing deduplication and knowing for sure how 
 many documents are available to view?

No.  But we should probably handle this situation without throwing exceptions.

For example, look at the following:

http://www.google.com/search?q=emacs+%22doug+cutting%22&start=90

Click on the "page 19" link at the bottom. It takes you to page 16, the last 
page after deduplication.


 hitsPerSite-functionality flawed: problems writing a page-navigation
 --

  Key: NUTCH-288
  URL: http://issues.apache.org/jira/browse/NUTCH-288
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads 
 to problems when trying to offer a page-navigation (e.g. allow the user to 
 jump to page 10). This is because dedup is done after fetching.
 RSS shows a maximum number of 7763 documents (that is without dedup!), and I 
 set it to display 10 items per page. My naive approach was to estimate that I 
 have 7763/10 = 777 pages. But already when moving to page 3 I got no more 
 search results (I guess because of dedup). And when moving to page 10 I got 
 an exception (see below).
 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
 servlet OpenSearch threw exception
 java.lang.NegativeArraySizeException
 at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
 at 
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
 at 
 org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
 at 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:595)
 The only workaround I see for the moment: fetch the RSS without deduplication, 
 dedup myself, and cache the RSS result to improve performance. But a cleaner 
 solution would IMHO be nice. Is there a performant way of doing deduplication 
 and knowing for sure how many documents are available to view? For sure this 
 would mean deduping all search results first ...




[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413275 ] 

Stefan Neufeind commented on NUTCH-288:
---

How do they do that? Right, I'm transferred to page 16. But if I click on page 
14, that also seems to be the last page in order? Something looks strange 
there, too ...

And using Nutch: how should I know (using the RSS feed) which page I am on? 
I'm getting the above exception - no reply, and no new start value from which 
I could compute which page I actually am on. Is there a quick fix possible 
somehow?




[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] 

Doug Cutting commented on NUTCH-288:


  Is there a quickfix possible somehow?

Someone needs to fix the OpenSearch servlet.

It looks like just changing line 146 of OpenSearchServlet.java should do it, replacing:

Hit[] show = hits.getHits(start, end-start);

with:

Hit[] show = hits.getHits(start, length > 0 ? length : 0);

Give this a try.
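
The failure mode and the guard can be sketched like this. The code below is a simplified stand-in (illustrative names, an int array in place of the Hits class), not the actual OpenSearchServlet or Hits code:

```java
// Why Hits.getHits() threw: after per-site dedup, the number of hits actually
// available can be smaller than the requested window, so the computed length
// goes negative and "new Hit[length]" raises NegativeArraySizeException.
// Clamping the length to zero returns an empty page instead.
public class HitWindowSketch {

    static int[] getHits(int[] allHits, int start, int length) {
        int[] page = new int[length];            // throws if length < 0
        for (int i = 0; i < length; i++) {
            page[i] = allHits[start + i];
        }
        return page;
    }

    public static void main(String[] args) {
        int[] hits = {1, 2, 3};                  // only 3 hits survive dedup
        int start = 10, end = 20;                // user asked for a later page
        int length = end - start;                // 10 requested...
        length = Math.min(length, hits.length - start); // ...but only -7 left
        // The guarded call, equivalent to "length > 0 ? length : 0":
        int[] page = getHits(hits, start, Math.max(length, 0));
        System.out.println(page.length);         // prints 0, no exception
    }
}
```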




[jira] Updated: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Stefan Neufeind (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-288?page=all ]

Stefan Neufeind updated NUTCH-288:
--

Attachment: NUTCH-288-OpenSearch-fix.patch

This patch includes Doug's one-line fix to prevent the exception.
It also steps back page by page until you reach the last result page. The 
start value returned in the RSS feed is correct afterwards(!). This easily 
lets you check whether the requested result start and the one received are 
identical - otherwise you are on the last page and were redirected, and you 
then know that you don't need to display any further pages in your 
page navigation :-)

Applies and works fine for me.
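
The last-page check described above can be sketched as follows (hypothetical names; in practice the two values would come from the start you requested and the start attribute the RSS feed returns):

```java
// If the start value returned in the feed differs from the one we requested,
// the servlet stepped back to the last real page - so the page navigation
// should not offer any pages beyond the current one.
public class LastPageSketch {

    static boolean onLastPage(int requestedStart, int returnedStart) {
        return returnedStart != requestedStart;
    }

    public static void main(String[] args) {
        System.out.println(onLastPage(100, 100)); // prints false: full page
        System.out.println(onLastPage(100, 60));  // prints true: redirected
    }
}
```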
