Dennis Kubes wrote:
> Ok, I ran some bigger test crawls (> 150K) with the 0.9 RC.  Everything
> worked fine (inject, generate, fetch, updatedb, readdb, linkdb,
> mergesegs, mergedb, merge, index) except delete duplicates, on which I
> am getting this error when running against segment indexes on the DFS.
> 
> Because of the way I am automating some of my crawls (sorting names by
> alpha and only running part of the list), only one segment index
> part-xxxxx had results and the others had 0 results.  I don't know
> whether that would cause this, and I don't think this bug is critical
> for the 0.9 release, but I wanted to bring it up.

Please try the patch included at the end.


> 
> My guess would be that this is a small bug within the Lucene libraries
> when the directories have 0 results.  What is everyone's opinion on
> this in terms of the release?  My vote would be to move forward with
> the release.

I think we should move forward.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Index: DeleteDuplicates.java
===================================================================
--- DeleteDuplicates.java       (revision 521176)
+++ DeleteDuplicates.java       (working copy)
@@ -158,19 +158,28 @@
      public class DDRecordReader implements RecordReader {

        private IndexReader indexReader;
-      private int maxDoc;
-      private int doc;
+      private int maxDoc = 0;
+      private int doc = 0;
        private Text index;

        public DDRecordReader(FileSplit split, JobConf job,
            Text index) throws IOException {
-        indexReader = IndexReader.open(new FsDirectory(FileSystem.get(job), split.getPath(), false, job));
-        maxDoc = indexReader.maxDoc();
+        try {
+          indexReader = IndexReader.open(new FsDirectory(FileSystem.get(job), split.getPath(), false, job));
+          maxDoc = indexReader.maxDoc();
+        } catch (IOException ioe) {
+          LOG.warn("Can't open index at " + split + ", skipping. (" + ioe.getMessage() + ")");
+          indexReader = null;
+        }
          this.index = index;
        }

        public boolean next(Writable key, Writable value)
          throws IOException {
+
+        // skip empty indexes
+        if (indexReader == null || maxDoc <= 0)
+          return false;

          // skip deleted documents
          while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
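
For anyone curious why both guards are needed, here is a minimal,
self-contained sketch of the two failure modes. The class name and the
use of RAMDirectory are mine for illustration only; DeleteDuplicates
itself opens the segment indexes through Nutch's FsDirectory over DFS,
and this assumes the Lucene release bundled with Nutch 0.9 (roughly 2.x):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class EmptyIndexCheck {

  public static void main(String[] args) throws IOException {
    // Case 1: a directory that never had an index written into it.
    // IndexReader.open() fails with an IOException (no segments file),
    // which is what the new try/catch in DDRecordReader now swallows.
    RAMDirectory noIndex = new RAMDirectory();
    try {
      IndexReader.open(noIndex);
    } catch (IOException ioe) {
      System.out.println("open failed as expected: " + ioe.getMessage());
    }

    // Case 2: an index that was created but holds zero documents.
    // open() succeeds, but maxDoc() is 0, so the old next() would go on
    // to call isDeleted(0) for a document that does not exist; the new
    // "maxDoc <= 0" guard returns false instead.
    RAMDirectory emptyIndex = new RAMDirectory();
    new IndexWriter(emptyIndex, new StandardAnalyzer(), true).close();
    IndexReader reader = IndexReader.open(emptyIndex);
    System.out.println("maxDoc on empty index: " + reader.maxDoc());  // 0
    reader.close();
  }
}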
