I have not looked into this deeply, but this change would make me nervous, too. The main reason is that I have never seen this error before, and it makes me think that something is simply giving the SegmentMerger bad input.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Lincoln Ritter <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Wednesday, June 11, 2008 6:25:48 PM
> Subject: SegmentMerger "no input paths" problem and "special files/directories"
>
> Greetings,
>
> I'm running nutch trunk with the patch for hadoop 0.17 from NUTCH-634
> (http://issues.apache.org/jira/browse/NUTCH-634)
>
> I've run into a problem merging segments:
>
> $ ./bin/nutch mergesegs crawl/segments_merge -dir crawl/segments/
> 08/06/11 14:32:35 INFO segment.SegmentMerger: Merging 3 segments to crawl/segments_merge/20080611143235
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding hdfs://localhost:54310/user/lritter/crawl/segments/20080611135945
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding hdfs://localhost:54310/user/lritter/crawl/segments/20080611141414
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding hdfs://localhost:54310/user/lritter/crawl/segments/_logs
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: using segment data from:
> java.io.IOException: No input paths specified in input
>     at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:173)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
>     at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:605)
>     at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:648)
>
> This looks to be the same (or similar) issue as:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg10999.html
>
> In my case, the merger seems to think that the '_log' directory is
> valid fodder for merging. This is "clearly" not the case. In this
> case, I assume that underscore-prefixed names are "reserved" by nutch.
> With this assumption, I can make a filter that screens these out. I
> have done this and attached a patch against trunk below.
>
> While the patch fixes my immediate problem it makes me a little
> nervous that I am designating underscore-prefixed stuff as "special"
> in a pretty adhoc way. Is there any "real" way to determine whether or
> not a directory contains segment information?
>
> Thanks!
>
> -lincoln
>
> --
> lincolnritter.com
>
> --- PATCH ---
>
> Index: src/java/org/apache/nutch/segment/SegmentMerger.java
> ===================================================================
> --- src/java/org/apache/nutch/segment/SegmentMerger.java (revision 666871)
> +++ src/java/org/apache/nutch/segment/SegmentMerger.java (working copy)
> @@ -626,7 +626,7 @@
>      boolean normalize = false;
>      for (int i = 1; i < args.length; i++) {
>        if (args[i].equals("-dir")) {
> -        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassDirectoriesFilter(fs));
> +        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassNormalDirectoriesFilter(fs));
>          for (int j = 0; j < files.length; j++)
>            segs.add(files[j]);
>        } else if (args[i].equals("-filter")) {
> Index: src/java/org/apache/nutch/util/HadoopFSUtil.java
> ===================================================================
> --- src/java/org/apache/nutch/util/HadoopFSUtil.java (revision 666871)
> +++ src/java/org/apache/nutch/util/HadoopFSUtil.java (working copy)
> @@ -51,6 +51,23 @@
>
>      };
>    }
> +
> +  /**
> +   * Returns PathFilter that passes directories that are not "special" through.
> +   */
> +  public static PathFilter getPassNormalDirectoriesFilter(final FileSystem fs) {
> +    return new PathFilter() {
> +      public boolean accept(final Path path) {
> +        try {
> +          FileStatus status = fs.getFileStatus(path);
> +          return status.isDir() && !status.getPath().getName().startsWith("_");
> +        } catch (IOException ioe) {
> +          return false;
> +        }
> +      }
> +
> +    };
> +  }
>
>    /**
>     * Turns an array of FileStatus into an array of Paths.
>
> --- END PATCH ---
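
For anyone who wants to try the filtering rule from the patch without a Hadoop setup, here is a minimal sketch against the local filesystem using java.nio. It is only an illustration under stated assumptions: the class SegmentDirs and its method names are hypothetical and not part of Nutch or Hadoop, and the rule is the same ad hoc "skip names starting with an underscore" convention from the patch, not a real test for segment data.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Local-filesystem stand-in for the patched HadoopFSUtil filter.
// SegmentDirs and its method names are hypothetical, not part of Nutch.
class SegmentDirs {

    // Accept only directories whose name does not start with "_".
    static boolean isNormalDirectory(Path path) {
        Path name = path.getFileName();
        return name != null
                && Files.isDirectory(path)
                && !name.toString().startsWith("_");
    }

    // List the children of parent that pass the filter, sorted by path.
    static List<Path> listNormalDirectories(Path parent) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(parent, SegmentDirs::isNormalDirectory)) {
            for (Path p : stream) {
                result.add(p);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the layout from the log output: two segments plus _logs.
        Path root = Files.createTempDirectory("segments");
        Files.createDirectory(root.resolve("20080611135945"));
        Files.createDirectory(root.resolve("20080611141414"));
        Files.createDirectory(root.resolve("_logs"));
        Files.createFile(root.resolve("notes.txt"));
        for (Path p : listNormalDirectories(root)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Run as is, this prints the two timestamped directory names and skips both _logs and the plain file. It shares the weakness raised in the original message: a directory is treated as a segment purely because of its name.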