I have not looked into this deeply, but this change would make me nervous, too. 
The main reason is that I have never seen this error, and the error makes me 
think that something is simply giving the SegmentMerger wrong/bad input.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Lincoln Ritter <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Wednesday, June 11, 2008 6:25:48 PM
> Subject: SegmentMerger "no input paths" problem and "special 
> files/directories"
> 
> Greetings,
> 
> I'm running nutch trunk with the patch for hadoop 0.17 from NUTCH-634
> (http://issues.apache.org/jira/browse/NUTCH-634)
> 
> I've run into a problem merging segments:
> 
> $ ./bin/nutch mergesegs crawl/segments_merge -dir crawl/segments/
> 08/06/11 14:32:35 INFO segment.SegmentMerger: Merging 3 segments to
> crawl/segments_merge/20080611143235
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger:   adding
> hdfs://localhost:54310/user/lritter/crawl/segments/20080611135945
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger:   adding
> hdfs://localhost:54310/user/lritter/crawl/segments/20080611141414
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger:   adding
> hdfs://localhost:54310/user/lritter/crawl/segments/_logs
> 08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: using
> segment data from:
> java.io.IOException: No input paths specified in input
>     at 
> org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:173)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
>     at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:605)
>     at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:648)
> 
> This looks to be the same (or similar) issue as:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg10999.html
> 
> In my case, the merger seems to think that the '_logs' directory is
> valid fodder for merging.  This is clearly not the case.  I assume
> that underscore-prefixed names are "reserved" by Nutch.  With this
> assumption, I can write a filter that screens them out.  I have done
> this and attached a patch against trunk below.
> 
> While the patch fixes my immediate problem, it makes me a little
> nervous that I am designating underscore-prefixed entries as "special"
> in a pretty ad hoc way.  Is there any "real" way to determine whether
> or not a directory contains segment information?
> 
> Thanks!
> 
> -lincoln
> 
> --
> lincolnritter.com
> 
> --- PATCH ---
> 
> Index: src/java/org/apache/nutch/segment/SegmentMerger.java
> ===================================================================
> --- src/java/org/apache/nutch/segment/SegmentMerger.java    (revision 666871)
> +++ src/java/org/apache/nutch/segment/SegmentMerger.java    (working copy)
> @@ -626,7 +626,7 @@
>      boolean normalize = false;
>      for (int i = 1; i < args.length; i++) {
>        if (args[i].equals("-dir")) {
> -        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassDirectoriesFilter(fs));
> +        Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassNormalDirectoriesFilter(fs));
>          for (int j = 0; j < files.length; j++)
>            segs.add(files[j]);
>        } else if (args[i].equals("-filter")) {
> Index: src/java/org/apache/nutch/util/HadoopFSUtil.java
> ===================================================================
> --- src/java/org/apache/nutch/util/HadoopFSUtil.java    (revision 666871)
> +++ src/java/org/apache/nutch/util/HadoopFSUtil.java    (working copy)
> @@ -51,6 +51,23 @@
> 
>          };
>      }
> +
> +    /**
> +     * Returns PathFilter that passes directories that are not "special" through.
> +     */
> +    public static PathFilter getPassNormalDirectoriesFilter(final FileSystem fs) {
> +        return new PathFilter() {
> +            public boolean accept(final Path path) {
> +                try {
> +                    FileStatus status = fs.getFileStatus(path);
> +                    return status.isDir() && !status.getPath().getName().startsWith("_");
> +                } catch (IOException ioe) {
> +                    return false;
> +                }
> +            }
> +
> +        };
> +    }
> 
>      /**
>       * Turns an array of FileStatus into an array of Paths.
> 
> --- END PATCH ---
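
On the closing question about a "real" way to recognize a segment directory: a less ad hoc check than the underscore convention could be to look for the standard per-segment subdirectories that Nutch writes (crawl_generate, crawl_fetch, crawl_parse, content, parse_data, parse_text). The sketch below uses java.io.File for illustration only; the actual filter would go through Hadoop's FileSystem API as in the patch, and the class and method names here (SegmentCheck, looksLikeSegment) are hypothetical, not part of Nutch.

```java
import java.io.File;

public class SegmentCheck {
    // Subdirectories that Nutch creates inside a segment directory.
    static final String[] SEGMENT_PARTS = {
        "crawl_generate", "crawl_fetch", "crawl_parse",
        "content", "parse_data", "parse_text"
    };

    // A directory "looks like" a segment if it contains at least one
    // of the standard segment subdirectories; '_logs' and other
    // bookkeeping directories contain none of them.
    static boolean looksLikeSegment(File dir) {
        if (!dir.isDirectory()) {
            return false;
        }
        for (String part : SEGMENT_PARTS) {
            if (new File(dir, part).isDirectory()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Build a tiny directory tree: one segment-like dir, one '_logs' dir.
        File root = new File(System.getProperty("java.io.tmpdir"), "seg-check-demo");
        File seg = new File(root, "20080611135945");
        File logs = new File(root, "_logs");
        new File(seg, "crawl_generate").mkdirs();
        logs.mkdirs();

        System.out.println(looksLikeSegment(seg));   // true
        System.out.println(looksLikeSegment(logs));  // false
    }
}
```

A content-based check like this would make the merger independent of naming conventions, though it costs extra filesystem stats per candidate directory.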
