[ https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050089#comment-15050089 ]
Hudson commented on NUTCH-2183:
-------------------------------

FAILURE: Integrated in Nutch-trunk #3327 (See [https://builds.apache.org/job/Nutch-trunk/3327/])
NUTCH-2183 Improvement to SegmentChecker for skipping non-segments present in segments directory (lewismc: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1719006])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* trunk/src/java/org/apache/nutch/segment/SegmentChecker.java

> Improvement to SegmentChecker for skipping non-segments present in segments directory
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2183
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2183
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, segment
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2183.patch
>
>
> The scenario is that you have a bunch of Nutch data which has been gathered
> over some period of time. Some of the data structures are present, some are
> not. In the segments directory, for example, there are .zip files (don't ask
> why), and in other directories there are .tar.gz files, etc.
> This patch improves the SegmentChecker to skip directories or files present
> within the segments directory whose names are not 14 characters in length, as
> ALL segment names are. It also applies this check to individual segments when
> used by the IndexingJob. This means that we can prevent the Indexer blowing up
> if it is run on one segment (e.g. without the -dir option) and detects some
> arbitrary directory present within segments/ which actually turns out not to
> be a segment after all.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
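
For readers who want to see the name-length check from the issue description in concrete form, here is a minimal sketch. It is not the code from NUTCH-2183.patch: the class name `SegmentNameFilter`, its methods, and the sample directory entries are all hypothetical, and the only rule taken from the description is that segment names are 14 characters long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not taken from NUTCH-2183.patch.
final class SegmentNameFilter {

    // Per the issue description, all segment names are 14 characters long.
    static final int SEGMENT_NAME_LENGTH = 14;

    // Returns true if the name has the length expected of a segment name.
    static boolean hasSegmentNameLength(String name) {
        return name != null && name.length() == SEGMENT_NAME_LENGTH;
    }

    // Keeps only the entries that could be segments, skipping everything else.
    static List<String> keepSegmentCandidates(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (hasSegmentNameLength(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up entries standing in for the contents of a segments directory.
        List<String> entries =
            Arrays.asList("20151209101500", "backup.zip", "old-data.tar.gz");
        System.out.println(keepSegmentCandidates(entries));
    }
}
```

A length check alone is a cheap first filter: it does not confirm that a 14-character entry really contains segment data, so a real checker would presumably still need to validate the directory contents afterwards.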