[ 
https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050089#comment-15050089
 ] 

Hudson commented on NUTCH-2183:
-------------------------------

FAILURE: Integrated in Nutch-trunk #3327 (See 
[https://builds.apache.org/job/Nutch-trunk/3327/])
NUTCH-2183 Improvement to SegmentChecker for skipping non-segments present in 
segments directory (lewismc: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1719006])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* trunk/src/java/org/apache/nutch/segment/SegmentChecker.java


> Improvement to SegmentChecker for skipping non-segments present in segments 
> directory
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2183
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2183
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, segment
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2183.patch
>
>
> The scenario is that you have a bunch of Nutch data which has been gathered 
> over some period of time. Some of the data structures are present, some are 
> not. In the segments directory, for example, there are .zip files (don't ask 
> why), and in other directories there are .tar.gz files, etc.
> This patch improves the SegmentChecker to skip directories or files present 
> within the segments directory whose names are not 14 characters in length, as 
> ALL segment names are. It also applies this check to individual segments 
> passed to the IndexingJob. This means that we can prevent the Indexer from 
> blowing up if it is run on one segment (e.g. without the -dir option) and 
> encounters some arbitrary directory present within segments/ which actually 
> turns out not to be a segment after all.
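
For readers skimming the archive, the length check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from NUTCH-2183.patch: the class name SegmentNameSketch and the methods looksLikeSegment and filterSegments are hypothetical, and the sketch works on plain name strings rather than on Hadoop paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates the "14 characters" rule only.
class SegmentNameSketch {

  // Length that the issue description says all segment names share.
  static final int SEGMENT_NAME_LENGTH = 14;

  // Returns true if the entry name has the length expected of a segment name.
  static boolean looksLikeSegment(String name) {
    return name != null && name.length() == SEGMENT_NAME_LENGTH;
  }

  // Keeps only the entries that pass the length check; everything else is skipped.
  static List<String> filterSegments(List<String> names) {
    List<String> kept = new ArrayList<>();
    for (String name : names) {
      if (looksLikeSegment(name)) {
        kept.add(name);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Example directory listing mixing one segment-like name with stray archives.
    List<String> entries =
        Arrays.asList("20151209123456", "backup.zip", "old-data.tar.gz");
    System.out.println(filterSegments(entries));
  }
}
```

Note that a length-only test like this one would also accept a stray file whose name happens to be 14 characters long (for example "archive.tar.gz"), so a stricter variant might additionally require that the name consists only of digits. That stricter rule is an assumption on my part and is not stated in the issue.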



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
