[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398026#comment-16398026
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---------------------------------------

lewismc commented on issue #293: NUTCH-2517 mergesegs corrupts segment data
URL: https://github.com/apache/nutch/pull/293#issuecomment-372889814
 
 
   Ni semantic changes here folks, will commit in 24hrs unless objections.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> mergesegs corrupts segment data
> -------------------------------
>
>                 Key: NUTCH-2517
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2517
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.15
>         Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>            Reporter: Marco Ebbinghaus
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>              Labels: mapreduce, mergesegs
>             Fix For: 1.15
>
>         Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So as it seems mapreduce corrupts the segment folder during mergesegs command.
>  
> Pay attention to the fact that this issue is not related on trying to merge a 
> single segment like described above. As you can see on the attached 
> screenshot that problem also appears when executing multiple bin/nutch 
> generate/fetch/parse/updatedb commands before executing mergesegs - resulting 
> in a segment count > 1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to