[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2517. ------------------------------------ Resolution: Fixed Thanks, [~mebbinghaus]! Thanks, [~lewismc]! With [PR #312|https://github.com/apache/nutch/pull/321] the directory structure of a merged segment looks correct: {noformat} .../mergedsegs/20180426130537 |-- content | `-- part-r-00000 | |-- data | `-- index |-- crawl_fetch | `-- part-r-00000 | |-- data | `-- index |-- crawl_generate | `-- part-r-00000 |-- crawl_parse | `-- part-r-00000 |-- parse_data | `-- part-r-00000 | |-- data | `-- index `-- parse_text `-- part-r-00000 |-- data `-- index {noformat} Tested in local and pseudo-distributed mode. I've also verified that the merged segment can be read, see [test script|https://github.com/sebastian-nagel/nutch-test-single-node-cluster/blob/master/test_nutch_tools.sh#L67]. Next week I'll plan to test on a real Hadoop cluster. > mergesegs corrupts segment data > ------------------------------- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment > Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST > Reporter: Marco Ebbinghaus > Assignee: Lewis John McGibbney > Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before executing mergesegs - resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)