[jira] [Updated] (NUTCH-2666) increase default value for http.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Ebbinghaus updated NUTCH-2666:
------------------------------------
    Description: 
The default value for http.content.limit in nutch-default.xml ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is set to 64 kB. Maybe this default value should be increased, as many pages today are larger than 64 kB.

This fact hit me when trying to crawl a single website whose pages are much larger than 64 kB: with every crawl cycle the count of db_unfetched URLs decreased until it hit zero and the crawler became inactive (because the first 64 kB always contained the same set of navigation links).

The description might also be updated, as this is not only the case for the http protocol, but also for https.

  was:
The default value for http.content.limit in nutch-default.xml ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is set to 64 kB. Maybe this default value should be increased, as many pages today are larger than 64 kB.

The description might also be updated, as this is not only the case for the http protocol, but also for https.

> increase default value for http.content.limit
> ---------------------------------------------
>
>                 Key: NUTCH-2666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2666
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Marco Ebbinghaus
>            Priority: Minor
>
> The default value for http.content.limit in nutch-default.xml ("The length
> limit for downloaded content using the http:// protocol, in bytes. If this
> value is nonnegative (>=0), content longer than it will be truncated;
> otherwise, no truncation at all. Do not confuse this setting with the
> file.content.limit setting.") is set to 64 kB. Maybe this default value
> should be increased as many pages today are larger than 64 kB.
> This fact hit me when trying to crawl a single website whose pages are much
> larger than 64 kB and because of that with every crawl cycle the count of
> db_unfetched URLs decreased until it hit zero and the crawler became inactive
> (because the first 64 kB always contained the same set of navigation links).
> The description might also be updated as this is not only the case for the
> http protocol, but also for https.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
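For users hitting this limit before any default change lands, the value can be overridden per installation in nutch-site.xml rather than editing nutch-default.xml; a minimal sketch (the 1 MiB value is an arbitrary illustration, not a project recommendation):

```xml
<!-- nutch-site.xml: override the 64 kB default for http.content.limit.
     1048576 (1 MiB) is an illustrative value, not an official recommendation.
     A negative value would disable truncation entirely. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```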
[jira] [Updated] (NUTCH-2666) increase default value for http.content.limit
[ https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Ebbinghaus updated NUTCH-2666:
------------------------------------
    Description: 
The default value for http.content.limit in nutch-default.xml ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is set to 64 kB. Maybe this default value should be increased, as many pages today are larger than 64 kB.

The description might also be updated, as this is not only the case for the http protocol, but also for https.

  was:
The default value for http.content.limit ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is set to 64 kB. Maybe this default value should be increased, as many pages today are larger than 64 kB.

The description might also be updated, as this is not only the case for the http protocol, but also for https.
[jira] [Created] (NUTCH-2666) increase default value for http.content.limit
Marco Ebbinghaus created NUTCH-2666:
------------------------------------
             Summary: increase default value for http.content.limit
                 Key: NUTCH-2666
                 URL: https://issues.apache.org/jira/browse/NUTCH-2666
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.15
            Reporter: Marco Ebbinghaus

The default value for http.content.limit ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is set to 64 kB. Maybe this default value should be increased, as many pages today are larger than 64 kB.

The description might also be updated, as this is not only the case for the http protocol, but also for https.
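The truncation semantics quoted from nutch-default.xml can be stated as a few lines of code. The following is a hedged sketch of those documented semantics only; `apply_content_limit` is a hypothetical helper for illustration, not Nutch's actual fetcher implementation:

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Sketch of the documented http.content.limit behavior:
    a nonnegative limit truncates fetched content to at most `limit`
    bytes; a negative limit means no truncation at all."""
    if limit >= 0:
        return content[:limit]
    return content
```

With the 64 kB default (65536), any page body beyond the first 65536 bytes is discarded before parsing, which is exactly why links appearing only later in large pages are never discovered.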
[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830 ]

Marco Ebbinghaus edited comment on NUTCH-2517 at 3/8/18 7:59 PM:
-----------------------------------------------------------------
I double-checked the behavior of version 1.14 and the current master (via Docker containers).

*If you do a single crawling cycle*
_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs creates a merged segment (out of one segment) containing 2 folders: crawl_generate and crawl_parse
_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs creates a merged segment (out of one segment) containing 1 folder: crawl_generate
_(But it might be that doing a mergesegs after one single crawling cycle is a misuse of the software anyway, so let's have a look at doing multiple crawling cycles, which works better.)_

*If you do two crawling cycles:*
_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two segments) containing 6 folders: content, crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text
_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text{quote}
And I think all of these folders are still needed (by e.g. invertlinks), even if multiple segments are merged. So it works for 1.14, and no longer for master.

was (Author: mebbinghaus):
I double-checked the behavior of version 1.14 and the current master (via Docker containers).

*If you do a single crawling cycle*
_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs creates a merged segment (out of one segment) containing 2 folders: crawl_generate and crawl_parse
_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs creates a merged segment (out of one segment) containing 1 folder: crawl_generate
_(But it might be that doing a mergesegs after one single crawling cycle is a misuse of the software anyway, so let's have a look at doing multiple crawling cycles, which works better.)_

*If you do two crawling cycles:*
_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two segments) containing 6 folders: crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text
_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text{quote}
And I think all of these folders are still needed (by e.g. invertlinks), even if multiple segments are merged. So it works for 1.14, and no longer for master.
> mergesegs corrupts segment data
> -------------------------------
>
>                 Key: NUTCH-2517
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2517
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.15
>         Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>            Reporter: Marco Ebbinghaus
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>              Labels: mapreduce, mergesegs
>             Fix For: 1.15
>
>         Attachments: Screenshot_2018-03-03_18-09-28.png, Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit
> https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
> How to reproduce:
> * create container from apache/nutch image (latest)
> * open terminal in that container
> * set http.agent.name
> * create crawldir and urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data,
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830 ]

Marco Ebbinghaus commented on NUTCH-2517:
-----------------------------------------
I double-checked the behavior of version 1.14 and the current master (via Docker containers).

*If you do a single crawling cycle*
_1.14_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs creates a merged segment (out of one segment) containing 2 folders: crawl_generate and crawl_parse
_master_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs creates a merged segment (out of one segment) containing 1 folder: crawl_generate
_(But it might be that doing a mergesegs after one single crawling cycle is a misuse of the software anyway, so let's have a look at doing multiple crawling cycles, which works better.)_

*If you do two crawling cycles:*
_1.14_: bin/nutch inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs creates a merged segment (out of two segments) containing 6 folders: crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text
_master_: bin/nutch inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs creates a merged segment (out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text{quote}
And I think all of these folders are still needed (by e.g. invertlinks), even if multiple segments are merged. So it works for 1.14, and no longer for master.
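The invariant discussed in this comment, that a merged segment should still contain all six subdirectories that SegmentMerger reports consuming, can be checked mechanically. Below is a hedged sketch of such a check; `missing_segment_dirs` is a hypothetical helper for diagnosis, not part of Nutch:

```python
import os

# Subdirectories a fully fetched and parsed segment contains, per the
# SegmentMerger log line quoted above.
EXPECTED_SEGMENT_DIRS = [
    "content", "crawl_generate", "crawl_fetch",
    "crawl_parse", "parse_data", "parse_text",
]

def missing_segment_dirs(segment_path: str) -> list:
    """Return the expected subdirectories absent from a segment directory.
    A non-empty result for a merged segment signals the data loss described
    in this issue (e.g. invertlinks later failing on a missing parse_data)."""
    try:
        present = set(os.listdir(segment_path))
    except FileNotFoundError:
        present = set()
    return [d for d in EXPECTED_SEGMENT_DIRS if d not in present]
```

Run against the merged segment produced on master in the two-cycle scenario, such a check would report everything except crawl_generate and crawl_parse as missing, matching the behavior described above.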
[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Ebbinghaus updated NUTCH-2517:
------------------------------------
    Attachment: Screenshot_2018-03-07_07-50-05.png

> mergesegs corrupts segment data
> -------------------------------
>
>                 Key: NUTCH-2517
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2517
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.15
>         Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>            Reporter: Marco Ebbinghaus
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>              Labels: mapreduce, mergesegs
>             Fix For: 1.15
>
>         Attachments: Screenshot_2018-03-03_18-09-28.png, Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit
> https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
> How to reproduce:
> * create container from apache/nutch image (latest)
> * open terminal in that container
> * set http.agent.name
> * create crawldir and urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
> * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
> * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
> ** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
> ** resulting segment: 20180304134535
> * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
> * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments) which results in a consequential error
> ** console output: `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
> LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So as it seems, mapreduce corrupts the segment folder during the mergesegs command.
>
> Pay attention to the fact that this issue is not related to trying to merge a single segment like described above. As you can see on the attached screenshot, that problem also appears when executing multiple bin/nutch generate/fetch/parse/updatedb commands before executing mergesegs - resulting in a segment count > 1.
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389143#comment-16389143 ]

Marco Ebbinghaus commented on NUTCH-2517:
-----------------------------------------
I can also reproduce this when NOT running from a Docker container. I checked out the master on my desktop 30 minutes ago and did exactly the same workflow as described above, and the result is the same: after mergesegs I only have one folder, crawl_generate, in the merged segment. For the log output please see the attached screenshot. In the meantime I am using the apache/nutch container with tag release-1.14, which works as intended.
[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Ebbinghaus updated NUTCH-2517:

Description:

The problem probably occurs since commit [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
* create a container from the apache/nutch image (latest)
* open a terminal in that container
* set http.agent.name
* create the crawl directory and the urls file
* run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
* run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
** this results in a segment (e.g. 20180304134215)
* run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
* run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
* run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
* run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
** resulting segment: 20180304134535
* ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
* run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments), which then fails with a follow-on error
** console output:
`LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems MapReduce corrupts the segment folder during the mergesegs command.

Note that this issue is not limited to merging a single segment as described above. As the attached screenshot shows, the problem also appears when executing multiple bin/nutch generate/fetch/parse/updatedb commands before executing mergesegs, resulting in a segment count > 1.
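The reproduction sequence in the report can be recorded as a small, hedged sketch. This is purely illustrative: it only encodes the command strings from the issue description (with the report's example paths "mycrawl", "urls/urls" and segment timestamp as defaults), and is not a Nutch API.

```python
def repro_commands(crawl="mycrawl", seeds="urls/urls", segment="20180304134215"):
    """Return the bin/nutch invocations from the bug report, in order.

    All paths and the segment timestamp are the example values used in
    the issue; substitute your own when reproducing.
    """
    seg = f"{crawl}/segments/{segment}"
    return [
        f"bin/nutch inject {crawl}/crawldb {seeds}",
        f"bin/nutch generate {crawl}/crawldb {crawl}/segments 1",
        f"bin/nutch fetch {seg} -threads 2",
        f"bin/nutch parse {seg} -threads 2",
        f"bin/nutch updatedb {crawl}/crawldb {seg}",
        f"bin/nutch mergesegs {crawl}/MERGEDsegments {crawl}/segments/* -filter",
        f"bin/nutch invertlinks {crawl}/linkdb -dir {crawl}/MERGEDsegments",
    ]

# Print the sequence that triggers the failure described above.
print("\n".join(repro_commands()))
```

Per the report, everything up to and including mergesegs succeeds; the final invertlinks step is where the missing parse_data surfaces as an InvalidInputException.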
[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Ebbinghaus updated NUTCH-2517:

Description:

The problem probably occurs since commit [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
* create a container from the apache/nutch image (latest)
* open a terminal in that container
* set http.agent.name
* create the crawl directory and the urls file
* run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
* run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
** this results in a segment (e.g. 20180304134215)
* run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
* run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
* run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
* run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
** resulting segment: 20180304134535
* ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
* run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments), which then fails with a follow-on error
** console output:
`LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems MapReduce corrupts the segment folder during the mergesegs command.

Note that this issue is not limited to merging a single segment as described above. As the attached screenshot shows, the problem also appears when using bin/nutch generate with a topN > 1, resulting in a segment count > 1.
[jira] [Created] (NUTCH-2517) mergesegs corrupts segment data
Marco Ebbinghaus created NUTCH-2517:
---
Summary: mergesegs corrupts segment data
Key: NUTCH-2517
URL: https://issues.apache.org/jira/browse/NUTCH-2517
Project: Nutch
Issue Type: Bug
Components: segment
Affects Versions: 1.15
Environment: xubuntu 17.10, docker container of apache/nutch LATEST
Reporter: Marco Ebbinghaus
Attachments: Screenshot_2018-03-03_18-09-28.png

The problem probably occurs since commit [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
* create a container from the apache/nutch image (latest)
* open a terminal in that container
* set http.agent.name
* create the crawl directory and the urls file
* run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
* run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
** this results in a segment (e.g. 20180304134215)
* run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
* run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
* run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
* run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
** resulting segment: 20180304134535
* ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
* run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments), which then fails with a follow-on error
** console output:
`LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems MapReduce corrupts the segment folder during the mergesegs command.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
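The corruption reported here is detectable before invertlinks runs: a fetched and parsed segment should contain six subdirectories, while the merged segment retains only crawl_generate. The following sketch (not part of Nutch; the helper name is illustrative) checks a segment directory for the expected layout and demonstrates the symptom on a fake segment that, as in the report, kept only crawl_generate.

```python
import os
import tempfile

# The six subdirectories a fetched + parsed Nutch segment contains
# (from the "ls in the segment folder" step of the report).
EXPECTED_SEGMENT_DIRS = {
    "content", "crawl_fetch", "crawl_generate",
    "crawl_parse", "parse_data", "parse_text",
}

def missing_segment_dirs(segment_path):
    """Return the expected subdirectories that are absent from segment_path."""
    present = {name for name in os.listdir(segment_path)
               if os.path.isdir(os.path.join(segment_path, name))}
    return EXPECTED_SEGMENT_DIRS - present

# Demo: a merged segment that, as in the report, kept only crawl_generate.
demo_segment = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_segment, "crawl_generate"))
missing = missing_segment_dirs(demo_segment)
print(sorted(missing))
```

Running such a check over mycrawl/MERGEDsegments/* right after mergesegs would surface the missing parse_data immediately, instead of as the InvalidInputException that LinkDb raises later.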