[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830 ] Marco Ebbinghaus edited comment on NUTCH-2517 at 3/8/18 7:59 PM:

I double-checked the behavior of version 1.14 and the current master (via Docker containers).

*If you do a single crawling cycle*
_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs creates a merged segment (out of one segment) containing 2 folders: crawl_generate and crawl_parse
_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs creates a merged segment (out of one segment) containing 1 folder: crawl_generate
_(It might be that running mergesegs after a single crawling cycle is a misuse of the software anyway, so let's look at multiple crawling cycles, which work better.)_

*If you do two crawling cycles:*
_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two segments) containing 6 folders: content, crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text
_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even if multiple segments are merged. So it works for 1.14, and no longer for master.
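The 1.14-vs-master difference above can be checked mechanically by comparing a merged segment's subdirectories against the six that SegmentMerger reports using. A minimal sketch (the helper name is hypothetical, not part of Nutch):

```python
import os

# The six subdirectories SegmentMerger reports using; a merged segment
# consumed by later steps such as invertlinks is expected to contain all of them.
EXPECTED_SUBDIRS = {
    "content", "crawl_generate", "crawl_fetch",
    "crawl_parse", "parse_data", "parse_text",
}

def missing_segment_parts(segment_dir):
    """Return the expected subdirectories absent from a merged segment."""
    present = {d for d in os.listdir(segment_dir)
               if os.path.isdir(os.path.join(segment_dir, d))}
    return sorted(EXPECTED_SUBDIRS - present)
```

Run against master's output described above (only crawl_generate and crawl_parse present), this would report content, crawl_fetch, parse_data, and parse_text as missing.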
> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
> Reporter: Marco Ebbinghaus
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, Screenshot_2018-03-07_07-50-05.png
>
> The problem probably occurs since commit https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
> How to reproduce:
> * create container from apache/nutch image (latest)
> * open terminal in that container
> * set http.agent.name
> * create crawldir and urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data,
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391440#comment-16391440 ] Lewis John McGibbney commented on NUTCH-2517:

Can anyone else confirm the above?

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
> Reporter: Marco Ebbinghaus
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, Screenshot_2018-03-07_07-50-05.png
>
> The problem probably occurs since commit https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
> How to reproduce:
> * create container from apache/nutch image (latest)
> * open terminal in that container
> * set http.agent.name
> * create crawldir and urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
> * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
> * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
> ** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
> ** resulting segment: 20180304134535
> * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
> * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments) which results in a consequential error
> ** console output: `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
> LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
> at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems mapreduce corrupts the segment folder during the mergesegs command.
>
> Pay attention to the fact that this issue is not related to trying to merge a single segment as described above. As you can see on the attached screenshot, the problem also appears when executing multiple bin/nutch generate/fetch/parse/updatedb commands before executing mergesegs - resulting in a segment count > 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
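The InvalidInputException above is raised because LinkDb expects a parse_data directory inside every segment it is handed. A defensive pre-check along these lines (the helper is hypothetical, not part of Nutch) would surface the corrupted merged segment before the Hadoop job is even submitted:

```python
import os

def assert_invertible(segments_root):
    """Fail fast if any segment under segments_root is missing the
    parse_data directory that LinkDb reads during invertlinks."""
    bad = [seg for seg in sorted(os.listdir(segments_root))
           if not os.path.isdir(os.path.join(segments_root, seg, "parse_data"))]
    if bad:
        raise RuntimeError("segments missing parse_data: " + ", ".join(bad))
```

Run against mycrawl/MERGEDsegments as produced by master's mergesegs, this would name segment 20180304134535 instead of failing deep inside FileInputFormat.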
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391439#comment-16391439 ] ASF GitHub Bot commented on NUTCH-2517:

lewismc opened a new pull request #293: NUTCH-2517 mergesegs corrupts segment data
URL: https://github.com/apache/nutch/pull/293
This is mostly a cleanup of the classes concerned with https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2517, namely SegmentMerger and LinkDb.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391430#comment-16391430 ] Lewis John McGibbney commented on NUTCH-2517:

Hi [~mebbinghaus], I ran it from the Docker container and can reproduce some of your results; there is one nuance, however, which I'll explain below.

When I run mergesegs and inspect the data structures created within mycrawl/MERGEDsegments/segment/... I see BOTH crawl_generate and crawl_parse. So there must be something wrong with your crawl cycle for you to have generated only one directory. I'll leave that to you to confirm.

The other issue, however, is that when I attempt to invertlinks using one of the merged segments, I end up with the same stack trace as you, so I am looking into the code right now.
[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391262#comment-16391262 ]

Semyon Semyonov commented on NUTCH-1541:
----------------------------------------

Hi [~wastl-nagel], why wasn't this plugin merged into master?

> Indexer plugin to write CSV
> ---------------------------
>
> Key: NUTCH-1541
> URL: https://issues.apache.org/jira/browse/NUTCH-1541
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.7
> Reporter: Sebastian Nagel
> Priority: Minor
> Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
> With the new pluggable indexer a simple plugin would be handy to write configurable fields into a CSV file - for further analysis or just for export.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
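For context, the core of such an export is just the configured fields joined by a separator, with standard CSV quoting for values that contain the separator, quotes, or newlines. A minimal sketch of that quoting logic (a hypothetical helper class, not the code from the attached patches):

```java
import java.util.List;

/**
 * Sketch of CSV row formatting as a configurable-field export might use it.
 * Illustrative only; not the implementation in NUTCH-1541-v1/v2.patch.
 */
public class CsvRow {
    /** Quote a value if it contains the separator, a quote, or a newline. */
    static String escape(String value, char sep) {
        if (value.indexOf(sep) >= 0 || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0) {
            return '"' + value.replace("\"", "\"\"") + '"';
        }
        return value;
    }

    /** Join the selected field values into one CSV line. */
    static String toCsv(List<String> fields, char sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(sep);
            sb.append(escape(fields.get(i), sep));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A url, a title containing the separator, and a plain field:
        System.out.println(toCsv(
            List.of("http://example.com/", "a,title", "plain"), ','));
    }
}
```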
[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391258#comment-16391258 ]

Hudson commented on NUTCH-2411:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3506 (See [https://builds.apache.org/job/Nutch-trunk/3506/])
NUTCH-2411 Index-metadata to support indexing multiple values for a field (markus: [https://github.com/apache/nutch/commit/9a77f43774b2c3cd70785895afb989e9ee2d8d5f])
* (edit) src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
* (edit) conf/nutch-default.xml

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>     Separator to use if you want to index multiple values for a given field.
>     Leave empty to treat each value as a single value.
>   </description>
> </property>
>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
[jira] [Resolved] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-2411.
----------------------------------
Resolution: Fixed

Committed for 1.15
bd70d2fe..9a77f437 master -> master

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>     Separator to use if you want to index multiple values for a given field.
>     Leave empty to treat each value as a single value.
>   </description>
> </property>
>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391141#comment-16391141 ]

Markus Jelsma commented on NUTCH-2411:
--------------------------------------

Forgot the last time I threatened to commit; will try again. Will commit shortly unless there are objections!

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>     Separator to use if you want to index multiple values for a given field.
>     Leave empty to treat each value as a single value.
>   </description>
> </property>
>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
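As a usage illustration of the two properties quoted in the issue (the defaults ship empty in nutch-default.xml; the separator "|" and the field names below are illustrative assumptions, not values from the patch), a nutch-site.xml override might look like:

```xml
<!-- Example overrides for nutch-site.xml; values are illustrative only. -->
<property>
  <name>index.metadata.separator</name>
  <value>|</value>
</property>
<property>
  <name>index.metadata.multivalued.fields</name>
  <value>keywords,author</value>
</property>
```

With this, a raw metadata value such as "foo|bar" on a field listed in index.metadata.multivalued.fields would be indexed as two values instead of one.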
[jira] [Commented] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata
[ https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391139#comment-16391139 ]

Markus Jelsma commented on NUTCH-2525:
--------------------------------------

Any comments on this one? Julien did the initial work and treated parseMD differently from contentMD or crawlMD. The old code would never work, but he did it for some reason.

> Metadata indexer cannot handle uppercase parse metadata
> -------------------------------------------------------
>
> Key: NUTCH-2525
> URL: https://issues.apache.org/jira/browse/NUTCH-2525
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.14
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2525.patch
>
> MetadataIndexer lowercases keys for parse metadata, making it impossible to index metadata containing uppercase.
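A small sketch of the failure mode and one possible direction: once the indexer lowercases keys at lookup time, any parse metadata stored with uppercase characters can never match, whereas a case-insensitive lookup matches either form while keeping the stored key intact. The class below is illustrative only, not the MetadataIndexer code or the attached patch:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Illustration for NUTCH-2525 (hypothetical helper, not Nutch code):
 * a case-insensitive view over parse-metadata keys, so a lowercased
 * configuration key still finds a key stored with uppercase characters.
 */
public class MetaLookup {
    /** Wrap metadata in a lookup that ignores key case. */
    static Map<String, String> caseInsensitive(Map<String, String> meta) {
        Map<String, String> m = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        m.putAll(meta);
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> meta =
            caseInsensitive(Map.of("Content-Language", "en"));
        // A lowercased config key still matches the uppercase stored key,
        // without the stored key itself ever being lowercased away.
        System.out.println(meta.get("content-language")); // en
    }
}
```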