[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830
 ] 

Marco Ebbinghaus edited comment on NUTCH-2517 at 3/8/18 7:59 PM:
-

I double-checked the behavior of version 1.14 and the current master (via 
Docker containers).

*If you do a single crawling cycle:*

_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs 
creates a merged segment (out of one segment) containing 2 folders: 
crawl_generate and crawl_parse

_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs 
creates a merged segment (out of one segment) containing 1 folder: 
crawl_generate

_(It might be that running mergesegs after a single crawling cycle is a misuse 
of the software anyway, so let's look at multiple crawling cycles, which work 
better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate > 
fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two 
segments) containing 6 folders: content, crawl_generate, crawl_fetch, 
crawl_parse, parse_data, parse_text

_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate 
> fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two 
segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL 
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even 
if multiple segments are merged. So it works in 1.14, but no longer on master.
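The comparison above can be checked mechanically on disk. The sketch below is illustrative only (the timestamp and the idea of a standalone check are mine, not part of Nutch): it mocks a merged segment that, as on master, contains only crawl_generate, then lists which of the six subfolders SegmentMerger names are absent:

```shell
#!/bin/sh
# Mock a merged segment that (as on master) contains only crawl_generate.
# The path and timestamp are made up for illustration; a real merged segment
# would live under e.g. mycrawl/MERGEDsegments/<timestamp>.
seg="$(mktemp -d)/20180304134535"
mkdir -p "$seg/crawl_generate"

# Compare against the six subfolders SegmentMerger reports using.
missing=""
for d in content crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
  [ -d "$seg/$d" ] || missing="$missing $d"
done

if [ -n "$missing" ]; then
  echo "merged segment is missing:$missing"
else
  echo "merged segment is complete"
fi
```

Run against a real 1.14 merged segment this should report nothing missing; against the master output it flags everything except crawl_generate.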

 


> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, 

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391440#comment-16391440
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Can anyone else confirm the above ?

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems mapreduce corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As the attached screenshot shows, the problem also appears when 
> executing multiple bin/nutch generate/fetch/parse/updatedb commands before 
> executing mergesegs - resulting in a segment count > 1.
>  
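The invertlinks failure in the steps above comes down to LinkDb requiring a parse_data directory under every segment it is handed. A small preflight check along those lines (a sketch with a mocked layout; on a real crawl, segdir would point at mycrawl/MERGEDsegments):

```shell
#!/bin/sh
# Preflight for invertlinks: LinkDb reads parse_data from each segment, so a
# merged segment without it fails with InvalidInputException.
# The layout below is mocked for illustration; a real run would set
# segdir=mycrawl/MERGEDsegments instead.
segdir="$(mktemp -d)"
mkdir -p "$segdir/20180304134535/crawl_generate"  # what the buggy merge leaves

ok=1
for seg in "$segdir"/*/; do
  if [ ! -d "${seg}parse_data" ]; then
    echo "invertlinks would fail: ${seg}parse_data does not exist"
    ok=0
  fi
done
if [ "$ok" -eq 1 ]; then echo "all segments contain parse_data"; fi
```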



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391439#comment-16391439
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

lewismc opened a new pull request #293: NUTCH-2517 mergesegs corrupts segment 
data
URL: https://github.com/apache/nutch/pull/293
 
 
   This is mostly a cleanup of the Classes concerned with 
https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2517, namely 
SegmentMerger and LinkDb


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391430#comment-16391430
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Hi [~mebbinghaus] I ran it from the Docker container and can reproduce some of 
your results; there is one nuance, however, which I'll explain below.
When I run mergesegs and inspect the data structures created within 
mycrawl/MERGEDsegments/segment/... I see BOTH crawl_generate and crawl_parse. 
So there must be something wrong with your crawl cycle for you to have ended 
up with only one directory. I'll leave it to you to confirm that.

The other issue, however, is that when I attempt to run invertlinks using one 
of the merged segments, I end up with the same stack trace as you, so I am 
looking into the code right now.

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems that mapreduce corrupts the segment folder during mergesegs.
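
The failure mode above is a merged segment missing most of its subdirectories. A hypothetical sanity check (not part of Nutch; class and method names are assumptions) that verifies a segment still contains every directory that invertlinks and the indexer rely on could look like:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: report which of the standard segment subdirectories
// are absent from a (merged) segment directory.
public class SegmentCheck {
    static final List<String> EXPECTED = Arrays.asList(
        "content", "crawl_generate", "crawl_fetch",
        "crawl_parse", "parse_data", "parse_text");

    static List<String> missingDirs(File segment) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!new File(segment, name).isDirectory()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        // Simulate the buggy merged segment: only crawl_generate survives.
        File seg = java.nio.file.Files.createTempDirectory("segment").toFile();
        new File(seg, "crawl_generate").mkdir();
        // Prints the five subdirectories lost during mergesegs.
        System.out.println(missingDirs(seg));
    }
}
```

Running such a check between mergesegs and invertlinks would surface the corruption immediately, instead of failing later with the InvalidInputException shown above.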

[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2018-03-08 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391262#comment-16391262
 ] 

Semyon Semyonov commented on NUTCH-1541:


Hi [~wastl-nagel]
Why wasn't this plugin merged into master?

> Indexer plugin to write CSV
> ---
>
> Key: NUTCH-1541
> URL: https://issues.apache.org/jira/browse/NUTCH-1541
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.7
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer, a simple plugin would be handy to write 
> configurable fields into a CSV file - for further analysis or just for export.
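
The core of such a plugin is little more than CSV escaping and joining of the configured field values. A minimal sketch of that logic (hypothetical class name, not the attached patch):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the CSV-writing core such an indexer plugin needs:
// quote any field containing the separator, a quote, or a newline.
public class CsvFieldWriter {
    static String escape(String value, char sep) {
        if (value.indexOf(sep) >= 0 || value.indexOf('"') >= 0 || value.indexOf('\n') >= 0) {
            // Double embedded quotes, then wrap the whole field in quotes.
            return '"' + value.replace("\"", "\"\"") + '"';
        }
        return value;
    }

    static String toCsvLine(List<String> fields, char sep) {
        return fields.stream()
                .map(f -> escape(f, sep))
                .collect(Collectors.joining(String.valueOf(sep)));
    }

    public static void main(String[] args) {
        // Prints: url,"title, with comma",plain
        System.out.println(toCsvLine(Arrays.asList("url", "title, with comma", "plain"), ','));
    }
}
```

The actual plugin would additionally read the field list and separator from the Nutch configuration and write one line per indexed document.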



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2018-03-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391258#comment-16391258
 ] 

Hudson commented on NUTCH-2411:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3506 (See 
[https://builds.apache.org/job/Nutch-trunk/3506/])
NUTCH-2411 Index-metadata to support indexing multiple values for a field (markus: 
[https://github.com/apache/nutch/commit/9a77f43774b2c3cd70785895afb989e9ee2d8d5f])
* (edit) 
src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
* (edit) conf/nutch-default.xml


> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>     Separator to use if you want to index multiple values for a given field.
>     Leave empty to treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
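
Under these settings the behaviour amounts to splitting a field's raw value on the configured separator when the field is listed as multi-valued. A sketch of that logic (assumed names, not the actual MetadataIndexer code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of index.metadata.separator /
// index.metadata.multivalued.fields: multi-valued fields are split on the
// separator, everything else stays a single value.
public class MultiValueSplit {
    static List<String> indexValues(String field, String raw,
                                    Set<String> multiValued, String separator) {
        if (multiValued.contains(field) && !separator.isEmpty()) {
            // Quote the separator so characters like '|' are taken literally.
            return Arrays.asList(raw.split(java.util.regex.Pattern.quote(separator)));
        }
        return Collections.singletonList(raw);
    }

    public static void main(String[] args) {
        Set<String> mv = new HashSet<>(Arrays.asList("keywords"));
        // Prints: [nutch, crawler, hadoop]
        System.out.println(indexValues("keywords", "nutch|crawler|hadoop", mv, "|"));
    }
}
```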





[jira] [Resolved] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2018-03-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2411.
--
Resolution: Fixed

Committed for 1.15
bd70d2fe..9a77f437  master -> master








[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2018-03-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391141#comment-16391141
 ] 

Markus Jelsma commented on NUTCH-2411:
--

Forgot the last time I threatened to commit, will try again.
Will commit shortly unless there are objections!






[jira] [Commented] (NUTCH-2525) Metadata indexer cannot handle uppercase parse metadata

2018-03-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391139#comment-16391139
 ] 

Markus Jelsma commented on NUTCH-2525:
--

Any comments on this one? Julien did the initial work and treated parseMD 
differently from contentMD and crawlMD. The old code would never have worked, 
but he did it that way for some reason.

> Metadata indexer cannot handle uppercase parse metadata
> ---
>
> Key: NUTCH-2525
> URL: https://issues.apache.org/jira/browse/NUTCH-2525
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2525.patch
>
>
> MetadataIndexer lowercases keys for parse metadata, making it impossible to 
> index metadata containing uppercase. 
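
The effect described above can be illustrated with a map lookup (hypothetical code, not the actual MetadataIndexer): lowercasing the configured key before the lookup means metadata stored under an uppercase key is never found.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration of the reported bug: parse metadata stored with an uppercase
// key cannot be retrieved once the configured key is lowercased first.
public class LowercaseLookup {
    static String lookup(Map<String, String> parseMeta, String key, boolean lowercaseKey) {
        return parseMeta.get(lowercaseKey ? key.toLowerCase(Locale.ROOT) : key);
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put("Author", "Jane Doe");  // parse metadata with an uppercase key

        System.out.println(lookup(parseMeta, "Author", true));   // buggy path: null
        System.out.println(lookup(parseMeta, "Author", false));  // fixed path: Jane Doe
    }
}
```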


