[jira] [Updated] (NUTCH-2666) increase default value for http.content.limit

2018-10-23 Thread Marco Ebbinghaus (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Ebbinghaus updated NUTCH-2666:

Description: 
The default value for http.content.limit in nutch-default.xml ("The length limit
for downloaded content using the http protocol, in bytes. If this value is
nonnegative (>=0), content longer than it will be truncated; otherwise, no
truncation at all. Do not confuse this setting with the file.content.limit
setting.") is 64 kB. Maybe this default value should be increased, as many pages
today are larger than 64 kB.

This hit me when trying to crawl a single website whose pages are much larger
than 64 kB: with every crawl cycle the count of db_unfetched URLs decreased
until it hit zero and the crawler became inactive, because the first 64 kB of
each page always contained the same set of navigation links.

The property description might also be updated, as this is not only the case
for the http protocol but also for https.
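
As an illustration (not part of the issue itself): the default can be
overridden per installation in conf/nutch-site.xml. A minimal sketch, assuming
a local runtime directory and a purely hypothetical 1 MB limit:

{code:bash}
# Sketch: override the 64 kB default for this installation.
# http.content.limit is the real property name; the 1 MB value is only an example.
# Note: this overwrites conf/nutch-site.xml - merge by hand if the file already
# holds other overrides (e.g. http.agent.name).
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>1048576</value> <!-- 1 MB; the shipped default is 65536 bytes -->
  </property>
</configuration>
EOF
{code}

Per the property description quoted above, any negative value (e.g. -1)
disables truncation entirely.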

  was:
The default value for http.content.limit in nutch-default.xml ("The length limit
for downloaded content using the http protocol, in bytes. If this value is
nonnegative (>=0), content longer than it will be truncated; otherwise, no
truncation at all. Do not confuse this setting with the file.content.limit
setting.") is 64 kB. Maybe this default value should be increased, as many pages
today are larger than 64 kB.

The property description might also be updated, as this is not only the case
for the http protocol but also for https.


> increase default value for http.content.limit
> -
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Marco Ebbinghaus
>Priority: Minor
>
> The default value for http.content.limit in nutch-default.xml ("The length
> limit for downloaded content using the http protocol, in bytes. If this value
> is nonnegative (>=0), content longer than it will be truncated; otherwise, no
> truncation at all. Do not confuse this setting with the file.content.limit
> setting.") is 64 kB. Maybe this default value should be increased, as many
> pages today are larger than 64 kB.
> This hit me when trying to crawl a single website whose pages are much larger
> than 64 kB: with every crawl cycle the count of db_unfetched URLs decreased
> until it hit zero and the crawler became inactive, because the first 64 kB of
> each page always contained the same set of navigation links.
> The property description might also be updated, as this is not only the case
> for the http protocol but also for https.





[jira] [Updated] (NUTCH-2666) increase default value for http.content.limit

2018-10-23 Thread Marco Ebbinghaus (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Ebbinghaus updated NUTCH-2666:

Description: 
The default value for http.content.limit in nutch-default.xml ("The length limit
for downloaded content using the http protocol, in bytes. If this value is
nonnegative (>=0), content longer than it will be truncated; otherwise, no
truncation at all. Do not confuse this setting with the file.content.limit
setting.") is 64 kB. Maybe this default value should be increased, as many pages
today are larger than 64 kB.

The property description might also be updated, as this is not only the case
for the http protocol but also for https.

  was:
The default value for http.content.limit ("The length limit for downloaded
content using the http protocol, in bytes. If this value is nonnegative (>=0),
content longer than it will be truncated; otherwise, no truncation at all. Do
not confuse this setting with the file.content.limit setting.") is 64 kB. Maybe
this default value should be increased, as many pages today are larger than
64 kB.

The property description might also be updated, as this is not only the case
for the http protocol but also for https.


> increase default value for http.content.limit
> -
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Marco Ebbinghaus
>Priority: Minor
>
> The default value for http.content.limit in nutch-default.xml ("The length
> limit for downloaded content using the http protocol, in bytes. If this value
> is nonnegative (>=0), content longer than it will be truncated; otherwise, no
> truncation at all. Do not confuse this setting with the file.content.limit
> setting.") is 64 kB. Maybe this default value should be increased, as many
> pages today are larger than 64 kB.
> The property description might also be updated, as this is not only the case
> for the http protocol but also for https.





[jira] [Created] (NUTCH-2666) increase default value for http.content.limit

2018-10-23 Thread Marco Ebbinghaus (JIRA)
Marco Ebbinghaus created NUTCH-2666:
---

 Summary: increase default value for http.content.limit
 Key: NUTCH-2666
 URL: https://issues.apache.org/jira/browse/NUTCH-2666
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.15
Reporter: Marco Ebbinghaus


The default value for http.content.limit ("The length limit for downloaded
content using the http protocol, in bytes. If this value is nonnegative (>=0),
content longer than it will be truncated; otherwise, no truncation at all. Do
not confuse this setting with the file.content.limit setting.") is 64 kB. Maybe
this default value should be increased, as many pages today are larger than
64 kB.

The property description might also be updated, as this is not only the case
for the http protocol but also for https.





[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830
 ] 

Marco Ebbinghaus edited comment on NUTCH-2517 at 3/8/18 7:59 PM:
-

I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 2 folders:
crawl_generate and crawl_parse

_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate >
fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two
segments) containing 6 folders: content, crawl_generate, crawl_fetch,
crawl_parse, parse_data, parse_text

_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ >
+generate > fetch > parse > updatedb+ > mergesegs creates a merged segment
(out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.
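
For reference, here are the two cycles above as concrete commands, assuming
the same mycrawl/urls layout as in the issue description (a sketch, not the
exact session):

{code:bash}
# Sketch of the two-cycle sequence; paths follow the issue's "mycrawl" layout.
bin/nutch inject mycrawl/crawldb urls/urls
for i in 1 2; do
  bin/nutch generate mycrawl/crawldb mycrawl/segments 1
  SEGMENT=$(ls -d mycrawl/segments/* | tail -1)   # newest segment, e.g. 20180304134215
  bin/nutch fetch "$SEGMENT" -threads 2
  bin/nutch parse "$SEGMENT" -threads 2
  bin/nutch updatedb mycrawl/crawldb "$SEGMENT"
done
bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter
ls mycrawl/MERGEDsegments/*   # 1.14: all six subfolders; master: crawl_generate, crawl_parse
{code}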

 


was (Author: mebbinghaus):
I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 2 folders:
crawl_generate and crawl_parse

_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate >
fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two
segments) containing 6 folders: crawl_generate, crawl_fetch, crawl_parse,
parse_data, parse_text

_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ >
+generate > fetch > parse > updatedb+ > mergesegs creates a merged segment
(out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.

 

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem has probably existed since commit
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, 

[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830
 ] 

Marco Ebbinghaus edited comment on NUTCH-2517 at 3/8/18 7:58 PM:
-

I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 2 folders:
crawl_generate and crawl_parse

_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch inject > +generate > fetch > parse > updatedb+ > +generate >
fetch > parse > updatedb+ > mergesegs creates a merged segment (out of two
segments) containing 6 folders: crawl_generate, crawl_fetch, crawl_parse,
parse_data, parse_text

_master_: bin/nutch inject > +generate > fetch > parse > updatedb+ >
+generate > fetch > parse > updatedb+ > mergesegs creates a merged segment
(out of two segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.

 


was (Author: mebbinghaus):
I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 2 folders:
crawl_generate and crawl_parse

_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch
inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs
creates a merged segment (out of two segments) containing 6 folders:
crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text

_master_: bin/nutch inject > generate > fetch > parse > updatedb > generate >
fetch > parse > updatedb > mergesegs creates a merged segment (out of two
segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.

 

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem has probably existed since commit
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, 

[jira] [Comment Edited] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830
 ] 

Marco Ebbinghaus edited comment on NUTCH-2517 at 3/8/18 7:57 PM:
-

I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 2 folders:
crawl_generate and crawl_parse

_master_: bin/nutch inject > generate > fetch > parse > updatedb > mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch
inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs
creates a merged segment (out of two segments) containing 6 folders:
crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text

_master_: bin/nutch inject > generate > fetch > parse > updatedb > generate >
fetch > parse > updatedb > mergesegs creates a merged segment (out of two
segments) containing 2 folders: crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.

 


was (Author: mebbinghaus):
I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs creates
a merged segment (out of one segment) containing 2 folders: crawl_generate and
crawl_parse

_master_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch
inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs
creates a merged segment (out of two segments) containing 6 folders:
crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text

_master_: bin/nutch
inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs
creates a merged segment (out of two segments) containing 2 folders:
crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.

 

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem has probably existed since commit
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, 

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391830#comment-16391830
 ] 

Marco Ebbinghaus commented on NUTCH-2517:
-

I double-checked the behavior of version 1.14 and the current master (via
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs creates
a merged segment (out of one segment) containing 2 folders: crawl_generate and
crawl_parse

_master_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs
creates a merged segment (out of one segment) containing 1 folder:
crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a
misuse of the software anyway, so let's have a look at multiple crawling
cycles, which work better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch
inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs
creates a merged segment (out of two segments) containing 6 folders:
crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text

_master_: bin/nutch
inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs
creates a merged segment (out of two segments) containing 2 folders:
crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even
when multiple segments are merged. So it works in 1.14, but no longer on
master.

 

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem has probably existed since commit
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
> mycrawl/MERGEDsegments), which then fails with a follow-on error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> 

[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Marco Ebbinghaus (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Ebbinghaus updated NUTCH-2517:

Attachment: Screenshot_2018-03-07_07-50-05.png

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem has probably existed since commit
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
> mycrawl/MERGEDsegments), which then fails with a follow-on error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems the MapReduce job corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as described
> above. As the attached screenshot shows, the problem also appears when
> executing multiple bin/nutch generate/fetch/parse/updatedb rounds before
> running mergesegs - resulting in a segment count > 1.
>  





[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389143#comment-16389143
 ] 

Marco Ebbinghaus commented on NUTCH-2517:
-

I can also reproduce this when NOT running from a Docker container. I checked
out master on my desktop 30 minutes ago and ran exactly the same workflow as
described above, and the result is the same: after mergesegs I only have one
folder, crawl_generate, in the merged segment. For the log output please see
the attached screenshot.

In the meantime I am using the apache/nutch container with tag release-1.14, 
which is working as intended.
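
In case anyone needs the same workaround, a sketch of switching to that image
(the tag is the one named above; shell/entrypoint details may differ):

{code:bash}
# Sketch: fall back to the 1.14 image, where mergesegs keeps all subfolders.
docker pull apache/nutch:release-1.14
docker run -it apache/nutch:release-1.14 /bin/bash
{code}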

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png
>
>
> The problem has probably existed since commit
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
> mycrawl/MERGEDsegments), which then fails with a follow-on error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems the MapReduce job corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as described
> above. As the attached screenshot shows, that problem also appears when 

[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-04 Thread Marco Ebbinghaus (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Ebbinghaus updated NUTCH-2517:

Description: 
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
 ** resulting segment: 20180304134535
 * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: 
crawl_generate
 * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
mycrawl/MERGEDsegments), which then fails with a follow-on error
 ** console output: `LinkDb: adding segment: 
[file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
 LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
path does not exist: 
[file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
     at 
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems the MapReduce job corrupts the segment folder during the mergesegs command.

 

Note that this issue is not limited to merging a single segment as described
above. As the attached screenshot shows, the problem also appears when
executing multiple bin/nutch generate/fetch/parse/updatedb rounds before
running mergesegs - resulting in a segment count > 1.
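
For convenience, the steps above as a single shell sketch (segment names will
differ; the check before invertlinks just makes the missing parse_data
visible):

{code:bash}
# Sketch of the reproduction steps; assumes http.agent.name is already set.
bin/nutch inject mycrawl/crawldb urls/urls
bin/nutch generate mycrawl/crawldb mycrawl/segments 1
SEGMENT=$(ls -d mycrawl/segments/* | tail -1)     # e.g. mycrawl/segments/20180304134215
bin/nutch fetch "$SEGMENT" -threads 2
bin/nutch parse "$SEGMENT" -threads 2
bin/nutch updatedb mycrawl/crawldb "$SEGMENT"
bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter
MERGED=$(ls -d mycrawl/MERGEDsegments/* | tail -1)
ls "$MERGED"                                      # broken build: only crawl_generate
[ -d "$MERGED/parse_data" ] || echo "parse_data missing - invertlinks will fail"
bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments
{code}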

 

  was:
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch 

[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-04 Thread Marco Ebbinghaus (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Ebbinghaus updated NUTCH-2517:

Description: 
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
 ** resulting segment: 20180304134535
 * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: 
crawl_generate
 * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
mycrawl/MERGEDsegments), which then fails with a follow-on error
 ** console output: `LinkDb: adding segment: 
[file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
 LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
path does not exist: 
[file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
     at 
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems the MapReduce job corrupts the segment folder during the mergesegs command.

 

Note that this issue is not limited to merging a single segment as described
above. As the attached screenshot shows, the problem also appears when using
bin/nutch generate with a topN > 1 - resulting in a segment count > 1.

 

  was:
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
 ** resulting 

[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-04 Thread Marco Ebbinghaus (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Ebbinghaus updated NUTCH-2517:

Description: 
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
 ** resulting segment: 20180304134535
 * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: 
crawl_generate
 * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
mycrawl/MERGEDsegments), which then fails with a follow-on error
 ** console output: `LinkDb: adding segment: 
[file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
 LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
path does not exist: 
[file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
     at 
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
     at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
     at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems the MapReduce job corrupts the segment folder during the mergesegs command.

 

Note that this issue is not limited to merging a single segment as described
above. As the attached screenshot shows, the problem also appears when using
bin/nutch generate with a topN > 1 - resulting in a segment count > 1.

 

  was:
The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
 ** resulting 

[jira] [Created] (NUTCH-2517) mergesegs corrupts segment data

2018-03-04 Thread Marco Ebbinghaus (JIRA)
Marco Ebbinghaus created NUTCH-2517:
---

 Summary: mergesegs corrupts segment data
 Key: NUTCH-2517
 URL: https://issues.apache.org/jira/browse/NUTCH-2517
 Project: Nutch
  Issue Type: Bug
  Components: segment
Affects Versions: 1.15
 Environment: xubuntu 17.10, docker container of apache/nutch LATEST
Reporter: Marco Ebbinghaus
 Attachments: Screenshot_2018-03-03_18-09-28.png

The problem has probably existed since commit
[https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]

How to reproduce:
 * create container from apache/nutch image (latest)
 * open terminal in that container
 * set http.agent.name
 * create crawldir and urls file
 * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
 * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 
1)
 ** this results in a segment (e.g. 20180304134215)
 * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
-threads 2)
 * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
-threads 2)
 ** ls in the segment folder -> existing folders: content, crawl_fetch, 
crawl_generate, crawl_parse, parse_data, parse_text
 * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180304134215)
 * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
mycrawl/segments/* -filter)
 ** console output: `SegmentMerger: using segment data from: content 
crawl_generate crawl_fetch crawl_parse parse_data parse_text`
 ** resulting segment: 20180304134535
 * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: 
crawl_generate
 * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir
mycrawl/MERGEDsegments), which then fails with a follow-on error
 ** console output: `LinkDb: adding segment: 
file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path 
does not exist: 
file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
    at 
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
    at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`

So it seems the MapReduce job corrupts the segment folder during the mergesegs command.

 


