[jira] [Commented] (NUTCH-2806) Nutch can't parse links

2020-07-27 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165557#comment-17165557
 ] 

Sebastian Nagel commented on NUTCH-2806:


Hi [~immobilier-dz], this could also be caused by {{http.content.limit}}, which
by default in 2.4 fetches only the first 64 KiB of the page. If you increase the
limit, more links are found. You can test it by running
{noformat}
 $NUTCH_HOME/bin/nutch parsechecker -Dhttp.content.limit=-1 https://www.algeriahome.com/
{noformat}
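
To illustrate why a content limit reduces the number of outlinks, here is a minimal Python sketch. It is not Nutch code: the page is synthetic, and {{LinkCounter}} and {{count_links}} are hypothetical names introduced only for this example. It counts the links found in the first 64 KiB of a large page and then in the whole page, treating a negative limit as "no truncation" in the same way as {{http.content.limit=-1}}.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper, not Nutch code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def count_links(content: bytes, limit: int = -1) -> int:
    """Count links in the first `limit` bytes; a negative limit means no truncation."""
    if limit >= 0:
        content = content[:limit]
    parser = LinkCounter()
    parser.feed(content.decode("utf-8", errors="ignore"))
    parser.close()
    return len(parser.links)


# Synthetic page: 5000 links, padded so the page is far larger than 64 KiB.
page = b"<html><body>" + b"".join(
    b'<a href="/p/%d">item</a><p>%s</p>\n' % (i, b"x" * 100) for i in range(5000)
) + b"</body></html>"

LIMIT = 64 * 1024  # 65536 bytes, the default limit mentioned above
truncated = count_links(page, LIMIT)
full = count_links(page, -1)
print(f"links within first {LIMIT} bytes: {truncated}; links in full page: {full}")
```

If the link count reported by parsechecker rises once the limit is lifted, truncation is the likely cause of the missing links.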

> Nutch can't parse links 
> 
>
> Key: NUTCH-2806
> URL: https://issues.apache.org/jira/browse/NUTCH-2806
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4
> Reporter: lina dziri
> Priority: Major
>
> Testing with the following site: 
> [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
> links that contain the base URL. 
> I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and 
> tried practically every suggestion about detecting all the links. I suspected 
> urlfilter or regex-normalizer, so I disabled them, but the results were the same. 
> Each time I rebuild Nutch and test the parser, it gives the same URL count, 
> around 378. 
> Can somebody help fix this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2806) Nutch can't parse links

2020-07-10 Thread Jorge Luis Betancourt Gonzalez (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155785#comment-17155785
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2806:
---

Hi [~immobilier-dz], can you check the value of the {{db.ignore.external.links}} 
setting in your configuration? By default it is set to false, which means that 
Nutch should at least detect the external links and add them for fetching in 
a future crawl. See 
[https://github.com/apache/nutch/blob/2.x/conf/nutch-default.xml#L498-L505]
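
As a rough illustration of what this setting distinguishes, the Python sketch below splits a page's outlinks into internal and external ones by comparing hostnames with the page's own host. It is not Nutch's implementation: {{split_outlinks}} is a hypothetical helper, the sample hrefs are made up, and comparing by exact hostname is an assumption made for the example. With the setting at true, links like the ones in the {{external}} list would be dropped.

```python
from urllib.parse import urljoin, urlparse


def split_outlinks(base_url, hrefs):
    """Resolve hrefs against base_url and split them by host (illustrative only)."""
    base_host = urlparse(base_url).hostname
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        host = urlparse(absolute).hostname
        # Assumption: a link is "external" when its hostname differs from the page's.
        (internal if host == base_host else external).append(absolute)
    return internal, external


base = "https://www.algeriahome.com/"
# Made-up sample links for the example.
hrefs = [
    "/annonces",
    "contact.html",
    "https://www.example.org/page",
    "https://issues.apache.org/jira/browse/NUTCH-2806",
]

internal, external = split_outlinks(base, hrefs)
print("internal:", internal)
print("external:", external)
```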

Finally, keep in mind that it is normally best to send this type of inquiry to 
the users/developers mailing lists 
([https://nutch.apache.org/mailing_lists.html]).

> Nutch can't parse links 
> 
>
> Key: NUTCH-2806
> URL: https://issues.apache.org/jira/browse/NUTCH-2806
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4
> Reporter: lina dziri
> Priority: Major
> Fix For: 2.4
>
>
> Testing with the following site: 
> [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
> links that contain the base URL. 
> I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and 
> tried practically every suggestion about detecting all the links. I suspected 
> urlfilter or regex-normalizer, so I disabled them, but the results were the same. 
> Each time I rebuild Nutch and test the parser, it gives the same URL count, 
> around 378. 
> Can somebody help fix this?


