Hi Yossi,
OK, I see, you need administrator privileges to reopen old issues.
Done: reopened NUTCH-1106.
Opened a new issue NUTCH-2530 instead of reopening NUTCH-2220,
to avoid accidentally modifying the release notes, e.g.
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680
I think the first one should also be handled by reopening NUTCH-2220, which
specifically mentions renaming db.max.anchor.length. The problem is that it
seems like I am not able to reopen a closed/resolved issue. Sorry...
> -Original Message-
> From: Sebastian Nagel
> Sent: 12 March 2018
>> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> OK, agreed, but it should also be moved to the LinkDB section in
> nutch-default.xml.
Yes, of course, plus make the description more explicit.
Could you open a Jira issue for this?
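For reference, the renamed entry might look like this once moved to the LinkDB section of nutch-default.xml (a sketch only: the value 100 assumes the current db.max.anchor.length default, and the description wording is illustrative, not the actual text):

```xml
<!-- sketch: renamed property, relocated under the LinkDB section -->
<property>
  <name>linkdb.max.anchor.length</name>
  <value>100</value>
  <description>The maximum number of characters permitted in an anchor text;
  longer anchors are truncated when links are inverted into the LinkDb.
  </description>
</property>
```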
> It should apply to outlinks received f
> Which property, db.max.outlinks.per.page or db.max.anchor.length?
db.max.anchor.length. As I already said, writing
"db.max.outlinks.per.page" was a copy/paste error.
> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
OK, agreed, but it should also be moved to the LinkDB section in
nutch-default.xml.
Hi Sebastian,
I think that the simplest (and more solid than the regex modification) way would
be to modify ParseOutputFormat.filterNormalize.
As far as I can see, all the URL modifications/filtrations occur there.
Therefore, if at the beginning we add a check before
if (fromUrl.equals(toUrl)) {
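A minimal sketch of such a guard (the class, constant, and threshold below are hypothetical, not the actual Nutch code; only the method name filterNormalize and the quoted condition come from this thread):

```java
// Hypothetical stand-in for the early part of
// ParseOutputFormat.filterNormalize: reject over-long outlink URLs
// before any regex filters/normalizers ever see them.
public class OutlinkGuard {

  static final int MAX_URL_LENGTH = 2048; // illustrative threshold

  static String filterNormalize(String fromUrl, String toUrl) {
    if (toUrl == null || toUrl.length() > MAX_URL_LENGTH) {
      return null; // drop: too long to be a sane URL
    }
    if (fromUrl.equals(toUrl)) {
      return null; // the quoted self-link condition would sit here
    }
    return toUrl; // the real code would now run filters/normalizers
  }
}
```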
Hi Semyon, Yossi, Markus,
> what db.max.anchor.length was supposed to do
it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text), i.e.
<a href="...">anchor text</a>
Can we agree to use the term "anchor" in this meaning?
At least, that's how it is used in the class Outlink and hopefully throughout
Nutch.
The other properties in this section actually affect parsing (e.g.
db.max.outlinks.per.page). I was under the impression that this is what
db.max.anchor.length was supposed to do, and actually increased its value.
Turns out this is one of the many things in Nutch that are not intuitive (or in
t
So, what is the conclusion?
Should it be solved in the regex file or through this property?
Though, how is a property of the CrawlDb/LinkDb supposed to prevent this problem
in Parse?
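If it were solved on the regex side, one option (a sketch, assuming the standard regex-urlfilter.txt syntax, where a leading "-" rejects and rules are applied in order) is to reject over-long URLs before any later, potentially expensive rule can run; the 2000-character threshold is illustrative:

```
# hypothetical first rule: reject URLs longer than 2000 characters
# so that later, potentially backtracking rules never see them
-^.{2001,}
```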
Sent: Monday, March 12, 2018 at 1:42 PM
From: "Edward Capriolo"
To: "user@nutch.apache.org"
Subject: Re: UrlRegexFilter
Some regular expressions (those with backtracking) can be very expensive for
long strings:
https://regular-expressions.mobi/catastrophic.html?wlr=1
Maybe that is your issue.
On Monday, March 12, 2018, Sebastian Nagel
wrote:
> Good catch. It should be renamed to be consistent with other properties,
> right?
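To illustrate the backtracking point: a nested quantifier like (a+)+b needs on the order of 2^n steps to fail on a string of n 'a's with no trailing 'b', which is how a multi-kilobyte URL can stall a regex filter for hours. A sketch (not Nutch code) using Java's possessive quantifier, which refuses to give characters back and therefore fails in linear time:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
  public static void main(String[] args) {
    // 50 'a's with no terminating 'b' -- the worst case for "(a+)+b".
    String input = "a".repeat(50) + "!";

    // Pattern.compile("(a+)+b").matcher(input).matches() would take
    // on the order of 2^50 backtracking steps here; do not run it.

    // Possessive "a++" never returns characters to the engine,
    // so the failure is detected almost immediately.
    Pattern safe = Pattern.compile("(a++)+b");
    System.out.println(safe.matcher(input).matches()); // false, instantly
  }
}
```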
Good catch. It should be renamed to be consistent with other properties, right?
On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code
> uses the property name linkdb.max.anchor.length.
Perhaps, however it starts with db, not linkdb (like the other linkdb
properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code
uses the property name linkdb.max.anchor.length.
> -Original Message-
> From: Markus Jelsma
> Sent: 12 March 2018 14:05
> To: user@nutch.apache.org
That is for the LinkDB.
-Original message-
> From:Yossi Tamari
> Sent: Monday 12th March 2018 13:02
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste
> error...
Hi Yossi,
it's used in FetcherThread and ParseOutputFormat:
git grep -F db.max.outlinks.per.page
However, it's not there to limit the length of a single outlink in characters
but the number of outlinks followed (added to the CrawlDb).
There was NUTCH-1106 to add a property to limit the outlink length.
Sebastian
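So, as described above, db.max.outlinks.per.page caps how many outlinks per page are processed, roughly like this (an illustrative sketch; the class and method names are made up, though the property name and its default of 100 appear in the grep output quoted in this thread, and "negative means unlimited" is an assumption about the semantics):

```java
import java.util.Arrays;

// Illustrative sketch of what db.max.outlinks.per.page does:
// it bounds the number of outlinks kept per page, not the
// length of any individual link.
public class OutlinkCap {

  static String[] capOutlinks(String[] outlinks, int maxOutlinksPerPage) {
    if (maxOutlinksPerPage < 0) {
      return outlinks; // assume a negative value means "no limit"
    }
    return Arrays.copyOf(outlinks,
        Math.min(outlinks.length, maxOutlinksPerPage));
  }

  public static void main(String[] args) {
    String[] links = {"http://a/", "http://b/", "http://c/"};
    System.out.println(capOutlinks(links, 2).length); // 2
  }
}
```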
Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
> -Original Message-
> From: Markus Jelsma
> Sent: 12 March 2018 14:01
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long
> links
>
> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
nutch-default.xml contains a property db.max.outlinks.per.page, which I think is
supposed to prevent these cases. However, I just searched the code and couldn't
find where it is used. Bug?
> -Original Message-
> From: Semyon Semyonov
> Sent: 12 March 2018 12:47
> To: user@nutch.apache.org
Hello - see inline.
Regards,
Markus
-Original message-
> From:Semyon Semyonov
> Sent: Monday 12th March 2018 11:47
> To: user@nutch.apache.org
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>
> Dear all,
>
> There is an issue with UrlRegexFilter and parsing.
Dear all,
There is an issue with UrlRegexFilter and parsing. On average, parsing takes
about 1 millisecond, but sometimes websites have crazy links that destroy the
parsing (it takes 3+ hours and breaks the next steps of the crawl).
For example, below you can see a shortened logged version.