> Which property, db.max.outlinks.per.page or db.max.anchor.length?

db.max.anchor.length. As I already said, when I wrote "db.max.outlinks.per.page" it was a copy/paste error.
> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length

OK, agreed, but it should also be moved to the LinkDB section in nutch-default.xml.

> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> - it should be applied before URL normalizers

Agreed, but it seems to me the most natural place to add it is where db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It should apply to outlinks received from the parser, not to injected URLs, for example. The only other place I can think of where this may be needed is after a redirect. This is pretty much the same as what Semyon suggests, whether we push it down into the filterNormalize method or do it before calling it.

	Yossi.

> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 12 March 2018 15:57
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
>
> Hi Semyon, Yossi, Markus,
>
> > what db.max.anchor.length was supposed to do
>
> It's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text):
>   <a href="url">anchor text</a>
> Can we agree to use the term "anchor" in this meaning? At least, that's how
> it is used in the class Outlink and hopefully throughout Nutch.
>
> > Personally, I still think the property should be used to limit outlink
> > length in parsing,
>
> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> > I was about renaming
>
> db.max.anchor.length -> linkdb.max.anchor.length
> This property was forgotten when making the naming more consistent in
> [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
>
> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> - it should be applied before URL normalizers
>   (that would be the main advantage over adding a regex filter rule)
> - but probably for all tools / places where URLs are filtered
>   (ugly because there are many of them)
> - one option would be to rethink the pipeline of URL normalizers and filters
>   as Julien did it for Storm-crawler [1]
> - a pragmatic solution to keep the code changes limited: do the length check
>   twice, at the beginning of URLNormalizers.normalize(...) and of
>   URLFilters.filter(...) (it's not guaranteed that normalizers are always
>   called)
> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>   to limit the length to 512 (or 1024/2048) characters
>
> Best,
> Sebastian
>
> [1] https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json
>
> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> > The other properties in this section actually affect parsing (e.g.
> > db.max.outlinks.per.page). I was under the impression that this is what
> > db.max.anchor.length was supposed to do, and actually increased its value.
> > Turns out this is one of the many things in Nutch that are not intuitive
> > (or, in this case, does nothing at all). One of the reasons I thought so
> > is that very long links can be used as an attack on crawlers.
> >
> > Personally, I still think the property should be used to limit outlink
> > length in parsing, but if that is not what it's supposed to do, I guess
> > it needs to be renamed (to match the code), moved to a different section
> > of the properties file, and perhaps better documented.
> > In that case, you'll need to use Markus' solution, and basically
> > everybody should use Markus' first rule...
> >
> >> -----Original Message-----
> >> From: Semyon Semyonov <semyon.semyo...@mail.com>
> >> Sent: 12 March 2018 14:51
> >> To: user@nutch.apache.org
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> >>
> >> So, which is the conclusion?
> >>
> >> Should it be solved in the regex file or through this property?
> >>
> >> Though, how is a property of the crawldb/linkdb supposed to prevent this
> >> problem in Parse?
> >>
> >> Sent: Monday, March 12, 2018 at 1:42 PM
> >> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> >> To: "user@nutch.apache.org" <user@nutch.apache.org>
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long links
> >>
> >> Some regular expressions (those with backtracking) can be very expensive
> >> for long strings:
> >>
> >> https://regular-expressions.mobi/catastrophic.html?wlr=1
> >>
> >> Maybe that is your issue.
> >>
> >> On Monday, March 12, 2018, Sebastian Nagel <wastl.na...@googlemail.com>
> >> wrote:
> >>
> >>> Good catch. It should be renamed to be consistent with other
> >>> properties, right?
> >>>
> >>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> >>>> Perhaps, however it starts with db, not linkdb (like the other linkdb
> >>>> properties), it is in the CrawlDB part of nutch-default.xml, and the
> >>>> LinkDB code uses the property name linkdb.max.anchor.length.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>>> Sent: 12 March 2018 14:05
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>
> >>>>> That is for the LinkDB.
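[Editor's note: for readers following the rename discussion, the property Markus refers to might end up looking roughly like this in the LinkDB section of nutch-default.xml. This is an illustrative sketch, not the actual file contents: the description wording is a paraphrase, and the default value of 100 is assumed to carry over from the existing db.max.anchor.length entry.]

```xml
<!-- linkdb properties (illustrative sketch, not the shipped nutch-default.xml) -->
<property>
  <name>linkdb.max.anchor.length</name>
  <value>100</value>
  <description>The maximum number of characters permitted in an anchor
  text, i.e. the text inside an &lt;a href="..."&gt;...&lt;/a&gt; element,
  when building the LinkDB.</description>
</property>
```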
> >>>>>
> >>>>> -----Original message-----
> >>>>>> From: Yossi Tamari <yossi.tam...@pipl.com>
> >>>>>> Sent: Monday 12th March 2018 13:02
> >>>>>> To: user@nutch.apache.org
> >>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>>
> >>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy/paste
> >>>>>> error...
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>>>>> Sent: 12 March 2018 14:01
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>>>
> >>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> >>>>>>>   maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> >>>>>>>   int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From: Yossi Tamari <yossi.tam...@pipl.com>
> >>>>>>>> Sent: Monday 12th March 2018 12:56
> >>>>>>>> To: user@nutch.apache.org
> >>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>>>>
> >>>>>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
> >>>>>>>> which I think is supposed to prevent these cases. However, I just
> >>>>>>>> searched the code and couldn't find where it is used. Bug?
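[Editor's note: the two grep hits above show where db.max.outlinks.per.page is read. As a rough, self-contained sketch of what such a cap does — not the actual Nutch source — outlinks beyond the limit are simply dropped, and a negative value conventionally means "no limit":]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkCap {

    // Illustrative stand-in for a db.max.outlinks.per.page-style limit.
    static List<String> capOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        // Negative limit means "keep everything".
        if (maxOutlinksPerPage < 0 || outlinks.size() <= maxOutlinksPerPage) {
            return outlinks;
        }
        // Everything past the cap is discarded.
        return new ArrayList<>(outlinks.subList(0, maxOutlinksPerPage));
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("u1", "u2", "u3", "u4");
        System.out.println(capOutlinks(links, 2));  // prints [u1, u2]
        System.out.println(capOutlinks(links, -1)); // prints [u1, u2, u3, u4]
    }
}
```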
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
> >>>>>>>>> Sent: 12 March 2018 12:47
> >>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>>>>>
> >>>>>>>>> Dear all,
> >>>>>>>>>
> >>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
> >>>>>>>>> parsing takes about 1 millisecond, but sometimes websites have
> >>>>>>>>> crazy links that destroy the parsing (it takes 3+ hours and breaks
> >>>>>>>>> the next steps of the crawl).
> >>>>>>>>> For example, below you can see a shortened logged version of a URL
> >>>>>>>>> with an encoded image; the real length of the link is 532572
> >>>>>>>>> characters.
> >>>>>>>>>
> >>>>>>>>> Any idea what I should do with such behavior? Should I modify the
> >>>>>>>>> plugin to reject links with length > MAX, or use more complex
> >>>>>>>>> logic/check extra configuration?
> >>>>>>>>>
> >>>>>>>>> 2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>>   ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
> >>>>>>>>> 2018-03-10 23:39:52,178 INFO [main] org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> >>>>>>>>>   ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url:
> >>>>>>>>>   https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS... [532572 characters]
> >>>>>>>>> 2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat:
> >>>>>>>>>   ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
> >>>>>>>>>
> >>>>>>>>> Semyon.
> >>
> >> --
> >> Sorry this was sent from mobile. Will do less grammar and spell check
> >> than usual.
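[Editor's note: tying the thread together — the root cause is that backtracking regex filters are run against a 500K-character URL, which can take hours (see Edward's catastrophic-backtracking link). Sebastian's "pragmatic solution" amounts to a cheap length check before any normalizer or filter runs. Below is a minimal, self-contained sketch of that idea, not actual Nutch code; the limit of 2048 is an assumption for illustration.]

```java
// Sketch of a length pre-check applied before URL normalizers/filters.
// Rejecting over-long URLs up front means expensive (potentially
// backtracking) regex filters never see half-megabyte URLs at all.
public class MaxUrlLengthCheck {

    // Hypothetical default, for illustration only.
    static final int DEFAULT_MAX_URL_LENGTH = 2048;

    /**
     * Returns the URL unchanged if it is within the limit, or null to
     * signal rejection (the convention Nutch filters use for "filtered out").
     */
    static String checkLength(String url, int maxLength) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        // Simulate the 532572-character data-URL from Semyon's log.
        StringBuilder sb = new StringBuilder("https://www.sintgoedele.be/image/");
        for (int i = 0; i < 532_572; i++) {
            sb.append('A');
        }
        String huge = sb.toString();

        System.out.println(checkLength("https://www.sintgoedele.be/", DEFAULT_MAX_URL_LENGTH)); // prints the URL
        System.out.println(checkLength(huge, DEFAULT_MAX_URL_LENGTH)); // prints null
    }
}
```

The minimal alternative Sebastian mentions, a default rule in regex-urlfilter.txt.template along the lines of `-.{2049,}` (illustrative pattern), achieves a similar effect, but only in deployments that keep the regex filter enabled, and only after normalizers have already run on the long URL.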