Hi Semyon, Yossi, Markus,

> what db.max.anchor.length was supposed to do

It's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text):

  <a href="url">anchor text</a>

Can we agree to use the term "anchor" in this meaning? At least, that's how it is used in the class Outlink and hopefully throughout Nutch.

> Personally, I still think the property should be used to limit outlink length
> in parsing,

Which property, db.max.outlinks.per.page or db.max.anchor.length?

I was thinking of renaming db.max.anchor.length -> linkdb.max.anchor.length. This property was forgotten when the naming was made more consistent in

  [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*

Regarding a property to limit the URL length as discussed in NUTCH-1106:

- it should be applied before the URL normalizers (that would be the main advantage over adding a regex filter rule)
- but probably in all tools / places where URLs are filtered (ugly because there are many of them)
- one option would be to rethink the pipeline of URL normalizers and filters, as Julien did for Storm-crawler [1]
- a pragmatic solution to keep the code changes limited: do the length check twice, at the beginning of URLNormalizers.normalize(...) and of URLFilters.filter(...) (it's not guaranteed that normalizers are always called)
- the minimal solution: add a default rule to regex-urlfilter.txt.template to limit the length to 512 (or 1024/2048) characters

Best,
Sebastian

[1] https://github.com/DigitalPebble/storm-crawler/blob/master/archetype/src/main/resources/archetype-resources/src/main/resources/urlfilters.json

On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> The other properties in this section actually affect parsing (e.g.
> db.max.outlinks.per.page). I was under the impression that this is what
> db.max.anchor.length was supposed to do, and actually increased its value.
> Turns out this is one of the many things in Nutch that are not intuitive (or,
> in this case, does nothing at all).
> One of the reasons I thought so is that very long links can be used as an
> attack on crawlers.
> Personally, I still think the property should be used to limit outlink length
> in parsing, but if that is not what it's supposed to do, I guess it needs to
> be renamed (to match the code), moved to a different section of the
> properties file, and perhaps better documented. In that case, you'll need to
> use Markus' solution, and basically everybody should use Markus' first rule...
>
>> -----Original Message-----
>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>> Sent: 12 March 2018 14:51
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
>> So, which is the conclusion?
>>
>> Should it be solved in the regex file or through this property?
>>
>> Though, how is a property of the crawldb/linkdb supposed to prevent this
>> problem in Parse?
>>
>> Sent: Monday, March 12, 2018 at 1:42 PM
>> From: "Edward Capriolo" <edlinuxg...@gmail.com>
>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>>
>> Some regular expressions (those with backtracking) can be very expensive
>> on long strings:
>>
>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>
>> Maybe that is your issue.
>>
>> On Monday, March 12, 2018, Sebastian Nagel <wastl.na...@googlemail.com>
>> wrote:
>>
>>> Good catch. It should be renamed to be consistent with the other
>>> properties, right?
>>>
>>> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
>>>> Perhaps, however it starts with db, not linkdb (like the other linkdb
>>>> properties), it is in the CrawlDB part of nutch-default.xml, and the
>>>> LinkDB code uses the property name linkdb.max.anchor.length.
>>>>
>>>>> -----Original Message-----
>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>> Sent: 12 March 2018 14:05
>>>>> To: user@nutch.apache.org
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
>>>>> long links
>>>>>
>>>>> That is for the LinkDB.
>>>>>
>>>>> -----Original message-----
>>>>>> From: Yossi Tamari <yossi.tam...@pipl.com>
>>>>>> Sent: Monday 12th March 2018 13:02
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
>>>>>> long links
>>>>>>
>>>>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy-paste
>>>>>> error...
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>>>>> Sent: 12 March 2018 14:01
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
>>>>>>> long links
>>>>>>>
>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>>>>   maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>>>>>>   int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From: Yossi Tamari <yossi.tam...@pipl.com>
>>>>>>>> Sent: Monday 12th March 2018 12:56
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
>>>>>>>> long links
>>>>>>>>
>>>>>>>> nutch-default.xml contains a property db.max.outlinks.per.page,
>>>>>>>> which I think is supposed to prevent these cases. However, I just
>>>>>>>> searched the code and couldn't find where it is used. Bug?
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>>>>>>>>> Sent: 12 March 2018 12:47
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically
>>>>>>>>> long links
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> There is an issue with UrlRegexFilter and parsing. On average,
>>>>>>>>> parsing takes about 1 millisecond, but sometimes websites have
>>>>>>>>> crazy links that destroy the parsing (it takes 3+ hours and breaks
>>>>>>>>> the next steps of the crawling).
>>>>>>>>> For example, below you can see a shortened logged version of a URL
>>>>>>>>> with an encoded image; the real length of the link is 532572
>>>>>>>>> characters.
>>>>>>>>>
>>>>>>>>> Any idea what I should do with such behavior? Should I modify the
>>>>>>>>> plugin to reject links with length > MAX, or use more complex
>>>>>>>>> logic/check extra configuration?
>>>>>>>>>
>>>>>>>>> 2018-03-10 23:39:52,082 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filtering
>>>>>>>>> and normalization
>>>>>>>>> 2018-03-10 23:39:52,178 INFO [main]
>>>>>>>>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
>>>>>>>>> filter for url:
>>>>>>>>> https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>>>>>>>>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>>>>>>>>> ... [532572 characters]
>>>>>>>>> 2018-03-11 03:56:26,118 INFO [main]
>>>>>>>>> org.apache.nutch.parse.ParseOutputFormat:
>>>>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filtering and
>>>>>>>>> normalization
>>>>>>>>>
>>>>>>>>> Semyon.
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check than
>> usual.
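
[Editor's note] The minimal solution Sebastian describes (a default rule in regex-urlfilter.txt.template) could look roughly like the rule below. This is a sketch, not a rule shipped with Nutch; the 513-character cutoff is illustrative, matching the suggested 512-character limit:

```
# Reject any URL longer than 512 characters (limit is illustrative).
# Rules are evaluated in order; the first matching rule decides, so a
# length cap like this should come before the final accept rule.
-^.{513,}
```

Because the rule rejects on raw length before any other pattern is tried, it also sidesteps the catastrophic-backtracking cost of running more complex rules against a 500,000-character URL.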
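[Editor's note] The pragmatic solution (checking the length at the start of both URLFilters.filter(...) and URLNormalizers.normalize(...)) might be sketched as below. This is illustrative only, not actual Nutch code: the class name, the helper, and the property name "url.max.length" are all made up for the example.

```java
// Sketch of a URL length pre-check as it might run at the top of
// URLFilters.filter(...) and URLNormalizers.normalize(...).
// The configuration key "url.max.length" is hypothetical.
public class UrlLengthCheck {

  // Illustrative default, matching the 512-character limit discussed.
  public static final int DEFAULT_MAX_URL_LENGTH = 512;

  /**
   * Returns null (meaning "filtered out", as Nutch URL filters do)
   * if the URL is missing or exceeds maxLength; otherwise returns
   * the URL unchanged.
   */
  public static String checkLength(String url, int maxLength) {
    if (url == null || url.length() > maxLength) {
      return null; // reject before any regex work is done
    }
    return url;
  }

  public static void main(String[] args) {
    // A 532572-character URL like the one in Semyon's log is rejected:
    StringBuilder sb = new StringBuilder("https://www.sintgoedele.be/?q=");
    for (int i = 0; i < 532572; i++) {
      sb.append('x');
    }
    System.out.println(checkLength(sb.toString(), DEFAULT_MAX_URL_LENGTH)); // null
    // A normal URL passes through untouched:
    System.out.println(checkLength("https://www.sintgoedele.be/", DEFAULT_MAX_URL_LENGTH));
  }
}
```

Doing the check in both places (rather than only in the normalizers) covers the code paths where normalizers are skipped, at the cost of a cheap duplicate length comparison.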