Hi Yossi,

it's used in FetcherThread and ParseOutputFormat:

  git grep -F db.max.outlinks.per.page

However, it doesn't limit the length of a single outlink in characters, but
the number of outlinks followed (added to CrawlDb). There was NUTCH-1106 to
add a property to limit the outlink length.

Sebastian

On 03/12/2018 12:56 PM, Yossi Tamari wrote:
> Nutch.default contains a property db.max.outlinks.per.page, which I think is
> supposed to prevent these cases. However, I just searched the code and
> couldn't find where it is used. Bug?
>
>> -----Original Message-----
>> From: Semyon Semyonov <semyon.semyo...@mail.com>
>> Sent: 12 March 2018 12:47
>> To: usernutch.apache.org <user@nutch.apache.org>
>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> Dear all,
>>
>> There is an issue with UrlRegexFilter and parsing. On average, parsing takes
>> about 1 millisecond, but sometimes websites have crazy links that break
>> the parsing (it takes 3+ hours and breaks the next steps of the
>> crawling).
>> For example, below you can see a shortened logged version of a URL with an
>> encoded image; the real length of the link is 532572 characters.
>>
>> Any idea what I should do about such behavior? Should I modify the plugin to
>> reject links with length > MAX, or use more complex logic/check extra
>> configuration?
>> 2018-03-10 23:39:52,082 INFO [main]
>> org.apache.nutch.parse.ParseOutputFormat:
>> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
>> normalization
>> 2018-03-10 23:39:52,178 INFO [main]
>> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
>> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
>> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
>> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
>> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
>> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
>> dbnu... [532572 characters]
>> 2018-03-11 03:56:26,118 INFO [main]
>> org.apache.nutch.parse.ParseOutputFormat:
>> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
>>
>> Semyon.
>
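
On the question of rejecting links with length > MAX: the idea could be sketched roughly as below. This is only an illustration, not Nutch code. The class name UrlLengthGuard, the 2048-character limit used in main, and the example suffix pattern are all assumptions made up for this sketch. It only mirrors the convention of a URL filter returning null to reject a URL. The point is that the length check is cheap and runs first, so no regex is ever applied to a 500 KB string.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): drops over-long URLs before any regex runs. */
final class UrlLengthGuard {

    /** Illustrative stand-in for one exclusion rule of a regex-based URL filter. */
    private static final Pattern SKIP_SUFFIX =
        Pattern.compile("\\.(?:gif|jpg|png|css|js)$", Pattern.CASE_INSENSITIVE);

    private UrlLengthGuard() {}

    /**
     * Returns the URL unchanged if it is accepted, or null if it is rejected.
     * maxLength is an assumed, caller-chosen limit in characters.
     */
    static String filter(String url, int maxLength) {
        // Constant-time check; the regex below never sees an over-long string.
        if (url == null || url.length() > maxLength) {
            return null;
        }
        // Only URLs of acceptable length reach the (potentially expensive) regex.
        if (SKIP_SUFFIX.matcher(url).find()) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        // Build a synthetic URL as long as the one reported in the log.
        StringBuilder longUrl = new StringBuilder("https://example.org/image/png%3Bbase64%2C");
        while (longUrl.length() < 532572) {
            longUrl.append('A');
        }
        System.out.println("short URL accepted: "
            + (filter("https://example.org/page.html", 2048) != null));
        System.out.println("long URL accepted: "
            + (filter(longUrl.toString(), 2048) != null));
    }
}
```

Whether such a check belongs in the regex filter plugin itself, in a separate filter that runs earlier, or behind a configuration property is a design choice this sketch does not settle.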