Good catch. It should be renamed to be consistent with other properties, right?
On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and the LinkDB
> code uses the property name linkdb.max.anchor.length.
>
>> -----Original Message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: 12 March 2018 14:05
>> To: [email protected]
>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> That is for the LinkDB.
>>
>> -----Original message-----
>>> From: Yossi Tamari <[email protected]>
>>> Sent: Monday 12th March 2018 13:02
>>> To: [email protected]
>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>>
>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy-paste error...
>>>
>>>> -----Original Message-----
>>>> From: Markus Jelsma <[email protected]>
>>>> Sent: 12 March 2018 14:01
>>>> To: [email protected]
>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>>>
>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>     maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>>>     int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>
>>>> -----Original message-----
>>>>> From: Yossi Tamari <[email protected]>
>>>>> Sent: Monday 12th March 2018 12:56
>>>>> To: [email protected]
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>>>>
>>>>> nutch-default.xml contains a property db.max.outlinks.per.page, which I
>>>>> think is supposed to prevent these cases. However, I just searched the
>>>>> code and couldn't find where it is used. Bug?
>>>>>> -----Original Message-----
>>>>>> From: Semyon Semyonov <[email protected]>
>>>>>> Sent: 12 March 2018 12:47
>>>>>> To: [email protected]
>>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> There is an issue with UrlRegexFilter and parsing. On average, parsing
>>>>>> takes about 1 millisecond, but sometimes websites have crazy links that
>>>>>> destroy the parsing (it takes 3+ hours and breaks the next steps of the
>>>>>> crawl). For example, below you can see a shortened logged version of a
>>>>>> URL with an encoded image; the real length of the link is 532572
>>>>>> characters.
>>>>>>
>>>>>> Any idea what I should do with such behavior? Should I modify the
>>>>>> plugin to reject links with length > MAX, or use more complex logic /
>>>>>> check extra configuration?
>>>>>>
>>>>>> 2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filtering and normalization
>>>>>> 2018-03-10 23:39:52,178 INFO [main] org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url:
>>>>>> https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS... [532572 characters]
>>>>>> 2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filtering and normalization
>>>>>>
>>>>>> Semyon.
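As a footnote to Semyon's question about rejecting links with length > MAX: a cheap length pre-check in front of the regex filter avoids feeding a half-megabyte URL to the regex engine at all. The sketch below is only illustrative, not Nutch code: the class name UrlLengthGuard and the 1024-character default are assumptions. A real fix would implement Nutch's org.apache.nutch.net.URLFilter extension point (whose filter() method returns null to reject a URL) and read the limit from the job configuration.

```java
// Hypothetical pre-filter: reject a URL before regex filtering if it exceeds
// a maximum length. Class name and default limit are illustrative assumptions;
// this mirrors the URLFilter convention of returning null for rejected URLs.
public class UrlLengthGuard {
    private final int maxLength;

    public UrlLengthGuard(int maxLength) {
        this.maxLength = maxLength;
    }

    // Returns the URL unchanged when it is within the limit,
    // or null to signal rejection.
    public String filter(String url) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        UrlLengthGuard guard = new UrlLengthGuard(1024);
        // A normal URL passes through; a 600k-character data URL is rejected.
        System.out.println(guard.filter("https://www.sintgoedele.be/") != null);
        System.out.println(guard.filter("https://example.com/" + "a".repeat(600000)) == null);
    }
}
```

Running the length check before any regex evaluation keeps the cost per URL at O(1), regardless of how pathological the link is.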

