scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);

-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Monday 12th March 2018 12:56
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>
> nutch-default.xml contains a property db.max.outlinks.per.page, which I think is
> supposed to prevent these cases. However, I just searched the code and
> couldn't find where it is used. Bug?
>
> > -----Original Message-----
> > From: Semyon Semyonov <semyon.semyo...@mail.com>
> > Sent: 12 March 2018 12:47
> > To: usernutch.apache.org <user@nutch.apache.org>
> > Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> >
> > Dear all,
> >
> > There is an issue with UrlRegexFilter and parsing. On average, parsing takes
> > about 1 millisecond, but sometimes websites have crazy links that
> > destroy the parsing (it takes 3+ hours and destroys the next steps of the
> > crawl).
> > For example, below you can see a shortened logged version of a URL with an
> > encoded image; the real length of the link is 532572 characters.
> >
> > Any idea what I should do about such behavior? Should I modify the plugin to
> > reject links with length > MAX, or use more complex logic/check extra
> > configuration?
> >
> > 2018-03-10 23:39:52,082 INFO [main]
> > org.apache.nutch.parse.ParseOutputFormat:
> > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > normalization
> > 2018-03-10 23:39:52,178 INFO [main]
> > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
> > :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS
> > UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > dbnu...
> > [532572 characters]
> > 2018-03-11 03:56:26,118 INFO [main]
> > org.apache.nutch.parse.ParseOutputFormat:
> > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > normalization
> >
> > Semyon.
> >
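
For the "reject links with length > MAX" idea raised in the quoted question, a minimal sketch of a length check that runs before any regex is evaluated might look like the following. It is a standalone illustration, not Nutch code: the class name MaxLengthUrlFilter, the filterAll helper and the 2048-character limit are all assumptions made for the example. The only convention borrowed from Nutch is that a URL filter returns the URL to accept it and null to reject it; in a real plugin the same check would sit inside a class implementing org.apache.nutch.net.URLFilter.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical length-based URL filter (not part of Nutch).
 * Mirrors the URLFilter convention: return the URL to accept, null to reject.
 */
final class MaxLengthUrlFilter {

    // Assumed limit for illustration only; pick a value that suits the crawl.
    static final int DEFAULT_MAX_LENGTH = 2048;

    private final int maxLength;

    MaxLengthUrlFilter(int maxLength) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        this.maxLength = maxLength;
    }

    /** Cheap O(1) length check, so overlong URLs never reach a regex. */
    String filter(String url) {
        if (url == null || url.length() > maxLength) {
            return null; // rejected
        }
        return url; // accepted unchanged
    }

    /** Applies filter() to each URL and keeps only the accepted ones. */
    List<String> filterAll(List<String> urls) {
        List<String> accepted = new ArrayList<>();
        for (String url : urls) {
            String kept = filter(url);
            if (kept != null) {
                accepted.add(kept);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        MaxLengthUrlFilter filter = new MaxLengthUrlFilter(DEFAULT_MAX_LENGTH);

        // Build an overlong URL similar in spirit to the embedded-image link.
        StringBuilder longUrl = new StringBuilder("https://example.org/image/png%3Bbase64%2C");
        while (longUrl.length() <= DEFAULT_MAX_LENGTH) {
            longUrl.append('A');
        }

        System.out.println(filter.filter("https://example.org/page.html"));
        System.out.println(filter.filter(longUrl.toString()));
    }
}
```

Placing a check like this ahead of the regex-based filters keeps the cost per URL constant regardless of its length, which is the point of the suggestion; whether the limit belongs in a separate plugin or in configuration is left open here.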