Some regular expressions (those with backtracking) can be very expensive for long strings:
https://regular-expressions.mobi/catastrophic.html?wlr=1
Maybe that is your issue.

On Monday, March 12, 2018, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Good catch. It should be renamed to be consistent with other properties,
> right?
>
> On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > Perhaps, however it starts with db, not linkdb (like the other linkdb
> > properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB
> > code uses the property name linkdb.max.anchor.length.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: 12 March 2018 14:05
> >> To: user@nutch.apache.org
> >> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>
> >> That is for the LinkDB.
> >>
> >> -----Original message-----
> >>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>> Sent: Monday 12th March 2018 13:02
> >>> To: user@nutch.apache.org
> >>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>
> >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
> >>>
> >>>> -----Original Message-----
> >>>> From: Markus Jelsma <markus.jel...@openindex.io>
> >>>> Sent: 12 March 2018 14:01
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>
> >>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> >>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> >>>>
> >>>> -----Original message-----
> >>>>> From:Yossi Tamari <yossi.tam...@pipl.com>
> >>>>> Sent: Monday 12th March 2018 12:56
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>
> >>>>> Nutch.default contains a property db.max.outlinks.per.page, which
> >>>>> I think is supposed to prevent these cases. However, I just searched
> >>>>> the code and couldn't find where it is used. Bug?
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Semyon Semyonov <semyon.semyo...@mail.com>
> >>>>>> Sent: 12 March 2018 12:47
> >>>>>> To: usernutch.apache.org <user@nutch.apache.org>
> >>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> >>>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> There is an issue with UrlRegexFilter and parsing. On average,
> >>>>>> parsing takes about 1 millisecond, but sometimes websites have
> >>>>>> crazy links that destroy the parsing (it takes 3+ hours and
> >>>>>> destroys the next steps of the crawling).
> >>>>>> For example, below you can see a shortened logged version of a URL
> >>>>>> with an encoded image; the real length of the link is 532572 characters.
> >>>>>>
> >>>>>> Any idea what I should do about such behavior? Should I modify
> >>>>>> the plugin to reject links with length > MAX, or use more complex
> >>>>>> logic/check extra configuration?
> >>>>>>
> >>>>>> 2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat: ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
> >>>>>> 2018-03-10 23:39:52,178 INFO [main] org.apache.nutch.urlfilter.api.RegexURLFilterBase: ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNudbnu50253lju... [532572 characters]
> >>>>>> 2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat: ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
> >>>>>>
> >>>>>> Semyon.

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than usual.
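
P.S. On the "reject links with length > MAX" idea from the original question: a rough sketch of what such a guard could look like is below. It is not the Nutch plugin code. The class name LengthGuardedUrlFilter, the limit of 2048 and the reject pattern are all made up for illustration; the only thing borrowed from Nutch is the convention of returning null for a rejected URL. The point is just that a cheap length check runs before any regex sees the string.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical sketch (not part of Nutch): rejects over-long URLs
 * before any regular expression is evaluated against them.
 */
public class LengthGuardedUrlFilter {
    // Assumed limit supplied by the caller; not a Nutch default.
    private final int maxUrlLength;
    // Assumed reject rule; real rules would come from the filter config.
    private final Pattern rejectPattern;

    public LengthGuardedUrlFilter(int maxUrlLength, String rejectRegex) {
        this.maxUrlLength = maxUrlLength;
        this.rejectPattern = Pattern.compile(rejectRegex);
    }

    /** Returns the URL if accepted, or null if rejected. */
    public String filter(String url) {
        // Constant-time check: the regex engine never sees a huge string.
        if (url == null || url.length() > maxUrlLength) {
            return null;
        }
        // Only short URLs reach the (possibly backtracking) pattern.
        if (rejectPattern.matcher(url).find()) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        LengthGuardedUrlFilter f =
            new LengthGuardedUrlFilter(2048, "\\.(gif|jpg|png)$");

        // Build a URL of the size reported in the thread (532572 chars).
        StringBuilder huge = new StringBuilder("https://example.org/image/");
        while (huge.length() < 532572) {
            huge.append('A');
        }

        System.out.println(f.filter("https://example.org/page.html"));
        System.out.println(f.filter(huge.toString()));
    }
}
```

Whether the limit belongs in the regex filter plugin itself or earlier, where outlinks are collected, is a separate design question; the sketch only shows the ordering of the two checks.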