Good catch. It should be renamed to be consistent with other properties, right?
On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and the LinkDB
> code uses the property name linkdb.max.anchor.length.
>
>> -----Original Message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: 12 March 2018 14:05
>> To: [email protected]
>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>
>> That is for the LinkDB.
>>
>> -----Original message-----
>>> From: Yossi Tamari <[email protected]>
>>> Sent: Monday 12th March 2018 13:02
>>> To: [email protected]
>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>>
>>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy-paste error...
>>>
>>>> -----Original Message-----
>>>> From: Markus Jelsma <[email protected]>
>>>> Sent: 12 March 2018 14:01
>>>> To: [email protected]
>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>>>
>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
>>>>     maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
>>>> scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
>>>>     int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
>>>>
>>>> -----Original message-----
>>>>> From: Yossi Tamari <[email protected]>
>>>>> Sent: Monday 12th March 2018 12:56
>>>>> To: [email protected]
>>>>> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long links
>>>>>
>>>>> nutch-default.xml contains a property db.max.outlinks.per.page, which I
>>>>> think is supposed to prevent these cases. However, I just searched the
>>>>> code and couldn't find where it is used. Bug?
>>>>>> -----Original Message-----
>>>>>> From: Semyon Semyonov <[email protected]>
>>>>>> Sent: 12 March 2018 12:47
>>>>>> To: [email protected]
>>>>>> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> There is an issue with UrlRegexFilter and parsing. On average, parsing
>>>>>> takes about 1 millisecond, but sometimes websites have crazy links that
>>>>>> destroy the parsing (it takes 3+ hours and breaks the next steps of the
>>>>>> crawl). For example, below you can see a shortened logged version of a
>>>>>> URL with an encoded image; the real length of the link is 532572
>>>>>> characters.
>>>>>>
>>>>>> Any idea what I should do with such behavior? Should I modify the
>>>>>> plugin to reject links with length > MAX, or use more complex logic /
>>>>>> check extra configuration?
>>>>>>
>>>>>> 2018-03-10 23:39:52,082 INFO [main] org.apache.nutch.parse.ParseOutputFormat:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.3. Before filtering and normalization
>>>>>> 2018-03-10 23:39:52,178 INFO [main] org.apache.nutch.urlfilter.api.RegexURLFilterBase:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url:
>>>>>> https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoAAAANS... [532572 characters]
>>>>>> 2018-03-11 03:56:26,118 INFO [main] org.apache.nutch.parse.ParseOutputFormat:
>>>>>> ParseOutputFormat.Write.filterNormalize 4.4. After filtering and normalization
>>>>>>
>>>>>> Semyon.
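As a footnote to Semyon's question about rejecting links with length > MAX: a cheap length pre-check in front of the regex filter avoids feeding a half-megabyte URL to the regex engine at all. The sketch below is only illustrative, not Nutch code: the class name UrlLengthGuard and the 1024-character default are assumptions. A real fix would implement Nutch's org.apache.nutch.net.URLFilter extension point (whose filter() method returns null to reject a URL) and read the limit from the job configuration.

```java
// Hypothetical pre-filter: reject a URL before regex filtering if it exceeds
// a maximum length. Class name and default limit are illustrative assumptions;
// this mirrors the URLFilter convention of returning null for rejected URLs.
public class UrlLengthGuard {
    private final int maxLength;

    public UrlLengthGuard(int maxLength) {
        this.maxLength = maxLength;
    }

    // Returns the URL unchanged when it is within the limit,
    // or null to signal rejection.
    public String filter(String url) {
        if (url == null || url.length() > maxLength) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        UrlLengthGuard guard = new UrlLengthGuard(1024);
        // A normal URL passes through; a 600k-character data URL is rejected.
        System.out.println(guard.filter("https://www.sintgoedele.be/") != null);
        System.out.println(guard.filter("https://example.com/" + "a".repeat(600000)) == null);
    }
}
```

Running the length check before any regex evaluation keeps the cost per URL at O(1), regardless of how pathological the link is.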

