[ 
http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ] 
            
Doug Cook commented on NUTCH-364:
---------------------------------

I've been looking into this a little bit. I see two problems:

(1) The current "two-pass" heuristic URL-like string extractor has some flaws 
(I know, it was intended to be simple). 

The biggest flaw is that it considers a URL to be more or less anything with a 
"." or a "/" in it. This is problematic because a lot of JavaScript outputs 
HTML, where "/" commonly occurs in closing tags. The philosophy seems to be to 
keep the extraction simple and rely on the URLNormalizer to throw an exception 
for a malformed URL, but the URLNormalizer doesn't actually do much checking.

The problem can be fixed either here or with a more robust validity checker in 
the URL normalizers. I'd be inclined to put it here, to avoid slowing down the 
normalization of mostly valid URLs. A simple but imperfect improvement would be 
to reject strings containing tag-like tokens ("<", ">", "&gt;", "&lt;", and so 
on). I can run some tests to see what other common "garbage URLs" occur.
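That simple check could be sketched roughly as below. The class and method 
names are hypothetical, not existing Nutch code, and the marker list is just 
the one suggested above:

```java
// Illustrative filter for extracted candidate strings: reject anything
// containing tag-like markers, which indicates an HTML fragment rather
// than a URL. Names here are hypothetical, not part of JSParseFilter.
public class JsUrlCandidateFilter {
    private static final String[] TAG_MARKERS = {"<", ">", "&lt;", "&gt;"};

    public static boolean looksLikeHtmlFragment(String candidate) {
        for (String marker : TAG_MARKERS) {
            if (candidate.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A "URL" extracted from JS that is really a closing-tag fragment:
        System.out.println(looksLikeHtmlFragment(
            "http://www.metropoleparis.com/2000/501/</IFRAME>")); // true
        System.out.println(looksLikeHtmlFragment(
            "http://www.metropoleparis.com/2000/501/"));          // false
    }
}
```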

I see that there is a more robust URL pattern string commented out. That one 
is better, but it still has the same problem: it would allow "&gt;" and "&lt;".
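A stricter pattern along those lines might exclude raw angle brackets from the 
URL body and use a lookahead so "&" stays legal in query strings while the 
"&lt;"/"&gt;" entity forms are rejected. This is only an illustrative sketch, 
not the pattern actually commented out in JSParseFilter:

```java
import java.util.regex.Pattern;

// Sketch of a stricter URL pattern: no whitespace, quotes, or raw "<"/">",
// and "&" is allowed only when NOT starting an "&lt;" or "&gt;" entity.
public class UrlPatternSketch {
    static final Pattern URL_PATTERN =
        Pattern.compile("https?://(?:[^\\s\"'<>&]|&(?!lt;|gt;))+");

    public static boolean isPlausibleUrl(String s) {
        return URL_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleUrl(
            "http://www.wein-plus.de/glossar/G.htm"));               // true
        System.out.println(isPlausibleUrl(
            "http://www.metropoleparis.com/2000/501/</IFRAME>"));    // false
    }
}
```

Query strings such as "?a=1&b=2" still pass, since the lookahead only blocks 
the two entity forms.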

(2) Absolute URLs are also not handled properly. For example:

http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/products.html

refers to ./menu.js, which in turn creates a menu linking to (among others):
/tours_marchesi.asp

This should resolve to http://www.palmbayimports.com/tours_marchesi.asp, but 
instead resolves to:
http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/tours_marchesi.asp

This won't be perfectly solvable given the current heuristic string-extraction 
approach, because a string beginning with "/" may in fact be a suffix to which 
the JavaScript prepends some other directory, and not actually an absolute 
URL. But since we don't know the prefix string, interpreting the string as 
relative is as likely to be incorrect as interpreting it as absolute, and it 
creates many more unique URLs. We may want to interpret it as absolute to 
avoid generating a lot of garbage (as in the palmbayimports example above, 
which creates tens of thousands of garbage URLs).
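For what it's worth, standard java.net.URL resolution already treats a leading 
"/" as server-relative, keeping only the authority of the base. A minimal 
sketch of handing extracted strings to that resolution (the helper name is 
hypothetical, not existing Nutch code):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch: resolve an extracted string against the page URL using standard
// RFC-style resolution, so a leading "/" replaces the whole base path.
public class JsLinkResolver {
    public static String resolve(String base, String candidate) {
        try {
            return new URL(new URL(base), candidate).toString();
        } catch (MalformedURLException e) {
            return null; // the candidate wasn't usable as a URL
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(
            "http://www.palmbayimports.com/xq/asp/VID.401/WID.1446/qx/products.html",
            "/tours_marchesi.asp"));
        // http://www.palmbayimports.com/tours_marchesi.asp
    }
}
```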

Comments?


> Javascript parser creates some fairly bogus URLs
> ------------------------------------------------
>
>                 Key: NUTCH-364
>                 URL: http://issues.apache.org/jira/browse/NUTCH-364
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: OS X 10.4.7
>            Reporter: Doug Cook
>
> If one crawls, say, 
>      http://www.metropoleparis.com/2000/501/
> with the Javascript parser enabled, one gets outlinks of the form:
> 2006-09-08 16:55:06,301 DEBUG js.JSParseFilter -  - outlink from JS: 
> 'http://www.metropoleparis.com/2000/501/</IFRAME>'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 
> 'http://www.metropoleparis.com/2000/501/</SCR'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 
> 'http://www.metropoleparis.com/2000/501/</DIV>'
> Another example would be:
> http://www.wein-plus.de/glossar/G.htm
> which yields the URL (among others):
> 2006-09-08 16:55:10,499 DEBUG js.JSParseFilter -  - outlink from JS: 
> 'http://www.wein-plus.de/glossar/<\/a>'
> I have seen these form "crawler traps" and make small sites explode to many, 
> many URLs. For the moment, I have the worst offenders plugged with specific 
> filter rules, but it would be nice to see if there is a way to improve the 
> JSParseFilter's heuristics to avoid these.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
