Let's look at the specific case first. Maybe I jumped too soon to the
conclusion that the link extraction feature is too simple.
At line 139 of LinkExtractor.java, URI.resolve(String) is used to
resolve the target:
if (!target.toLowerCase().startsWith("javascript")
        && !target.contains(":/")) {
139:    return base.getURI().resolve(target.split("#")[0]);
} else if (!target.toLowerCase().startsWith("javascript")) {
    return new URI(target.split("#")[0]);
}
When I test the URI API with:

    new URI("http://www.google.com").resolve("index.php")

it resolves the URL to "http://www.google.comindex.php". Unless you mean
this is a bug in my JDK, we need to specially append a "/" prefix to the
base path.
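For what it's worth, one possible workaround is to give the base URI an
explicit "/" path before resolving. This is just a sketch, not Droids
code; the class name ResolveSketch and the helper resolveSafely are my
own invention:

```java
import java.net.URI;

public class ResolveSketch {
    // java.net.URI (RFC 2396 semantics) concatenates the authority and a
    // relative path when the base has no path at all, which is how
    // "http://www.google.comindex.php" comes about. Resolving "/" first
    // gives the base a root path, so later resolves behave as expected.
    static URI resolveSafely(URI base, String target) {
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        // mirror the existing LinkExtractor behavior of dropping fragments
        return base.resolve(target.split("#")[0]);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.google.com");
        // the broken result described above:
        System.out.println(base.resolve("index.php"));
        // with the workaround: http://www.google.com/index.php
        System.out.println(resolveSafely(base, "index.php"));
    }
}
```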
And previously, I found another scenario that doesn't work: when there
is a link <a href="?test=true">test</a> under www.google.com/index.php,
it resolves to www.google.com/?test=true rather than
www.google.com/index.php?test=true as a web browser would.
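For this one, a possible fix is to special-case targets that start with
"?" and rebuild the URI keeping the base path, which is what browsers
(and RFC 3986) do. Again only a sketch under that assumption; QuerySketch
and resolveQuery are hypothetical names, not existing Droids code:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QuerySketch {
    // RFC 3986 says a query-only reference keeps the base path, but
    // java.net.URI follows RFC 2396 and drops the last path segment.
    static URI resolveQuery(URI base, String target) {
        if (target.startsWith("?")) {
            try {
                return new URI(base.getScheme(), base.getAuthority(),
                               base.getPath(), target.substring(1), null);
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return base.resolve(target);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.google.com/index.php");
        // plain resolve: http://www.google.com/?test=true
        System.out.println(base.resolve("?test=true"));
        // workaround: http://www.google.com/index.php?test=true
        System.out.println(resolveQuery(base, "?test=true"));
    }
}
```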
This makes me feel there are many special scenarios that a crawler needs
to cater for. What do you think? Is it really so simple? My suggestion
to add a wiki page is for listing those special scenarios, some of which
may just be caused by non-standard usage.
regards,
mingfai
On Thu, Apr 2, 2009 at 7:24 PM, Thorsten Scherler <
[email protected]> wrote:
> On Thu, 2009-04-02 at 18:53 +0800, Mingfai wrote:
> > hi,
> >
> > The default LinkExtractor seems to be quite simple (too simple). It
> > mainly uses URI.resolve and only caters for the # and javascript
> > scenarios (from LinkExtractor.java getURI). Simple link resolving of
> > <a href="test.html"> against new URI("http://www.google.com") will be
> > wrong, as it will return "http://www.google.comtest.html".
>
> Well, the link extraction has always worked well. The case you just
> pointed out looks like a bug, BUT if you mean new
> URL("http://testServer.com", "test.html") then have a look at
> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html#URL(java.net.URL,
> java.lang.String)
>
> > And there are many cases that URI.resolve doesn't cater for. It seems
> > to me we need to do some work in this area to make Droids more usable.
> > Does anyone have any experience in outlink extraction?
>
> Enhancements are always welcome; however, the link extraction should
> work fine. At least when I last looked at it, it was fine. The
> limitation ATM is the extraction of jscript-generated links.
>
> >
> > I'm trying to see how other frameworks handle outlink extraction and
> > looked at:
> >
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log
>
> Funnily enough, that was the basis of Droids' outlink extraction in the
> first version I hacked.
>
> >
> > https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java
> > (Heritrix's JavaDoc shows they have given some good thought to handling
> > different tags and attributes.)
> >
> > What do you think if I add a wiki page that lists some scenarios of
> > outlink handling (i.e. the requirements)? Or does anyone know if any of
> > the many Java crawler projects have documentation in this area?
>
> If you do not look into jscript/ajax link extraction then there is no
> secret to it. Either go with xpath expressions or, e.g. for plain text,
> with regexps. Please feel free to open a wiki page around the issue.
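
[Editor's note: a rough sketch of the regexp approach mentioned in the
quoted paragraph above, for the wiki page. RegexLinkSketch is a
hypothetical name; a real extractor would also need to handle unquoted
attribute values, <base> tags, srcset, etc.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkSketch {
    // Naive pattern for quoted href attributes in anchor tags; stops at
    // '#' so fragments are dropped, like the existing LinkExtractor does.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"'#]+)",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(1);
            // skip javascript: pseudo-links, as the current code does
            if (!target.toLowerCase().startsWith("javascript")) {
                links.add(target);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<a href=\"test.html\">t</a> <a href='javascript:void(0)'>j</a>"));
    }
}
```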
>
> salu2
>
> >
> > regards,
> > mingfai
> --
> Thorsten Scherler <thorsten.at.apache.org>
> Open Source Java <consulting, training and solutions>
>
> Sociedad Andaluza para el Desarrollo de la Sociedad
> de la Información, S.A.U. (SADESI)