On Wed, 2008-03-19 at 23:50 +0200, Jukka Zitting wrote:
> Hi,
> 
> On Wed, Mar 19, 2008 at 11:33 PM, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> >  Sounds good. We should add more elements for the link recognition,
> >  though.
> >
> >  I mean ATM we are looking for <a/> but for a crawler that scrapes the
> >  whole page all external resources are links and need to be saved.
> >
> >  Meaning for xhtml the following elements are important:
> >  - <img src="..."/>
> >  - <link href="..."/>
> >  - <script src="..."/>
> 
> Good point!
> 
> I'm wondering how we should best handle those in Tika, i.e. as <img/>
> and <script/> tags don't really have much meaning in the scope of text
> extraction. Perhaps we should map <img src="..." alt="..."/> to <a
> href="...">...</a> or something like that to keep the client view
> simple.

Maybe something like
http://svn.apache.org/repos/asf/labs/droids/trunk/src/core/java/org/apache/droids/parse/Outlink.java

You said in another mail that the outlinks are stored in a metadata
object. Why not store them in a dedicated object instead? I guess besides
the depth variable (which does not make an awful lot of sense for Tika)

> 
> Not sure what to do with <script/> tags, perhaps those links should go
> into a metadata property? 

IMO all links should go into the same outlink object. Tika should not
process them any further, just report them.
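
To make the idea concrete, here is a rough sketch of collecting every
link-like attribute into one list and just reporting it. LinkRef and
LinkCollector are made-up names for illustration, not existing Tika or
Droids classes, and the sketch assumes well-formed XHTML without a DOCTYPE
so that the plain JDK SAX parser can read it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical value object: one reported link, without any depth field.
final class LinkRef {
    final String element; // a, img, link or script
    final String target;  // raw attribute value, left unresolved

    LinkRef(String element, String target) {
        this.element = element;
        this.target = target;
    }
}

// Hypothetical SAX handler that records href/src values and nothing else.
final class LinkCollector extends DefaultHandler {
    private final List<LinkRef> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String attr;
        switch (qName) {
            case "a":
            case "link":
                attr = "href";
                break;
            case "img":
            case "script":
                attr = "src";
                break;
            default:
                return; // not a link-carrying element
        }
        String value = atts.getValue(attr);
        if (value != null) {
            links.add(new LinkRef(qName, value));
        }
    }

    // Parses the given XHTML string and returns the links in document order.
    static List<LinkRef> collect(String xhtml) {
        LinkCollector handler = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
        return handler.links;
    }
}
```

The links are only reported here; what to do with a script or image link
stays with whoever consumes the list.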

> I don't think inline scripts should be part of
> the extracted text content (but others may disagree), so script links
> should probably also not be included in the XHTML output.

I agree. However, AJAX is becoming more popular and some page content
can only be reached via scripts. Not sure where this leaves us.

> 
> >  In css files there can as well be links to either images or other css
> >  files.
> 
> We don't currently have explicit CSS parser support in Tika, the plain
> text extractor comes closest. I'll see if we could add something that
> would allow easy detection of links within CSS files.

http://svn.apache.org/repos/asf/forrest/trunk/main/webapp/resources/chaperon/
In Forrest we are using Chaperon for this, but I am not sure whether Tika
should parse CSS files. It is similar to inline scripts, isn't it?
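
If a full grammar turns out to be too heavy, a much cruder alternative
would be a regular expression over the stylesheet text. This is only a
sketch and not how Chaperon works; CssLinks is a made-up name, and the
pattern ignores comments, escapes and other corner cases of real CSS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls link targets out of CSS text with a regex.
final class CssLinks {
    // First alternative: url(...) with optional quotes.
    // Second alternative: @import "..." written without url().
    private static final Pattern LINK = Pattern.compile(
            "url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)|@import\\s+['\"]([^'\"]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw targets in the order they appear; data: URIs are not filtered out.
    static List<String> extract(String css) {
        List<String> targets = new ArrayList<>();
        Matcher m = LINK.matcher(css);
        while (m.find()) {
            targets.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return targets;
    }
}
```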

> 
> On a related note, currently there's no base URI support in Tika. We
> probably should add that, and treat a dc:identifier URI (if available)
> as the base unless one has been explicitly specified in the document.
> Also, if a base URI is available, Tika should automatically make all
> relative URIs absolute to make client life easier.

IMO that should be the problem of the client because there are situations
where the client may prefer relative links.
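
Resolving on the client side is cheap anyway. A minimal sketch, assuming
the client knows the base URI and uses the JDK's java.net.URI (LinkResolver
is a made-up name and example.org is a placeholder):

```java
import java.net.URI;

// Hypothetical client-side helper; Tika would keep reporting links as found.
final class LinkResolver {
    // Returns the link untouched when no base is known, so relative links survive.
    static String resolve(String base, String link) {
        if (base == null) {
            return link;
        }
        return URI.create(base).resolve(link).toString();
    }
}
```

For example, resolve("http://example.org/docs/index.html", "../img/logo.png")
gives "http://example.org/img/logo.png", while a client that wants the
relative form simply skips the call.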

salu2
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions
