On Wed, 2008-03-19 at 23:50 +0200, Jukka Zitting wrote:
> Hi,
>
> On Wed, Mar 19, 2008 at 11:33 PM, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> > Sounds good. We should add more elements for the link recognition,
> > though.
> >
> > I mean ATM we are looking for <a/> but for a crawler that scrapes the
> > whole page all external resources are links and need to be saved.
> >
> > Meaning for xhtml the following elements are important:
> > - <img src="..."/>
> > - <link href="..."/>
> > - <script src="..."/>
>
> Good point!
>
> I'm wondering how we should best handle those in Tika, i.e. as <img/>
> and <script/> tags don't really have much meaning in the scope of text
> extraction. Perhaps we should map <img src="..." alt="..."/> to <a
> href="...">...</a> or something like that to keep the client view
> simple.

Maybe something like
http://svn.apache.org/repos/asf/labs/droids/trunk/src/core/java/org/apache/droids/parse/Outlink.java

You said in another mail that the outlinks are stored in a metadata
object. Why not store them in a dedicated object instead? I guess
something like it would do, besides the depth variable (which does not
make an awful lot of sense for Tika).

>
> Not sure what to do with <script/> tags, perhaps those links should go
> into a metadata property?

IMO all links should go into the same outlink object. Tika should not
treat them any further, just report them.

> I don't think inline scripts should be part of
> the extracted text content (but others may disagree), so script links
> should probably also not be included in the XHTML output.

I agree. However, AJAX is becoming more popular and some page content
can only be reached via scripts. Not sure where this leaves us.

>
> > In css files there can as well be links to either images or other css
> > files.
>
> We don't currently have explicit CSS parser support in Tika, the plain
> text extractor comes closest. I'll see if we could add something that
> would allow easy detection of links within CSS files.

http://svn.apache.org/repos/asf/forrest/trunk/main/webapp/resources/chaperon/

In Forrest we are using Chaperon for this, but I am not sure whether
Tika should parse CSS files. It is similar to inline scripts, isn't it?

>
> On a related note, currently there's no base URI support in Tika. We
> probably should add that, and treat a dc:identifier URI (if available)
> as the base unless one has been explicitly specified in the document.
> Also, if a base URI is available, Tika should automatically make all
> relative URIs absolute to make client life easier.

IMO that should be the problem of the client because there are
situations where the client may prefer relative links.

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java consulting, training and solutions
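
To make the "one outlink object, just report the links" idea a bit more concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: the names OutlinkSketch, Outlink and collect are hypothetical and are neither Tika nor Droids API, the input is assumed to be well-formed XHTML without a DOCTYPE, and the base URI in main is made up. The collector reports href/src values exactly as written; resolving them against a base is left to the client.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; none of these names are Tika or Droids API.
class OutlinkSketch {

    /** One reported link: the element it came from and the URI exactly as written. */
    static final class Outlink {
        final String element;
        final String target;

        Outlink(String element, String target) {
            this.element = element;
            this.target = target;
        }
    }

    /** Collects href/src targets from a, link, img and script in well-formed XHTML. */
    static List<Outlink> collect(String xhtml) throws Exception {
        final List<Outlink> outlinks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                String attribute;
                switch (qName) {
                    case "a":
                    case "link":
                        attribute = "href";
                        break;
                    case "img":
                    case "script":
                        attribute = "src";
                        break;
                    default:
                        return; // not a link-bearing element
                }
                String target = atts.getValue(attribute);
                if (target != null) {
                    // Report only; the value is stored untouched (possibly relative).
                    outlinks.add(new Outlink(qName, target));
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return outlinks;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><link href=\"style.css\"/><script src=\"app.js\"></script></head>"
                + "<body><a href=\"next.html\">next</a><img src=\"img/logo.png\" alt=\"logo\"/></body></html>";
        // Made-up base URI; a client that prefers relative links simply skips resolve().
        URI base = URI.create("http://example.org/docs/index.html");
        for (Outlink link : collect(page)) {
            System.out.println(link.element + " " + link.target + " -> " + base.resolve(link.target));
        }
    }
}
```

Keeping the raw target in the outlink and calling URI.resolve only in the client keeps both options open: a crawler can make everything absolute, while a client that wants relative links gets them unchanged.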