Improved handling of attributes

Ken Krugler Thu, 20 May 2010 18:09:08 -0700

Hi all,

I've been looking into improving Tika handling of attributes - for both HTML and other formats.

There are several different issues that I've seen, that all seem related:


1. Ability to allow all attributes through from HTML documents

TIKA-379, building on TIKA-347, allows both more relaxed passing of attributes, as well as letting all elements through.

So if somebody wants to get the "lang" attribute for the <html> element of an HTML document, they could do this by using the identity mapper.

Assuming this gets reviewed/committed by Chris, then this issue would be solved.

2. Automatically let all valid XHTML 1.0 attributes through from HTML documents

This would be an improvement, as many consumers of parse output wouldn't want to process the raw (unnormalized) elements they'd get with the IdentityHtmlMapper, but they would want to get any standard attributes.

I believe this would require changing the DefaultHtmlMapper to "know" about valid attributes for different elements. I've filed TIKA-430 for this; please take a look and comment.
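
To make the idea concrete, here is a minimal sketch of a per-element attribute table. It is not the DefaultHtmlMapper code; the class name AttributeWhitelist and the tiny tables are assumptions for illustration only. It also treats a handful of core attributes as valid on every element, which simplifies the real XHTML 1.0 rules.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Tika's DefaultHtmlMapper.
final class AttributeWhitelist {
    // Simplification: treat these as accepted on every element.
    private static final Set<String> COMMON =
            Set.of("id", "class", "title", "lang", "dir");

    // Small illustrative subset of element-specific attributes.
    private static final Map<String, Set<String>> PER_ELEMENT = Map.of(
            "a", Set.of("href", "name", "rel", "hreflang"),
            "img", Set.of("src", "alt", "width", "height"),
            "td", Set.of("colspan", "rowspan"));

    // Returns the normalized (lower-case) attribute name, or null if the
    // attribute is not in the table for this element.
    static String mapSafeAttribute(String elementName, String attributeName) {
        String element = elementName.toLowerCase(Locale.ROOT);
        String attribute = attributeName.toLowerCase(Locale.ROOT);
        if (COMMON.contains(attribute)) {
            return attribute;
        }
        Set<String> specific = PER_ELEMENT.getOrDefault(element, Set.of());
        return specific.contains(attribute) ? attribute : null;
    }
}
```

With this table, mapSafeAttribute("A", "HREF") gives "href", while an attribute such as "onclick" on <p> is dropped (null).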

3. Make it easier for parsers to correctly add valid attributes to XHTML elements.

For example, the PDF parser might have language data that it would like to use to label individual paragraphs.

I think this would require a utility that takes a normalized element name and a generic attribute name (e.g. something from Dublin Core), and returns what the element attribute name should be (or null, if not appropriate).


I'd like to get some feedback on this approach, before filing an issue.
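
As a rough sketch of what the utility in item 3 might look like: the class name GenericAttributeResolver, the mapping table, and the set of elements it accepts are all assumptions made up for this example, not an existing Tika API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; names and tables are illustrative assumptions.
final class GenericAttributeResolver {
    // Generic (Dublin Core style) name -> XHTML attribute name.
    private static final Map<String, String> GENERIC_TO_XHTML = Map.of(
            "language", "lang",
            "title", "title");

    // Elements this sketch treats as able to carry those attributes.
    private static final Set<String> TEXT_ELEMENTS =
            Set.of("html", "body", "div", "p", "span");

    // Returns the attribute name to use on the element, or null if the
    // generic name has no appropriate attribute for that element.
    static String resolve(String elementName, String genericName) {
        String element = elementName.toLowerCase(Locale.ROOT);
        if (!TEXT_ELEMENTS.contains(element)) {
            return null;
        }
        return GENERIC_TO_XHTML.get(genericName.toLowerCase(Locale.ROOT));
    }
}
```

A PDF parser with per-paragraph language data could then call resolve("p", "language") and get back "lang", while an unmapped name such as "creator" yields null.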

4. Make it possible for parsers to return non-standard attributes

Andrzej requested this, and suggested the use of namespaces to avoid generating invalid XHTML output.

But currently we strip out namespaces from the source XHTML, for example, as they can make processing the resulting data much harder if you're using XPath expressions. I don't know if the same would be true for clients of Tika. Any thoughts on this?
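
To illustrate the XPath concern with plain JDK classes (no Tika involved): in XPath 1.0 an unprefixed name only matches elements in no namespace, so the same expression stops matching once the document declares a default namespace. The class name NamespaceXPathDemo is just a placeholder for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: counts the nodes an XPath expression selects.
final class NamespaceXPathDemo {
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep namespace information in the DOM
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

For <html><body><p>text</p></body></html>, the expression /html/body/p selects one node. Add xmlns="http://www.w3.org/1999/xhtml" to the <html> element and the same expression selects none; the caller then needs a NamespaceContext with a prefix, or a workaround such as //*[local-name()='p'].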


5. Validation of attribute values

Not sure if this is important, but if we want the XHTML output to be valid, then what you can put in an attribute value has some general restrictions (e.g. must be quoted) and some specific restrictions based on the actual attribute.

So an open question is whether the mapSafeAttributes() method should also take the attribute value, and do simple fixup (quoting) or rejection of invalid values. This would mean passing in the attribute value, and returning an "attribute record" (or null) in the response, to be able to pass back normalized name & value. Again, any thoughts on this?
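
A minimal sketch of the "attribute record" idea, assuming a three-argument variant of the mapping method. The class names, the method signature, and the handful of rules shown are hypothetical; they only illustrate fixup versus rejection, not what Tika does.

```java
import java.util.Locale;

// Hypothetical sketch of a value-aware attribute mapper.
final class AttributeValueNormalizer {
    // Normalized name and value handed back to the caller.
    static final class AttributeRecord {
        final String name;
        final String value;

        AttributeRecord(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Returns a normalized record, or null if the attribute or its value
    // is rejected for this element.
    static AttributeRecord mapSafeAttribute(
            String elementName, String attributeName, String attributeValue) {
        String element = elementName.toLowerCase(Locale.ROOT);
        String name = attributeName.toLowerCase(Locale.ROOT);
        String value = attributeValue.trim(); // simple fixup: trim whitespace

        if (name.equals("colspan") || name.equals("rowspan")) {
            // Element-specific rule: only table cells, digits only.
            boolean isCell = element.equals("td") || element.equals("th");
            if (!isCell || !value.matches("[0-9]+")) {
                return null;
            }
        } else if (name.equals("dir")) {
            // Attribute-specific rule: normalize case, allow ltr/rtl only.
            value = value.toLowerCase(Locale.ROOT);
            if (!value.equals("ltr") && !value.equals("rtl")) {
                return null;
            }
        }
        return new AttributeRecord(name, value);
    }
}
```

Here mapSafeAttribute("p", "DIR", " RTL ") comes back as the record (dir, rtl), while a colspan of "two" on <td> is rejected with null. Quoting itself would normally be left to whatever serializes the SAX events.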


Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
