Hi all,

I've been looking into improving Tika's handling of attributes - for both HTML and other formats.

I've seen several different issues that all seem related:

1. Ability to allow all attributes through from HTML documents

TIKA-379, building on TIKA-347, allows more relaxed passing of attributes and also lets all elements through.

So if somebody wants to get the "lang" attribute for the <html> element of an HTML document, they could do this by using the IdentityHtmlMapper.

Assuming Chris reviews and commits this, the issue would be solved.

2. Automatically let all valid XHTML 1.0 attributes through from HTML documents

This would be an improvement, as many consumers of parse output wouldn't want to process the raw (unnormalized) elements they'd get with the IdentityHtmlMapper, but they would want to get any standard attributes.

I believe this would require changing the DefaultHtmlMapper to "know" about valid attributes for different elements. I've filed TIKA-430 for this; please take a look and comment.
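
A minimal sketch of what "knowing" about valid attributes could look like, assuming a simple lookup table. The class name SafeAttributeTable is hypothetical, the attribute lists are a small illustrative subset of XHTML 1.0, and treating the common attributes as valid on every element is a simplification. This is not what DefaultHtmlMapper currently does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; not part of Tika. */
final class SafeAttributeTable {

    // Attributes treated as valid on any element (a simplification).
    private static final Set<String> COMMON = set("id", "class", "title", "lang", "dir");

    // Element-specific attributes: an illustrative subset, not the full DTD.
    private static final Map<String, Set<String>> PER_ELEMENT = new HashMap<>();
    static {
        PER_ELEMENT.put("a", set("href", "name", "rel", "rev", "hreflang", "type"));
        PER_ELEMENT.put("img", set("src", "alt", "height", "width"));
        PER_ELEMENT.put("td", set("colspan", "rowspan", "headers"));
    }

    private static Set<String> set(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Returns the normalized (lower-case) attribute name, or null if it should be dropped. */
    static String mapSafeAttribute(String elementName, String attributeName) {
        String element = elementName.toLowerCase(Locale.ROOT);
        String attribute = attributeName.toLowerCase(Locale.ROOT);
        if (COMMON.contains(attribute)) {
            return attribute;
        }
        Set<String> specific = PER_ELEMENT.get(element);
        return (specific != null && specific.contains(attribute)) ? attribute : null;
    }
}
```

With a table like this, a standard attribute such as "href" on <a> passes through, while something like "onclick" on <p> is dropped.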

3. Make it easier for parsers to correctly add valid attributes to XHTML elements.

For example, the PDF parser might have language data that it would like to use to label individual paragraphs.

I think this would require a utility that takes a normalized element name and a generic attribute name (e.g. something from Dublin Core), and returns what the element attribute name should be (or null, if not appropriate).
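
To make the proposal concrete, here is a rough sketch of such a utility. Everything in it is an assumption for illustration: the class name GenericAttributeMapper, the method name attributeFor, the generic names ("language", "title", "description"), and the small mapping tables. The one XHTML fact it relies on is that some elements, such as <br>, do not accept "lang" in XHTML 1.0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; names and tables are illustrative. */
final class GenericAttributeMapper {

    // Generic property name -> default XHTML attribute name.
    private static final Map<String, String> DEFAULTS = new HashMap<>();

    // "element/property" -> attribute name, where an element needs a different attribute.
    private static final Map<String, String> OVERRIDES = new HashMap<>();

    // "element/attribute" pairs that the element does not accept (partial list).
    private static final Set<String> UNSUPPORTED =
            new HashSet<>(Arrays.asList("br/lang", "script/lang", "script/title"));

    static {
        DEFAULTS.put("language", "lang");
        DEFAULTS.put("title", "title");
        OVERRIDES.put("img/description", "alt");
    }

    /** Returns the attribute name to use on the element, or null if not appropriate. */
    static String attributeFor(String elementName, String genericName) {
        String override = OVERRIDES.get(elementName + "/" + genericName);
        String attribute = (override != null) ? override : DEFAULTS.get(genericName);
        if (attribute == null || UNSUPPORTED.contains(elementName + "/" + attribute)) {
            return null;
        }
        return attribute;
    }
}
```

A parser such as the PDF parser could then call attributeFor("p", "language") and, on a non-null result, add that attribute to the paragraph element it emits.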

I'd like to get some feedback on this approach, before filing an issue.

4. Make it possible for parsers to return non-standard attributes

Andrzej requested this, and suggested the use of namespaces to avoid generating invalid XHTML output.

But we, for example, currently strip namespaces out of the source XHTML, as they can make processing the resulting data much harder if you're using XPath expressions. I don't know if the same would be true for clients of Tika. Any thoughts on this?
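
For anyone who hasn't hit the XPath problem, here is a small self-contained illustration using only the JDK's XML APIs. The class name NamespaceXPathDemo and the sample document are made up for this example. With a namespace-aware parse, an unprefixed name test such as //p only matches elements in no namespace, so it finds nothing in namespaced XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustration only: how a default namespace affects simple XPath expressions. */
final class NamespaceXPathDemo {

    static final String XHTML =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<body><p>one</p><p>two</p></body></html>";

    /** Evaluates a numeric XPath expression against the sample document. */
    static int evaluate(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XHTML)));
            Double result = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NUMBER);
            return result.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here evaluate("count(//p)") gives 0, while evaluate("count(//*[local-name()='p'])") gives 2. Clients either have to register a prefix for the XHTML namespace or fall back to local-name() tests, which is the extra work that stripping namespaces avoids.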

5. Validation of attribute values

I'm not sure if this is important, but if we want the XHTML output to be valid, then attribute values are subject to some general restrictions (e.g. they must be quoted) and some specific restrictions that depend on the actual attribute.

So an open question is whether the mapSafeAttributes() method should also take the attribute value, and do simple fixup (quoting) or rejection of invalid values. This would mean passing in the attribute value, and returning an "attribute record" (or null) in the response, to be able to pass back normalized name & value. Again, any thoughts on this?
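
A sketch of what that variant might look like, to help the discussion. The names AttributeValidator and AttributeRecord are hypothetical, as is the three-argument signature. It assumes the incoming value is raw (unescaped) text and that the caller writes the returned value between double quotes itself; if a downstream serializer already escapes values, the escaping step here should be dropped. Only a few attribute-specific rules are shown.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch only; not an existing Tika API. */
final class AttributeValidator {

    /** Normalized name and value handed back to the caller. */
    static final class AttributeRecord {
        final String name;
        final String value;

        AttributeRecord(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    /**
     * Returns a normalized record, or null if the value is invalid for the attribute.
     * elementName is unused in this sketch; it is kept so element-specific rules could be added.
     */
    static AttributeRecord mapSafeAttribute(String elementName, String attributeName, String value) {
        if (attributeName == null || value == null) {
            return null;
        }
        String name = attributeName.toLowerCase(Locale.ROOT);
        String trimmed = value.trim();
        if (name.equals("colspan") || name.equals("rowspan")) {
            // Specific restriction: must be a non-negative integer.
            return DIGITS.matcher(trimmed).matches() ? new AttributeRecord(name, trimmed) : null;
        }
        if (name.equals("dir")) {
            // Specific restriction: only "ltr" or "rtl" are allowed.
            String dir = trimmed.toLowerCase(Locale.ROOT);
            return (dir.equals("ltr") || dir.equals("rtl")) ? new AttributeRecord(name, dir) : null;
        }
        // General fixup: make the value safe to place inside double quotes.
        return new AttributeRecord(name, escape(value));
    }

    private static String escape(String value) {
        // '&' must be replaced first so the entities added below are not re-escaped.
        return value.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

Returning a record rather than just a name lets the caller pick up both the normalized name and the fixed-up value in one call, and a null result still means "drop this attribute".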

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



