Re: Improved handling of attributes

2010-05-27 Thread Ken Krugler
Hi Chris, On May 26, 2010, at 6:49am, Mattmann, Chris A (388J) wrote: Hey Ken, I wanted to get back to you on this: 1. Ability to allow all attributes through from HTML documents TIKA-379, building on TIKA-347, allows both more relaxed passing of attributes, as well as letting all elements

Re: Improved handling of attributes

2010-05-26 Thread Jukka Zitting
Hi, On Wed, May 26, 2010 at 5:10 PM, Mattmann, Chris A (388J) wrote: > If so, interesting, I wonder then if there should be some sort of rethinking then of the way that we capture or represent the XHTML because I would think that our existing Metadata object could be reused at that level t

Re: Improved handling of attributes

2010-05-26 Thread Mattmann, Chris A (388J)
Hey Jukka, So you're seeing the delineation more as: * metadata = document level stuff * XHTML = textual representation [which can include finer-grained what I would call "metadata" too] ? If so, interesting, I wonder then if there should be some sort of rethinking then of the way tha

Re: Improved handling of attributes

2010-05-26 Thread Jukka Zitting
Hi, On Wed, May 26, 2010 at 3:49 PM, Mattmann, Chris A (388J) wrote: > I'm worried that we're mixing concerns here. Some of the information that you cite above sounds more to me like metadata (and in fact, thinking about it, you could argue that attributes themselves on the XHTML amount that

Re: Improved handling of attributes

2010-05-26 Thread Mattmann, Chris A (388J)
Hey Ken, I wanted to get back to you on this: > 1. Ability to allow all attributes through from HTML documents > TIKA-379, building on TIKA-347, allows both more relaxed passing of attributes, as well as letting all elements through. > So if somebody wants to get the "lang" attribute f

Improved handling of attributes

2010-05-20 Thread Ken Krugler
Hi all, I've been looking into improving Tika handling of attributes - for both HTML and other formats. There are several different issues that I've seen, that all seem related: 1. Ability to allow all attributes through from HTML documents TIKA-379, building on TIKA-347, allows both mor