Hi,

As you've seen, I've been refactoring the Parser classes quite heavily
for the past few weeks, and now with TIKA-43 I'm reaching a milestone
that already resembles the proposed interface design.

Once TIKA-43 is committed (I'm giving it a day or two for reviews and
comments) there are still two Parser-related changes that I'd like to
make before I think we're ready for the first 0.1 release.

First, I'd like to replace the current Iterable<Content> construct
with a Metadata object that allows metadata to be passed in and out of
the parser. Also, this Metadata object should be decoupled from parser
configuration.
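
For illustration, the first change might look roughly like the sketch
below. The names MetadataSketch, Metadata, Parser and PlainTextParser,
and the metadata keys, are hypothetical placeholders rather than the
actual Tika API. The point is only that the same object carries hints
into the parser and extracted values back out, and holds no parser
configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class MetadataSketch {

    /** Hypothetical container for document metadata only; no parser configuration lives here. */
    public static class Metadata {
        private final Map<String, String> values = new HashMap<String, String>();

        public String get(String name) {
            return values.get(name);
        }

        public void set(String name, String value) {
            values.put(name, value);
        }
    }

    /** Hypothetical parser signature: metadata goes in as hints and comes out enriched. */
    public interface Parser {
        String parse(InputStream stream, Metadata metadata) throws IOException;
    }

    /** Toy parser that honours an encoding hint and records what it found. */
    public static class PlainTextParser implements Parser {
        public String parse(InputStream stream, Metadata metadata) throws IOException {
            // Input: the caller may supply an encoding hint (hypothetical key name).
            String encoding = metadata.get("Content-Encoding");
            if (encoding == null) {
                encoding = "UTF-8";
            }

            // Read the whole stream into memory; fine for a sketch, not for large documents.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }

            // Output: the parser reports what it used and what it measured.
            metadata.set("Content-Encoding", encoding);
            metadata.set("Content-Length", Integer.toString(buffer.size()));
            return buffer.toString(encoding);
        }
    }

    public static void main(String[] args) throws IOException {
        Metadata metadata = new Metadata();
        InputStream stream = new ByteArrayInputStream("hello".getBytes("UTF-8"));
        String text = new PlainTextParser().parse(stream, metadata);
        System.out.println(text + " (" + metadata.get("Content-Length") + " bytes)");
    }
}
```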

Second, instead of returning the text content of a document as a
String, I'd like the parsers to generate SAX events with the text
content passed as characters() events.
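
A minimal sketch of that idea, using the standard org.xml.sax
interfaces from the JDK. The class SaxTextSketch, the emitText method
and the TextCollector handler are hypothetical names used only for
illustration, not part of Tika. A caller that still wants a plain
String can recover it by collecting the characters() events.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextSketch {

    /** Hypothetical parser step: emit extracted text as SAX events instead of returning a String. */
    public static void emitText(String text, ContentHandler handler) throws SAXException {
        handler.startDocument();
        char[] chars = text.toCharArray();
        // The text content is delivered through characters() events.
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }

    /** Handler that concatenates all characters() events into one string. */
    public static class TextCollector extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    public static void main(String[] args) throws SAXException {
        TextCollector collector = new TextCollector();
        emitText("Hello, Tika!", collector);
        System.out.println(collector);
    }
}
```

Because the handler receives the text in pieces, a real parser could
call characters() repeatedly while reading, so a large document would
not need to be held in memory as one String.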

Unless anyone objects (feel free to do so if you have better design
ideas!), I'll follow up with new patches for these two issues in the
next week or two. Once these changes are done, I think we're good to
go for the first Tika release. That timing would also be perfect for
the upcoming ApacheCon US conference. :-)

BR,

Jukka Zitting
