Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Jukka Zitting Mon, 08 Dec 2008 04:33:07 -0800

Hi,

On Mon, Dec 8, 2008 at 8:44 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote:
> I don't think this is a pipe-dream - I believe (but correct me if I got the
> wrong impression...) that only a minority of the code in PDFBox and POI for
> example is relevant at all to Tika, and that Tika could do better by copying
> only the relevant parts of the code rather than using the whole code as
> a black box.


As mentioned before, I don't think that's a good idea as we don't have
the required format-specific expertise here and IMHO trying to gather
such expertise into a single project and community would be a futile
exercise.

I really don't want to start dealing with detailed questions about why
this specific PDF construct or Office XML feature is not supported by
Tika. It's an ocean of trouble that we're in no way equipped to
handle.

> I would be even happier if those projects made the separation
> themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
> exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
> doubtful that this will happen on its own any time soon.

This sounds like a much more fertile approach, and I don't think it's
all that far-fetched. Now with Tika we have a clear rationale for why
such a trimmed-down component would be useful, and I think many parser
libraries would be happy to consider such proposals.

I'm currently mentoring the PDFBox project at the Apache Incubator,
and I think the project would respond really well if someone came
there with a proposal to generate such an extra pdfbox-extract
release artifact.

We can and should work together with the parser projects to address
the requirements we see. That's a much better alternative than forking
parts of those projects inside Tika.

BR,

Jukka Zitting
