Hey Jukka,

For places like POI and PDFBox I think this could definitely work. For
places where we have Parsers but aren't ready to push them upstream yet
(two examples relevant to me are NetCDF/HDF and GDAL), we can just leave
the Parser in tika-parsers.

In other words, what you're really suggesting is that our mature Parsers
be "promoted" upstream to the communities that really understand the
underlying parsing toolkit. That makes sense to me, so long as there is a
Champion or someone in that community willing to spend the small amount
of time to learn Tika and its interfaces (if they haven't done so already).

The net effect on the casual Tika user is nil: we load Parsers via
service factories, so the only thing that changes is the package name
(and potentially the class name), and that is all behind the scenes.
The net effect on the Tika developer is that the class and package name
changes may force folks to update and recompile code that references
those classes directly, and the code, unit tests, and maintenance of
some of the parsers would no longer be readily available in Tika's
tika-parsers artifact, but would live in the upstream parser library
that Tika depends on.
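
To make the "behind the scenes" point concrete, here is a rough sketch in
plain Java. It is not Tika code: DocParser, InTreePdfParser, and
UpstreamPdfParser are hypothetical stand-ins I made up for illustration,
and the real Parser interface is richer than this. The only real API used
is java.util.ServiceLoader, which finds implementations listed in
META-INF/services provider files on the classpath. The idea is that the
caller picks a parser by media type, so it never sees which package the
implementation class lives in.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

public class ParserLookupSketch {

    /** Hypothetical, stripped-down stand-in for a parser interface. */
    public interface DocParser {
        String supportedType();
    }

    /** Pretend this implementation ships in tika-parsers today. */
    public static class InTreePdfParser implements DocParser {
        @Override
        public String supportedType() {
            return "application/pdf";
        }
    }

    /** Pretend this implementation ships in the upstream library after the move. */
    public static class UpstreamPdfParser implements DocParser {
        @Override
        public String supportedType() {
            return "application/pdf";
        }
    }

    /** Selects a parser by media type only; the concrete class name is never consulted. */
    public static Optional<DocParser> findFor(String mediaType,
                                              Iterable<DocParser> candidates) {
        for (DocParser parser : candidates) {
            if (parser.supportedType().equals(mediaType)) {
                return Optional.of(parser);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // With provider files on the classpath, the candidates would come from
        // the service loader. No provider file is assumed here, so this
        // lookup may well find nothing.
        Iterable<DocParser> discovered = ServiceLoader.load(DocParser.class);
        System.out.println("via ServiceLoader: "
                + findFor("application/pdf", discovered).isPresent());

        // Simulate the classpath before and after the parser moves upstream.
        List<DocParser> before = Arrays.<DocParser>asList(new InTreePdfParser());
        List<DocParser> after = Arrays.<DocParser>asList(new UpstreamPdfParser());

        // The calling code is identical in both cases.
        System.out.println(findFor("application/pdf", before).get().getClass().getName());
        System.out.println(findFor("application/pdf", after).get().getClass().getName());
    }
}
```

Code written like findFor() is unaffected by the move. Code that says
"new InTreePdfParser()" directly is the case that would need a change and
a recompile.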

Cheers,
Chris

On Dec 13, 2011, at 1:42 AM, Jukka Zitting wrote:

> Hi,
> 
> As you know, we see a lot of questions about version mismatches (which
> POI or PDFBox version should go with this Tika version) and there's a
> long queue of patches that are waiting for new official releases of
> our upstream dependencies to become available.
> 
> To avoid this issue I propose that we start moving some of our parser
> implementations to upstream projects. Now with Tika 1.0 out we have
> stable Parser and Detector interfaces and related APIs that upstream
> libraries could implement directly without us having to worry about
> changing Tika code whenever a new version of a parser library becomes
> available.
> 
> This would allow our users to, for example, directly upgrade to a new
> POI version without waiting for a related Tika release first.
> Similarly, a new PDF parsing option or improvement could be
> implemented directly in PDFBox and be usable without any code changes
> in Tika.
> 
> The classloading and OSGi service mechanisms we've added should make
> such upstream Parser implementations trivially easy to use, and we
> could still keep the dependencies in tika-parsers as a way to pull in
> the libraries even if the relevant implementation classes would no
> longer reside in org.apache.tika.parsers.*.
> 
> In addition to some of the GPL libraries for which we've already done
> this, I recently took the liberty of trying this out also with PDFBox.
> See PDFBOX-1132 [1] for the issue where I copied the
> org.apache.tika.pdf implementation to org.apache.pdfbox.tika. It works
> without problems, so now I'd like to propose that we copy any more
> recent PDF parser changes to PDFBox and prepare to drop the parser
> implementation in tika-parsers. Any further PDF parser work should
> then be done directly in PDFBox. I haven't yet talked about this with
> the PDFBox PMC (of which I'm a member), but I suppose we should be
> able to come up with an arrangement where Tika committers can commit
> directly to the Tika parser implementation in PDFBox.
> 
> It would be cool if we could do the same thing also with POI.
> 
> WDYT?
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX-1132
> 
> BR,
> 
> Jukka Zitting


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
