Hi Nick,
On 3/27/2016 6:52 PM, Nick Burch wrote:
On Sun, 27 Mar 2016, Bob Paulin wrote:
Currently the Apache POI dependency is in several modules and it's
sort of a beast (> 2 MB in size).
You should've seen it before Jukka and Yegor spent a crazy ApacheCon
hacking up the ooxml-lite support... ;-)
I can only imagine.
It appears many of the modules are only using the IOUtils library.
I suspect a strong overlap with the parser classes I've helped write...
Any concerns with replacing this POI stuff with commons-io? Does POI
offer anything above the commons-io functionality in IOUtils? If not
I think it would be great to isolate the poi dependency to the office
module only.
A lot of the use is for endian-specific reading of numbers and
strings. Might be a bit of stream stuff, but mostly that can be passed
off to the Tika IO utils classes.
Didn't even think of looking at Tika IO, but yes, that would be even better.
From a quick check, I can't see any endian number stuff in commons IO, but
I might have missed it, or it might be in a different commons module. If
not, there might be something to be said for popping that POI logic
along with some of the Ogg-Vorbis utils stuff (another one with my
grubby mitts all over it) into a more helpful general utils grouping.
Yes, I think overall that if these functions can live somewhere, either
inside Tika or in a smaller dependent library, we're in a better place. I'll
take a look at Ogg-Vorbis.
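To make the "endian-specific reading of numbers" part concrete, here is a rough sketch of the kind of helper that could live in such a small utils grouping. It uses only java.nio from the JDK. The class name EndianSketch and its method names are made up for illustration; they are not existing POI, Tika, or commons-io APIs, and the real code may well need more cases (longs, strings, stream-based reads).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not an existing POI, Tika,
// or commons-io class.
final class EndianSketch {
    private EndianSketch() {}

    /** Reads a signed 32-bit little-endian int starting at offset. */
    static int getIntLE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    /** Reads an unsigned 16-bit little-endian value, widened to int. */
    static int getUShortLE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 2)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort() & 0xFFFF;
    }

    /** Reads an unsigned 32-bit big-endian value, widened to long. */
    static long getUIntBE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.BIG_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
    }
}
```

Something this small would have no third-party dependencies at all, so modules that only need a few of these reads would not have to pull in the POI jar for them.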
Thanks!
Nick