https://bz.apache.org/bugzilla/show_bug.cgi?id=60570
Tim Allison <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #4 from Tim Allison <[email protected]> ---
r1779493

This patch adds the capability to perform a rudimentary parse of EMF and
EMFPlus records with the goals of extracting embedded pdfs (and other binary
files) as well as wmfs.

This offers a start towards text extraction, although more work remains,
including:

1) parsing and tracking the fonts to handle exttextouta and polytexta
2) implementation of the polytexts (I couldn't find examples)

I developed this code with emfs and wmfs extracted from commoncrawl and
govdocs1. I only included unit tests for emfs/wmfs that I could extract from
POI's test files and/or Tika's test files. If we're ok adding commoncrawl
and/or govdocs1 docs to our unit test suite, I can add more unit tests.
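
For readers unfamiliar with the format, here is a minimal sketch of what a "rudimentary parse" of EMF records involves. It is not the code from r1779493 and does not use POI's API; it only assumes the record layout from the MS-EMF specification, where every record starts with a 4-byte little-endian type and a 4-byte size that includes that 8-byte prefix. The function names `iter_emf_records` and `iter_comment_data` are hypothetical, chosen for illustration. Comment records (type 70) carry application data, and those whose data begins with the `EMF+` identifier hold EMFPlus records.

```python
import struct

# Record type values from the MS-EMF specification.
EMR_HEADER = 1    # first record of an EMF file
EMR_EOF = 14      # last record of an EMF file
EMR_COMMENT = 70  # carries arbitrary application data, including EMF+ records
EMF_PLUS_SIGNATURE = b"EMF+"  # comment identifier 0x2B464D45 as stored on disk


def iter_emf_records(data):
    """Yield (record_type, payload) for each record in raw EMF bytes.

    The payload excludes the 8-byte type/size prefix.
    """
    offset = 0
    while offset + 8 <= len(data):
        record_type, size = struct.unpack_from("<II", data, offset)
        # A size below 8 or past the end of the buffer means the file is
        # truncated or corrupt; stop rather than loop forever.
        if size < 8 or offset + size > len(data):
            raise ValueError(f"corrupt EMF record at offset {offset}")
        yield record_type, data[offset + 8 : offset + size]
        offset += size
        if record_type == EMR_EOF:
            break


def iter_comment_data(data):
    """Yield (is_emf_plus, comment_bytes) for each comment record."""
    for record_type, payload in iter_emf_records(data):
        if record_type != EMR_COMMENT or len(payload) < 4:
            continue
        # The comment payload starts with a 4-byte DataSize field.
        (data_size,) = struct.unpack_from("<I", payload, 0)
        comment = payload[4 : 4 + data_size]
        yield comment[:4] == EMF_PLUS_SIGNATURE, comment
```

A caller would read an `.emf` file into memory with `open(path, "rb").read()` and pass the bytes to `iter_comment_data`. Deciding which comment payloads contain an embedded PDF or other binary, and decoding text records, is the part the patch itself implements and is not reproduced here.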
