https://bz.apache.org/bugzilla/show_bug.cgi?id=60570

Tim Allison <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #4 from Tim Allison <[email protected]> ---
r1779493

This patch adds a rudimentary parser for EMF and EMFPlus records, with the
goal of extracting embedded PDFs (and other binary files) as well as WMFs.
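
For readers unfamiliar with the format, the record walk that such a parse
relies on can be sketched as below. This is a minimal illustration, not the
code committed in this patch. It assumes only the record framing from the
MS-EMF specification: every record starts with a little-endian 32-bit type
followed by a 32-bit size that includes those 8 header bytes. The names
EmfRecordWalker, Record, and walk are hypothetical and are not POI API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of POI.
class EmfRecordWalker {
    // Record type values as defined in the MS-EMF specification.
    static final int EMR_EOF = 14;
    static final int EMR_GDICOMMENT = 70; // carries EMFPlus and other embedded data

    // Position and framing of one record inside the EMF byte stream.
    static final class Record {
        final int type;
        final int offset;
        final int size;

        Record(int type, int offset, int size) {
            this.type = type;
            this.offset = offset;
            this.size = size;
        }
    }

    // Walks the stream record by record and stops at EMR_EOF,
    // at the end of the data, or at the first malformed size field.
    static List<Record> walk(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<Record> records = new ArrayList<>();
        int offset = 0;
        while (offset + 8 <= emf.length) {
            int type = buf.getInt(offset);
            // The size field is unsigned, so widen it before comparing.
            long size = buf.getInt(offset + 4) & 0xFFFFFFFFL;
            if (size < 8 || size > emf.length - offset) {
                break; // malformed or truncated record
            }
            records.add(new Record(type, offset, (int) size));
            if (type == EMR_EOF) {
                break;
            }
            offset += (int) size;
        }
        return records;
    }
}
```

A caller would then inspect the payload of each record of interest (for
example, the bytes following the header of an EMR_GDICOMMENT record) to look
for embedded files.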

This offers a start towards text extraction, although more work remains,
including:
1) parsing and tracking the fonts to handle ExtTextOutA and PolyTextOutA
2) implementing the PolyTextOut records (I couldn't find examples)

I developed this code with EMFs and WMFs extracted from Common Crawl and
govdocs1. I only included unit tests for EMFs/WMFs that I could extract from
POI's test files and/or Tika's test files.

If we're OK with adding Common Crawl and/or govdocs1 docs to our unit test
suite, I can add more unit tests.
