That doc appears to contain the text 'abcd...' with a font called Symbol applied
to it, which is why it displays as Greek chars.

I think the result in Tika/POI is correct.

They could create the doc using a font like Arial and use the actual Greek 
chars - instead of using the Latin chars and applying the Symbol font.
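
If someone did want to convert on the extraction side, a rough sketch of the idea
is below. This is not existing POI or Tika code: the class name, the method name
and the lookup table are all made up for illustration, and the table is my
assumption of how the Symbol font lays out a-z (a=alpha, b=beta, c=chi, ...), so
it should be checked against the published Adobe Symbol to Unicode mapping before
anyone relies on it. It would only make sense for runs that are known to use the
Symbol font.

```java
// Hypothetical helper, not part of POI or Tika.
final class SymbolFontSketch {

    // ASSUMPTION: Greek letters that the Symbol font shows for 'a'..'z',
    // in order. Verify against the Adobe Symbol to Unicode mapping table.
    private static final String GREEK_LOWER =
        "\u03B1\u03B2\u03C7\u03B4\u03B5\u03C6\u03B3\u03B7\u03B9\u03D5\u03BA\u03BB\u03BC"
      + "\u03BD\u03BF\u03C0\u03B8\u03C1\u03C3\u03C4\u03C5\u03D6\u03C9\u03BE\u03C8\u03B6";

    private SymbolFontSketch() {
    }

    // Replaces stored Latin lowercase letters with the Greek letters that
    // the Symbol font would display; every other character is kept as-is.
    static String symbolToUnicode(String stored) {
        StringBuilder sb = new StringBuilder(stored.length());
        for (int i = 0; i < stored.length(); i++) {
            char c = stored.charAt(i);
            if (c >= 'a' && c <= 'z') {
                sb.append(GREEK_LOWER.charAt(c - 'a'));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the Greek letters alpha, beta, chi for the stored text "abc".
        System.out.println(symbolToUnicode("abc"));
    }
}
```

The sketch only covers lowercase letters; uppercase letters, digits and
punctuation in the Symbol font would need their own table entries.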





On Wednesday 12 April 2023 at 22:08:23 GMT+2, Tim Allison <talli...@apache.org> 
wrote: 





Fellow devs,

Has anyone fought with charsets for fonts in ooxml?  The following
issue has a file that uses the Symbol font, which has charset=02
(Symbol).  This rings a vague bell in hwpf.  Do we have handling for
this in docx, or, if it exists in hwpf, is it fairly easy to port from
there?

And of course, in the email below the characters have been modified
back to the underlying text, but they should be "alpha" "beta" "chi",
etc.  See the screenshot on the issue.

Thank you!

Best,

        Tim

---------- Forwarded message ---------
From: Tim Allison (Jira) <j...@apache.org>
Date: Wed, Apr 12, 2023 at 3:51 PM
Subject: [jira] [Created] (TIKA-4015) Extract symbols as symbols from .docx
To: <d...@tika.apache.org>


Tim Allison created TIKA-4015:
---------------------------------

            Summary: Extract symbols as symbols from .docx
                Key: TIKA-4015
                URL: https://issues.apache.org/jira/browse/TIKA-4015
            Project: Tika
          Issue Type: New Feature
            Reporter: Tim Allison
        Attachments: symbol.docx.zip

[~chetab] raised this issue on the user list and supplied an example document.

The font is Symbol and the text should be: abcedefghijklmnopqrstuvwxyz

However, the text as literally stored in the docx and extracted by
Tika is: abcedefghijklmnopqrstuvwxyz



We may need to add processing for Unicode mappings or the equivalent
in ooxml.  I haven't seen this before. :P



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
For additional commands, e-mail: dev-h...@poi.apache.org

