That doc appears to contain the text 'abcd...' with a font called Symbol applied to it, which is why it displays as Greek chars.
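For illustration, here is a minimal sketch of what a Symbol-to-Unicode mapping step could look like if an extractor wanted to emit the Greek characters instead of the stored Latin text. The class name `SymbolFontSketch`, the method `toUnicode`, and the five-entry table are hypothetical and deliberately partial; this is not existing POI or Tika code, and a real implementation would need the full Symbol charset table and would apply it only to runs that actually use the Symbol font.

```java
import java.util.Map;

// Hypothetical helper, not part of POI or Tika.
final class SymbolFontSketch {

    // Partial, illustrative table: Latin letter as stored in the document
    // -> Greek letter that the Symbol font renders for it.
    private static final Map<Character, Character> SYMBOL_TO_UNICODE = Map.of(
            'a', '\u03B1', // alpha
            'b', '\u03B2', // beta
            'c', '\u03C7', // chi
            'd', '\u03B4', // delta
            'g', '\u03B3'  // gamma
    );

    // Maps each stored character through the table; characters without an
    // entry are passed through unchanged.
    static String toUnicode(String storedText) {
        StringBuilder out = new StringBuilder(storedText.length());
        for (int i = 0; i < storedText.length(); i++) {
            char stored = storedText.charAt(i);
            char mapped = SYMBOL_TO_UNICODE.getOrDefault(stored, stored);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumes the caller has already determined that the run's font is Symbol.
        System.out.println(toUnicode("abc"));
    }
}
```

Whether such a mapping belongs in the extractor at all is the open question in this thread; the sketch only shows the mechanics.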
I think the result in Tika/POI is correct. They could create the doc using a font like Arial and use the actual Greek chars, instead of using the Latin chars and applying the Symbol font.

On Wednesday 12 April 2023 at 22:08:23 GMT+2, Tim Allison <talli...@apache.org> wrote:

Fellow devs,

Has anyone fought with charsets for fonts in ooxml? The following issue has a file that uses the Symbol font, which has a charset=02 (Symbol) charset. This rings a vague bell in hwpf. Do we have handling for this in docx, or is it fairly easy to port from hwpf (if it exists there)?

And of course, in the email below the characters have been modified back to the underlying text, but they should be "alpha" "beta" "chi", etc... see the screenshot on the issue.

Thank you!

Best,
Tim

---------- Forwarded message ---------
From: Tim Allison (Jira) <j...@apache.org>
Date: Wed, Apr 12, 2023 at 3:51 PM
Subject: [jira] [Created] (TIKA-4015) Extract symbols as symbols from .docx
To: <d...@tika.apache.org>

Tim Allison created TIKA-4015:
---------------------------------

Summary: Extract symbols as symbols from .docx
Key: TIKA-4015
URL: https://issues.apache.org/jira/browse/TIKA-4015
Project: Tika
Issue Type: New Feature
Reporter: Tim Allison
Attachments: symbol.docx.zip

[~chetab] raised this issue on the user list and supplied an example document.

The Font is symbol and the text should be: abcedefghijklmnopqrstuvwxyz

However, the text as literally stored in the docx and extracted by Tika is: abcedefghijklmnopqrstuvwxyz

We may need to add processing for unicode mappings or the equivalent in ooxml. I haven't seen this before.
:P

--
This message was sent by Atlassian Jira (v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
For additional commands, e-mail: dev-h...@poi.apache.org