Hello,

We are using Tika to extract text from XPS files and have hit an issue where whitespace is not emitted where we would expect it. In the attached example file there is a large visual gap between "x" and "abcde1234f" when the file is opened, but when Tika extracts the text it calls `characters` with "x" and then `characters` with "abcde1234f". We would expect an `ignorableWhitespace` call between those two calls, but we don't get one.
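For anyone who wants to observe the callback sequence, here is a minimal sketch of a recording handler. `EventRecorder` is a hypothetical helper, not part of Tika; it relies only on the JDK's SAX `DefaultHandler`. The `main` method merely simulates the two calls described above. With Tika on the classpath, the assumption is that the recorder would instead be passed as the `ContentHandler` argument of a parser's `parse(stream, handler, metadata, context)` call.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records which SAX callbacks fire, in order.
class EventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters(\"" + new String(ch, start, length) + "\")");
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        events.add("ignorableWhitespace(\"" + new String(ch, start, length) + "\")");
    }

    List<String> events() {
        return events;
    }

    public static void main(String[] args) {
        EventRecorder recorder = new EventRecorder();
        // Simulated calls matching the behaviour reported above;
        // no whitespace event is delivered between the two text runs.
        recorder.characters("x".toCharArray(), 0, 1);
        recorder.characters("abcde1234f".toCharArray(), 0, 10);
        recorder.events().forEach(System.out::println);
    }
}
```

If the whitespace were emitted as we expect, an `ignorableWhitespace` entry would appear between the two `characters` entries in the recorded list.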
I've taken a look through the XPS source code and think I've identified the issue and how to fix it. I would like to submit a pull request on GitHub, but the contribution requirements say I must have a Tika issue open first. My request for an ASF account was denied, so if anyone is able to create an issue for me, I will create my pull request against it.

Any help or feedback would be appreciated.

Kind regards,
Ruairidh

--
Next DLP, Huckletree West, Mediaworks, 191 Wood Ln, London W12 7FP. Company number 13785405.
testXLSX.xps
Description: application/vnd.ms-xpsdocument
