Hello,

We are using Tika to extract text from XPS files and have hit an issue
where whitespace is not emitted where we would expect it. In the attached
example file there is a large visual gap between "x" and "abcde1234f" when
the file is opened, but during extraction Tika calls `characters` with "x"
and then `characters` with "abcde1234f". We would expect an
`ignorableWhitespace` call between those two calls, but we don't get one.
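
To make the event sequence concrete, here is a minimal sketch of a recording
SAX handler. `RecordingHandler` and `replay` are hypothetical names used only
for illustration and are not part of Tika. The events are simulated by hand
here rather than produced by parsing the attached file; in real use the same
handler would be passed to Tika's
`Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)`.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records the SAX text events it
// receives so the sequence of calls can be inspected afterwards.
class RecordingHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters");
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        events.add("ignorableWhitespace");
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Simulates the two runs from the example file, with or without a
    // whitespace event between them. This is an assumption standing in
    // for the real parser output.
    static RecordingHandler replay(boolean emitWhitespace) {
        RecordingHandler handler = new RecordingHandler();
        char[] first = "x".toCharArray();
        char[] gap = " ".toCharArray();
        char[] second = "abcde1234f".toCharArray();
        handler.characters(first, 0, first.length);
        if (emitWhitespace) {
            handler.ignorableWhitespace(gap, 0, gap.length);
        }
        handler.characters(second, 0, second.length);
        return handler;
    }

    public static void main(String[] args) {
        RecordingHandler observed = replay(false);
        RecordingHandler expected = replay(true);
        System.out.println(observed.events + " -> \"" + observed.text() + "\"");
        System.out.println(expected.events + " -> \"" + expected.text() + "\"");
    }
}
```

In this sketch, a consumer that concatenates both event types ends up with the
two runs fused into one token in the first case and separated by a space in
the second, which is the difference we care about.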

I've looked through the XPS parser source code and think I've identified the
issue and how to fix it. I would like to submit a pull request on GitHub,
but the contribution requirements say I must have a Tika issue open first. My
request for an ASF account was denied, so if anyone is able to create an
issue for me, I will open my pull request against it.

Any help or feedback would be appreciated.

Kind regards,
Ruairidh

-- 
Next DLP, Huckletree West, Mediaworks, 191 Wood Ln, London W12 7FP. Company 
number 13785405.

Attachment: testXLSX.xps
Description: application/vnd.ms-xpsdocument
