Mismatch between XeLaTeX fontspec and Apache PDFBox

Flynn, Peter Thu, 25 Jan 2018 07:47:41 -0800

I have a very large number of bibliographic references in BibTeX format which 
we need to make available individually in formal reference formats within web 
pages (as HTML, not as embedded images).


I experimented a couple of years ago with Apache PDFBox and found that it could 
extract the text from a PDF and preserve bold and italics. This would let us 
use LaTeX to typeset each PDF in the required format, and then have PDFBox 
extract the text with bold and italics in all the right places.
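
To make the intended final step concrete, here is a minimal sketch in plain Java of turning styled text runs into HTML. It does not call PDFBox at all: the Run class, the font names and the sample text are hypothetical placeholders, and guessing bold or italic from the font name is only an assumption about how such a tool might decide (a real extractor may rely on font descriptor flags or weights instead).

```java
import java.util.List;
import java.util.Locale;

// Illustrative only: none of these names come from PDFBox.
final class StyledRunsToHtml {

    /** One run of text plus the name of the font it was set in (hypothetical type). */
    static final class Run {
        final String text;
        final String fontName;

        Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Drops a six-letter subset prefix such as "ABCDEF+" if one is present. */
    static String baseName(String fontName) {
        return fontName.matches("[A-Z]{6}\\+.*") ? fontName.substring(7) : fontName;
    }

    /** Assumption: a bold face has "bold" somewhere in its font name. */
    static boolean isBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Assumption: an italic face has "italic" or "oblique" in its font name. */
    static boolean isItalic(String fontName) {
        String name = baseName(fontName).toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    /** Escapes the three characters that are unsafe in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps each run in <i> and/or <b> according to its font name. */
    static String toHtml(List<Run> runs) {
        StringBuilder html = new StringBuilder();
        for (Run run : runs) {
            String piece = escape(run.text);
            if (isItalic(run.fontName)) {
                piece = "<i>" + piece + "</i>";
            }
            if (isBold(run.fontName)) {
                piece = "<b>" + piece + "</b>";
            }
            html.append(piece);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Placeholder reference text and made-up font names.
        List<Run> runs = List.of(
            new Run("A. Author, ", "ABCDEF+ExampleSerif-Regular"),
            new Run("A Book Title", "ABCDEF+ExampleSerif-Italic"),
            new Run(", vol. ", "ExampleSerif-Regular"),
            new Run("12", "ExampleSerif-Bold"));
        System.out.println(toHtml(runs));
    }
}
```

If a tool does work from font names like this, then any change in how the fonts are named or embedded in the PDF could alter the result, but that is speculation on my part.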

Regular pdflatex with old-style bibtex is insufficient, as it doesn't handle 
all the UTF-8 characters we need, and the reference formats supported are out 
of date; XeLaTeX with biblatex and biber do all this just fine...but...

...if I do this using the fontspec package (the standard way to provide XeLaTeX 
with the font data for handling UTF-8 diacritics), the output contains all the 
accented characters, but PDFBox doesn't recognise the bold or italic. If I omit 
the fontspec package, PDFBox picks up the bold and italics, but XeLaTeX omits 
the diacritics.

Examples of both PDFs and both HTML files are at 
http://epu.ucc.ie/latex/pdfbox-xelatex-fontspec-error.zip

As I don't know the internals either of fontspec or of PDFBox, I am hoping that 
someone on the pdfbox mailing list or the comp.text.tex newsgroup may have a 
lead.

///Peter
