Yes, I did use the version from SVN. Hopefully we can get Kevin's feedback.
I've done a few more side-by-side comparisons with PDFBox, and while the
"Tl < 200" logic seems entirely consistent, I don't think my change with Td
is quite as solid: it has introduced extra spaces in a couple of PDFs.
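To make the Td trade-off concrete, here is a minimal sketch of the general
idea. It is not the patch under discussion and it does not use iText's parser
classes: Op, TdSpaceSketch and extract are hypothetical names made up for
illustration. The assumption is simply that a space is inserted when a Td
move with a nonzero offset sits between two shown strings and neither side
already supplies whitespace.

```java
import java.util.List;

// Illustrative only: a toy model of "insert a space after Td", not iText code.
class TdSpaceSketch {

    // Hypothetical stand-in for one parsed content-stream operation.
    static final class Op {
        final String operator; // "Tj" (show text) or "Td" (move text position)
        final String text;     // string operand for Tj, null otherwise
        final float tx;        // horizontal offset operand for Td
        final float ty;        // vertical offset operand for Td

        private Op(String operator, String text, float tx, float ty) {
            this.operator = operator;
            this.text = text;
            this.tx = tx;
            this.ty = ty;
        }

        static Op show(String text) {
            return new Op("Tj", text, 0f, 0f);
        }

        static Op move(float tx, float ty) {
            return new Op("Td", null, tx, ty);
        }
    }

    // Concatenates shown text, adding one space where a nonzero Td move
    // separates two strings that do not already carry whitespace.
    static String extract(List<Op> ops) {
        StringBuilder out = new StringBuilder();
        boolean moved = false;
        for (Op op : ops) {
            if ("Td".equals(op.operator)) {
                // Assumption: a zero offset is treated as "no move".
                if (op.tx != 0f || op.ty != 0f) {
                    moved = true;
                }
            } else if ("Tj".equals(op.operator)) {
                if (moved && needsSeparator(out, op.text)) {
                    out.append(' ');
                }
                out.append(op.text);
                moved = false;
            }
        }
        return out.toString();
    }

    // True when neither the end of the output nor the start of the next
    // string is whitespace, so a separator would not be a duplicate.
    static boolean needsSeparator(CharSequence out, String next) {
        if (out.length() == 0 || next.isEmpty()) {
            return false;
        }
        return !Character.isWhitespace(out.charAt(out.length() - 1))
                && !Character.isWhitespace(next.charAt(0));
    }
}
```

A rule this simple treats every nonzero Td the same way, so a producer that
uses Td to reposition inside a word would still get a space it should not
have, which is one way a Td-based change could yield the extra spaces
mentioned above.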
Alex Vigdor wrote:
> Once again, I don't know if this is an ideal or even
> appropriate patch, not knowing the code deeply, but works in the cases I
> am testing.
I've looked at your changes, but I don't know the parser packages well
enough to decide whether or not your approach is the best way
One more follow-up: the words with 0 kerning that needed a space had 'Td' or
new line commands before them that were not working properly. I found
another approach to fix those cases that doesn't introduce a space in places
where there is legitimately 0 kerning. The new patch follows. Once again, I
Sorry to respond so quickly to my own message, but I thought I would at
least demonstrate a naive patch - obviously this would need to be validated
against many other sources, but at least it solves this particular case.
Interestingly, I observed that in some instances words that should be separate
Hello,
I've begun experimenting with the PdfTextExtractor in iText as a
replacement for PDFBox. So far I'm very pleased with the results in
many cases; however, I've noticed several examples where all the words
in the extracted text run together without spaces, so perhaps some
tweaking is