[
https://issues.apache.org/jira/browse/PDFBOX-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968291#comment-14968291
]
John Hewson edited comment on PDFBOX-3043 at 10/22/15 12:59 AM:
----------------------------------------------------------------
Yep, that's a LaTeX thing. The characters in the content stream are actually
{code} ⃝{code} and {code}c{code}, with the latter drawn on top of the former.
That's just how TeX handles diacritics: its fonts predate Unicode, so it builds
such symbols by layering characters.
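To illustrate why the paint order matters for extraction: in Unicode a combining mark such as U+20DD (COMBINING ENCLOSING CIRCLE) must follow its base character, whereas here the circle is drawn before the "c". The sketch below is only an illustration under that assumption; the class and method names are hypothetical and this is not how PDFBox handles overlaid glyphs.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
class OverlayOrderSketch {

    /** True for Unicode combining marks (general categories Mn, Mc, Me). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /**
     * Joins two overlaid glyphs, given in paint order, so that a combining
     * mark ends up after its base character, as Unicode text expects.
     */
    static String joinOverlaid(int paintedFirst, int paintedSecond) {
        StringBuilder sb = new StringBuilder();
        if (isCombiningMark(paintedFirst) && !isCombiningMark(paintedSecond)) {
            // Mark was painted first: emit the base, then the mark.
            sb.appendCodePoint(paintedSecond).appendCodePoint(paintedFirst);
        } else {
            // Already in base + mark order (or no mark involved): keep as is.
            sb.appendCodePoint(paintedFirst).appendCodePoint(paintedSecond);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Paint order from the content stream: circle first, then "c".
        String text = joinOverlaid(0x20DD, 'c');
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

With the circle treated as a combining mark, the pair would come out as a single "c" enclosed in a circle rather than two separate characters.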
The encoding of the circle is a known mapping for the "circlecopyrt" character
and is built into PDFBox's [additional
glyphlist|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/glyphlist/additional.txt];
perhaps we should be mapping it to a combining circle instead. Note that the
additional list is not standard in any way; it's just a collection of common
mappings which we've encountered over the years and ship. It's only really used
by PDFTextStripper, though the code which loads it can be found in
PDFTextStreamEngine.
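As a rough sketch of what changing the mapping would mean, the snippet below parses entries in the Adobe Glyph List text format ({{name;HEX HEX ...}}, with {{#}} comment lines) and compares two candidate entries for "circlecopyrt". It assumes the additional list uses that same format, and both sample entries are illustrative rather than copied from the file. {{GlyphListSketch}} is a hypothetical class, not PDFBox's own loader.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a glyph list loader; not PDFBox's implementation.
class GlyphListSketch {

    /** Parses "name;HEX HEX ..." lines into a glyph-name to Unicode map. */
    static Map<String, String> parse(String text) {
        Map<String, String> map = new HashMap<>();
        for (String line : text.split("\\R")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                continue; // ignore malformed lines in this sketch
            }
            StringBuilder unicode = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            map.put(parts[0].trim(), unicode.toString());
        }
        return map;
    }

    public static void main(String[] args) {
        // Assumed entries, for illustration only.
        String copyrightSign = "# maps the circle glyph to U+00A9\ncirclecopyrt;00A9\n";
        String combiningCircle = "# maps the circle glyph to U+20DD\ncirclecopyrt;20DD\n";

        String a = parse(copyrightSign).get("circlecopyrt");
        String b = parse(combiningCircle).get("circlecopyrt");
        System.out.printf("U+%04X vs U+%04X%n", a.codePointAt(0), b.codePointAt(0));
    }
}
```

With the first entry the circle becomes a standalone copyright sign next to the "c"; with the second it becomes a combining mark that can attach to the "c".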
> Character is extracted twice
> ----------------------------
>
> Key: PDFBOX-3043
> URL: https://issues.apache.org/jira/browse/PDFBOX-3043
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Ben McCann
> Attachments: cweb2.pdf
>
>
> This document has a © symbol. It's being extracted as "c©". I wanted to check
> if this is a bug.
> One of the things that's strange about this is that PDFTextStripper first
> processes "c" and then processes "©". However, PrintTextLocations prints them
> in the other order:
> String[214.936,618.879 fs=9.963 xscale=9.963 height=8.642903 space=9.963 width=9.962997]©
> String[217.704,618.579 fs=9.963 xscale=9.963 height=6.072449 space=5.537458 width=4.4235687]c