Launchpad has imported 27 comments from the remote bug at https://bugs.freedesktop.org/show_bug.cgi?id=46603.
If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2012-02-25T01:13:44+00:00 Jason Crain wrote:

Created attachment 57622
test pdf

Forwarding from gnome bugzilla:
https://bugzilla.gnome.org/show_bug.cgi?id=654473

When selecting text in this pdf, some glyphs are not visible. Text is
displayed correctly when not selected.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/6

------------------------------------------------------------------------
On 2012-02-25T01:14:58+00:00 Jason Crain wrote:

Created attachment 57623
incorrect display in evince

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/7

------------------------------------------------------------------------
On 2012-02-25T01:15:35+00:00 Jason Crain wrote:

Created attachment 57624
correct display in evince

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/8

------------------------------------------------------------------------
On 2012-02-25T01:25:40+00:00 Jason Crain wrote:

Created attachment 57625
Fixed display for selected glyph in ActualText span

This patch seems to correct the issue. It sets the correct CharCode.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/9

------------------------------------------------------------------------
On 2012-02-25T07:45:38+00:00 Albert Astals Cid wrote:

I don't think this patch is correct, you are only setting the charcode
once in ActualText::addChar so if multiple ActualText::addChar calls
happen before ActualText::end the other charcodes are lost, no?
Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/10

------------------------------------------------------------------------
On 2012-02-26T04:03:40+00:00 Adrian Johnson wrote:

The patch doesn't work when the ActualText span contains more than one
glyph. There is a test case in the test suite that demonstrates the
problem:

http://cgit.freedesktop.org/poppler/test/tree/unittestcases/WithActualText.pdf

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/11

------------------------------------------------------------------------
On 2012-03-03T16:40:29+00:00 Jason Crain wrote:

Created attachment 57989
Enable displayed chars to map to any number of text chars

It's tricky when the length of the ActualText does not match the number
of displayed glyphs. This first patch modifies the TextWord, TextLine,
TextLineFrag, TextBlock, and TextPage classes to support displayed
characters that can be mapped to any number of text characters.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/12

------------------------------------------------------------------------
On 2012-03-03T16:46:38+00:00 Jason Crain wrote:

Created attachment 57990
Fixes display for selected glyphs in ActualText span

This sets the correct CharCode for each glyph in an ActualText span.
Attempts to match one text character to each glyph. If there are more
glyphs, they are added without matching text. If there is more text,
the remaining is added to the last glyph.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/13

------------------------------------------------------------------------
On 2012-03-03T21:27:44+00:00 Adrian Johnson wrote:

Created attachment 57993
fix selection of glyphs in actualtext

Thanks for these patches.
I have started looking at some of the text related bugs and the
inability of TextOutputDev to understand the difference between glyphs
and characters is the cause of some of these bugs. The first patch is
very similar to the solution I had in mind. The first patch also fixes
bug 9001.

Some comments on the first patch:

The following code in TextBlock::coalesce() needs fixing:

    if (word2->len == word0->len &&
        !memcmp(word2->text, word0->text, word0->len * sizeof(Unicode))) {

len needs to be replaced with textLen.

I don't think addChar should be renamed to addChars. My understanding
of the code is that 'Char' is referring to the CharCodes and only one
CharCode is added per call.

I would move the surrogate decoding to TextWord::addChar() and do the
decoding as the unicode values are copied into the text array. This
avoids having to make a copy of the unicode array in
TextPage::addChar().

Some comments on the second patch:

I don't like the way the second patch is mapping the replacement text
to the charcodes. I am attaching an alternative patch that distributes
the replacement text evenly across the charcodes.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/14

------------------------------------------------------------------------
On 2012-03-04T20:38:26+00:00 Jason Crain wrote:

Comment on attachment 57993
fix selection of glyphs in actualtext

Review of attachment 57993:
-----------------------------------------------------------------

::: poppler/TextOutputDev.cc
@@ +5331,5 @@
> +  // If this is the last glyph ensure all remaining text is included
> +  // as pos may be < length due to rounding errors.
> +  if (i == lenGlyphs - 1)
> +    count = length - first;
> +  text->addChar(state, glyphs[i].x, glyphs[i].y, glyphs[i].dx, glyphs[i].dy,

This needs to make sure that a surrogate pair is not split

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/15

------------------------------------------------------------------------
On 2012-03-04T21:09:24+00:00 Jason Crain wrote:

Created attachment 58013
Enable displayed chars to map to any number of text chars

updated patch

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/16

------------------------------------------------------------------------
On 2012-03-05T15:04:19+00:00 Albert Astals Cid wrote:

Guys, i'm a bit lost, sorry, are both patches supposed to fix the same
issue in a different way?

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/17

------------------------------------------------------------------------
On 2012-03-05T15:16:54+00:00 Adrian Johnson wrote:

Both patches are required.

The "fix selection of glyphs in actualtext" ensures all charcodes are
passed through to TextPage. I need to update this to correctly handle
surrogates.

The "enable displayed chars to map to any number of text chars" is a
prerequisite for the "fix selection of glyphs in actualtext" patch. It
also fixes text selection of ligatures (bug 9001).

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/18

------------------------------------------------------------------------
On 2012-03-05T15:41:38+00:00 Albert Astals Cid wrote:

When using Jason's patch on a pdf that i will be attaching in a moment,
pdftotext changes from extracting

Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing through 𝑥. We want to define an order on the set ℒ 𝑥 ∖ {𝐿} in the same way as we did in the proof of Proposition 5.
In the disk model ̃ 𝑃 with boundary circle 𝐿, every line 𝐻 ∈ ℒ 𝑥 ∖{𝐿} separates ̃ 𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may be disconnected). Since we know from Propositions 6 and 7 that lines always intersect transversally, it follows that for two such lines, one of the respective lower parts is entirely

to

Remark 8. Ordering a line pencil. Let 𝑥 be a point and 𝐿 a noncompact line passing through 𝑥. We want to define an order on the set ℒ𝑥 ∖ {𝐿} in the same way as we with boundary circle 𝐿, every did in the proof of Proposition 5. In the disk model 𝑃 into an upper part 𝐻 + and a lower part 𝐻 − (the parts may line 𝐻 ∈ ℒ𝑥 ∖{𝐿} separates 𝑃 be disconnected). Since we know from Propositions 6 and 7 that lines always intersect transversally, it follows that for two such lines, one of the respective lower parts is entirely

You can see that in the second case some weird/unwanted reordering
happened

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/19

------------------------------------------------------------------------
On 2012-03-05T15:42:20+00:00 Albert Astals Cid wrote:

Created attachment 58045
The said pdf

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/20

------------------------------------------------------------------------
On 2012-03-08T04:05:41+00:00 Adrian Johnson wrote:

Created attachment 58178
convert utf-16 to ucs-4 when reading ToUnicode

The next two patches fix the problem of "fix selection of glyphs in
actualtext" not handling surrogates.

The "Unicode" type is meant to be UCS-4 so the solution is to convert
UTF-16 to UCS-4 when the ToUnicode cmap is parsed. This patch does the
UTF-16 conversion in CharCodeToUnicode.cc. As a result the special
surrogate handling in TextOutputDev and HtmlOutputDev can be removed.
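
For readers following the thread, the UTF-16 to UCS-4 conversion
described above can be illustrated with a minimal sketch. This is not
the code from attachment 58178: the function name utf16ToUcs4, the
Unicode alias standing in for poppler's type, and the choice to
substitute U+FFFD for unpaired surrogates are all assumptions made for
this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for poppler's Unicode type (assumed here to hold a UCS-4 code point).
using Unicode = std::uint32_t;

// Hypothetical helper: decode UTF-16 code units into UCS-4 code points.
// Unpaired surrogates become U+FFFD (a choice made for this sketch only).
std::vector<Unicode> utf16ToUcs4(const std::vector<std::uint16_t> &utf16)
{
    std::vector<Unicode> out;
    out.reserve(utf16.size());
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        const std::uint16_t u = utf16[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: valid only when followed by a low surrogate.
            if (i + 1 < utf16.size() && utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
                const Unicode hi = u - 0xD800;
                const Unicode lo = utf16[i + 1] - 0xDC00;
                const Unicode cp = 0x10000 + ((hi << 10) | lo);
                assert(cp <= 0x10FFFF); // a valid pair never leaves the Unicode range
                out.push_back(cp);
                ++i; // the low surrogate has been consumed
            } else {
                out.push_back(0xFFFD);
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            out.push_back(0xFFFD);
        } else {
            // Basic Multilingual Plane: the code unit is the code point.
            out.push_back(u);
        }
    }
    return out;
}
```

With a conversion of this kind done once at parse time, the code units
0xD835 0xDC65 arrive downstream as the single code point U+1D465 (𝑥),
so later stages have no surrogate pair that could be split.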
Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/21

------------------------------------------------------------------------
On 2012-03-08T04:09:18+00:00 Adrian Johnson wrote:

Created attachment 58179
move text string to unicode conversion into a separate function

This patch adds a new function for converting PDF text strings to
UCS-4. As a result the duplicated code in TextOutputDev and pdfinfo can
be replaced by a call to this function.

This patch is to ensure that my updated "fix selection of glyphs in
actualtext" does not have to care about surrogates.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/22

------------------------------------------------------------------------
On 2012-03-08T04:12:30+00:00 Adrian Johnson wrote:

Created attachment 58180
fix selection of glyphs in actualtext

The updated version of "fix selection of glyphs in actualtext".

The patch order is:

1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext

Patch 3 needs to be updated to remove the surrogate handling and fix
the regression.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/23

------------------------------------------------------------------------
On 2012-03-18T15:18:23+00:00 Adrian Johnson wrote:

Created attachment 58643
fix regressions

This patch fixes the regressions in "Enable displayed chars to map to
any number of text chars".

The problem is the changes now allow glyphs that map to zero length
unicode strings to be added to TextWords. Often these glyphs have
overlapping bounding boxes or are not on the same baseline. This
confuses TextOutputDev when trying to determine the layout of the text.
This patch does two things:
- it avoids breaking words when one of these glyphs with an empty
  mapping is encountered
- it increases the tolerance for overlapping bounding boxes.

With the attached PDF the resulting text output is still different,
but on checking the differences it is actually an improvement. However
I suspect the changes could potentially break other PDFs.

If this patch causes problems, plan B is to change TextOutputDev to
ignore the glyphs with zero mapping when determining the text layout
(but still add these glyphs to the words to make text selection work
correctly). This should emulate the old behavior as closely as
possible.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/24

------------------------------------------------------------------------
On 2012-03-18T15:21:09+00:00 Adrian Johnson wrote:

Created attachment 58644
Enable displayed chars to map to any number of text chars

This is Jason's patch rebased so it applies on top of my first two
patches.
The patch order is:

1 - convert utf-16 to ucs-4 when reading ToUnicode
2 - move text string to unicode conversion into a separate function
3 - Enable displayed chars to map to any number of text chars
4 - fix selection of glyphs in actualtext
5 - fix regressions

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/25

------------------------------------------------------------------------
On 2012-03-19T15:30:08+00:00 Albert Astals Cid wrote:

With all the patches applied the pdftotext extraction of
https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459 changes
from

• Patches may be grouped with other patches to test the whole of a

to

• P atches may be grouped with other patches to test the whole of a

The correct extraction would be having no newline, but if we are going
to have a newline i very much prefer the old way than the new one that
breaks the word after the P

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/26

------------------------------------------------------------------------
On 2012-03-22T05:07:01+00:00 Adrian Johnson wrote:

Created attachment 58860
Don't reverse order of words with same xMin

This fixes the regression.
Output is now:

• Patches may be grouped with other patches to test the whole of a

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/27

------------------------------------------------------------------------
On 2012-03-22T15:29:49+00:00 Albert Astals Cid wrote:

There are still some problems, in the file that i will be attaching

• white X in patch b:

gets changed to

•w hite X in patch b:

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/28

------------------------------------------------------------------------
On 2012-03-22T15:32:29+00:00 Albert Astals Cid wrote:

Created attachment 58892
The pdf with the "white X in patch" problem

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/29

------------------------------------------------------------------------
On 2012-03-25T04:48:16+00:00 Adrian Johnson wrote:

Created attachment 59003
don't start a new word if the previous char is a control char

Problem is caused by a ^G that overlaps the first letter of the word.
This patch avoids starting a new word if a control character overlaps
another character. Although it would probably be better to strip out
control characters from the extracted text. Acroread does not include
the ^G in extracted text.

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/30

------------------------------------------------------------------------
On 2012-03-25T08:17:29+00:00 Albert Astals Cid wrote:

With this last patch now we get

P atch creation date

instead of

Patch creation date

again in https://www.libreoffice.org/bugzilla/attachment.cgi?id=41459

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/31

------------------------------------------------------------------------
On 2014-12-17T13:33:13+00:00 Jason Crain wrote:

*** Bug 87401 has been marked as a duplicate of this bug.
***

Reply at: https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/comments/36


** Changed in: poppler
       Status: Unknown => Confirmed

** Changed in: poppler
   Importance: Unknown => Medium

--
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to poppler in Ubuntu.
https://bugs.launchpad.net/bugs/808894

Title:
  Certain characters are not rendered correctly when selected
  (highlighted)

Status in Poppler:
  Confirmed
Status in poppler package in Ubuntu:
  Triaged

Bug description:
  1) lsb_release -rd
  Description: Ubuntu Vivid Vervet (development branch)
  Release: 15.04

  2) apt-cache policy evince
  evince:
    Installed: 3.14.1-0ubuntu1
    Candidate: 3.14.1-0ubuntu1
    Version table:
   *** 3.14.1-0ubuntu1 0
          500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
          100 /var/lib/dpkg/status

  3) What is expected to happen via
  https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/+attachment/2202502/+files/testfile.pdf
  is when one highlights the first three lines, it doesn't mis-highlight
  the words.

  What happens instead is certain letters are not visible as per
  https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/808894/+attachment/2202506/+files/screenshot.png .
  ProblemType: Bug
  DistroRelease: Ubuntu 11.04
  Package: evince 2.32.0-0ubuntu12.2
  ProcVersionSignature: Ubuntu 2.6.38-8.42-generic 2.6.38.2
  Uname: Linux 2.6.38-8-generic i686
  Architecture: i386
  CheckboxSubmission: 9e6554c36969a101b9e0e3075c8ffbe0
  CheckboxSystem: b8f3ec504801f13fc208edb5c785b099
  Date: Mon Jul 11 18:38:00 2011
  InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
  ProcEnviron:
   LANGUAGE=fr_FR:en
   LANG=fr_FR.UTF-8
   SHELL=/bin/bash
  ProcVersionSignature_: Ubuntu 2.6.38-8.42-generic 2.6.38.2
  SourcePackage: evince
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/poppler/+bug/808894/+subscriptions