Created attachment 114485
Combine base characters and diacritical marks

My attempt to improve this.
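For illustration only, here is a minimal Python sketch of the general idea behind the patch: when a spacing diacritic glyph is drawn over a neighbouring base glyph, replace the pair with the base character plus the matching combining mark. This is not the patch's code (Poppler is C++). The `Glyph` class, the `overlaps` test with its 50% threshold, and the small `SPACING_TO_COMBINING` table are all assumptions made up for this example, and only horizontal overlap is checked.

```python
import unicodedata
from dataclasses import dataclass

# Hypothetical table: spacing diacritics that may appear as separate glyphs,
# mapped to their Unicode combining equivalents.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u0060": "\u0300",  # GRAVE ACCENT -> COMBINING GRAVE ACCENT
}


@dataclass
class Glyph:
    """Hypothetical glyph record: one character and its horizontal extent."""
    char: str
    x_min: float
    x_max: float


def overlaps(a, b, min_fraction=0.5):
    """True if a and b share at least min_fraction of the narrower glyph's width."""
    overlap = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    narrower = min(a.x_max - a.x_min, b.x_max - b.x_min)
    return narrower > 0 and overlap >= min_fraction * narrower


def combine_diacritics(glyphs):
    """Merge overlapping base/diacritic glyph pairs, in either drawing order."""
    out = []  # list of (text, glyph) pairs; glyph is the base glyph of the entry
    for g in glyphs:
        if out:
            prev_text, prev_glyph = out[-1]
            if overlaps(prev_glyph, g):
                cur_is_mark = g.char in SPACING_TO_COMBINING
                prev_is_mark = prev_text in SPACING_TO_COMBINING
                if cur_is_mark and not prev_is_mark:
                    # Base drawn first, diacritic second.
                    seq = prev_text + SPACING_TO_COMBINING[g.char]
                    out[-1] = (unicodedata.normalize("NFC", seq), prev_glyph)
                    continue
                if prev_is_mark and not cur_is_mark:
                    # Diacritic drawn first, base second.
                    seq = g.char + SPACING_TO_COMBINING[prev_text]
                    out[-1] = (unicodedata.normalize("NFC", seq), g)
                    continue
        out.append((g.char, g))
    return "".join(text for text, _ in out)
```

With glyphs for ¨, u, b, e, r where the ¨ box lies inside the u box, `combine_diacritics` returns a string that a plain search for "über" will match. A diacritic that does not overlap its neighbour is left as a separate character.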
When you typeset a character with a diacritic in LaTeX, ü for example, the resulting PDF contains separate u and ¨ characters drawn on top of each other. This patch detects when this happens and converts the pair into a combining character sequence, so that pdftotext and the search function see a ü rather than two separate characters. It also refactors some code (TextWord::ensureCapacity and TextWord::setInitialBounds) to avoid duplication.

Limitations:

It doesn't handle some of LaTeX's diacritic commands, such as \b (bar under the letter) or \d (dot under the letter), because they are positioned differently and \d would be easy to confuse with a period. They don't seem to be used very often, though.

If the base character is unusual, such as a math symbol or a number, adding a combining character can make the output of pdftotext look a bit odd. I think this is because a font or rendering engine that doesn't know how to draw the character sequence will place the diacritic in a strange position, such as to the right of the letter. In these cases the output of pdftotext is technically correct; it just looks odd when drawn on screen.

When selecting text in evince, you can select the base character and the diacritic separately. If that's a problem, I think I could fix it by adding clustering support, so that a group of glyphs and characters is treated as a single unit. That would make this a much more invasive change, but maybe I should try it anyway. It would also be nice to fix the assumption that one glyph always corresponds to exactly one character.

-- 
You received this bug notification because you are a member of Desktop Packages, which is subscribed to poppler in Ubuntu.
https://bugs.launchpad.net/bugs/116453

Title:
  evince can not find ü in attached PDF

Status in Poppler:
  Confirmed
Status in poppler package in Ubuntu:
  Triaged

Bug description:
  Binary package hint: evince

  1) lsb_release -rd
  Description: Ubuntu Vivid Vervet (development branch)
  Release: 15.04

  2) apt-cache policy evince
  evince:
    Installed: 3.14.1-0ubuntu1
    Candidate: 3.14.1-0ubuntu1
    Version table:
   *** 3.14.1-0ubuntu1 0
          500 http://us.archive.ubuntu.com/ubuntu/ vivid/main amd64 Packages
          100 /var/lib/dpkg/status

  3) What is expected to happen with the attached document is when one searches for:
  über
  it is found:
  https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/116453/+attachment/102979/+files/example.pdf

  4) What happens instead is it does not return any matches.

  WORKAROUND: Use the built-in PDF viewer+search with chromium-browser or chrome (doesn't work in Firefox).

  apt-cache policy chromium-browser
  chromium-browser:
    Installed: 39.0.2171.65-0ubuntu0.14.04.1.1064
    Candidate: 39.0.2171.65-0ubuntu0.14.04.1.1064
    Version table:
   *** 39.0.2171.65-0ubuntu0.14.04.1.1064 0
          500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
          500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
          100 /var/lib/dpkg/status
       34.0.1847.116-0ubuntu2 0
          500 http://us.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

  apt-cache policy google-chrome-stable:i386
  google-chrome-stable:i386:
    Installed: 39.0.2171.95-1
    Candidate: 39.0.2171.95-1
    Version table:
   *** 39.0.2171.95-1 0
          500 http://dl.google.com/linux/chrome/deb/ stable/main i386 Packages
          100 /var/lib/dpkg/status

  ProblemType: Bug
  Architecture: i386
  Date: Wed May 23 18:22:27 2007
  DistroRelease: Ubuntu 7.04
  ExecutablePath: /usr/bin/evince
  Package: evince 0.8.1-0ubuntu1
  PackageArchitecture: i386
  ProcEnviron:
   LANGUAGE=en_US:en
   PATH=~/local/bin:~/local/lib:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  SourcePackage: evince
  Uname: Linux copper 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux

To manage notifications about this bug go to:
https://bugs.launchpad.net/poppler/+bug/116453/+subscriptions

-- 
Mailing list: https://launchpad.net/~desktop-packages
Post to     : desktop-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~desktop-packages
More help   : https://help.launchpad.net/ListHelp