[Bug 33288]
(In reply to comment #70)

(I originally replied on Launchpad, which is supposed to copy it through to here, but it hasn't.)

Carlos: it isn't a regression that lines outside the rectangle formed by the start and endpoints are included; it's the intent. Consider selecting in a document with two columns, starting in the 1st column 2/3 down the page, ending in the 2nd column 1/3 down the page. In this case, the correct selection consists entirely of lines that lie outside the rectangle formed by the start and endpoints (i.e., the bottom 1/3 of the 1st column and the top 1/3 of the 2nd column). You get situations like this even for single-column text; just choose start and end points vertically above each other.

The motivation for this patch was that text selection by rectangles is fundamentally wrong. The correct approach is to reconstruct the reading order of the text; then, from two points on the page, find the nearest insertion points (where an edit cursor would go); swap the insertion points if necessary; then return the characters between them.

The difficulties lie in inferring the reading order and in determining what 'nearest insertion point' means. Clicking inside a word, the nearest insertion point is obvious: it's the nearest character boundary. Click in a blank area, and it's less clear.

In Breuel's algorithms, which I used for determining reading order, there is something that helps here. There, line width is determined by expanding the line left and right to fit the column that contains it. So the line 'box' contains the initial indent if it is the first line of a paragraph, or the trailing space in the last line, or the ragged space for left- or right-justified text. Poppler doesn't have columns as such, but blocks instead, and as I recall the line boxes are the tight bounding boxes of the words contained in the line.
So we can try to determine the insertion point by looking for the nearest block (horizontally and vertically), then the nearest line (vertically ONLY, so that we ignore indents/ragged space), then the nearest character (horizontally). I mean these to be three separate comparisons, discarding block, line and character candidates at each stage, not some single distance you sum up. The upshot would be that clicking in blank areas of a line that lie within its block's bounding box - or even nearby - will choose that line, not the one above or below.

(It's been ages since I looked at the poppler code; I can't remember if this heuristic is what the patches do already.)

--
You received this bug notification because you are a member of Ubuntu Desktop Bugs, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/33288

Title:
  Evince doesn't handle columns properly

To manage notifications about this bug go to:
https://bugs.launchpad.net/poppler/+bug/33288/+subscriptions

--
desktop-bugs mailing list
desktop-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/desktop-bugs
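The three-stage lookup and the "swap, then return what lies between" selection described in the comment above can be sketched roughly as follows. This is an illustration only, not poppler code: the `Block` and `Line` classes, the helper names, the use of a summed horizontal-plus-vertical gap for the block stage, and the assumption that blocks and lines are already stored in reading order are all hypothetical choices made for the example. Coordinates assume y grows down the page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    y_min: float
    y_max: float
    chars: List[Tuple[float, float, str]]  # (x_min, x_max, glyph), left to right, non-empty


@dataclass
class Block:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    lines: List[Line]  # assumed to be in reading order already


def gap(v, lo, hi):
    """Distance from v to the interval [lo, hi]; 0 if v lies inside it."""
    return max(lo - v, 0.0, v - hi)


def nearest_insertion_point(blocks, x, y):
    """Return (block index, line index, character boundary index) for a click at (x, y)."""
    # Stage 1: nearest block, using both axes (summed gaps are an assumption).
    b = min(range(len(blocks)),
            key=lambda i: gap(x, blocks[i].x_min, blocks[i].x_max)
            + gap(y, blocks[i].y_min, blocks[i].y_max))
    # Stage 2: nearest line in that block, vertical distance only,
    # so indents and ragged space do not push the click to another line.
    lines = blocks[b].lines
    l = min(range(len(lines)), key=lambda i: gap(y, lines[i].y_min, lines[i].y_max))
    # Stage 3: nearest character boundary in that line, horizontal distance only.
    chars = lines[l].chars
    boundaries = [chars[0][0]] + [c[1] for c in chars]
    c = min(range(len(boundaries)), key=lambda i: abs(x - boundaries[i]))
    return (b, l, c)


def select(blocks, p, q):
    """Return the text between the insertion points nearest to points p and q."""
    start = nearest_insertion_point(blocks, *p)
    end = nearest_insertion_point(blocks, *q)
    if end < start:  # swap so that start precedes end in reading order
        start, end = end, start
    out = []
    for b in range(start[0], end[0] + 1):
        for l, line in enumerate(blocks[b].lines):
            if (b, l) < start[:2] or (b, l) > end[:2]:
                continue
            lo = start[2] if (b, l) == start[:2] else 0
            hi = end[2] if (b, l) == end[:2] else len(line.chars)
            out.append("".join(glyph for _, _, glyph in line.chars[lo:hi]))
    return "\n".join(out)


def make_line(x0, y0, text, w=10.0, h=10.0):
    """Build a line of fixed-size glyph boxes (made-up geometry for the demo)."""
    return Line(y0, y0 + h, [(x0 + i * w, x0 + (i + 1) * w, ch) for i, ch in enumerate(text)])


# A made-up two-column page: three lines per column.
left = Block(0, 0, 20, 30, [make_line(0, 0, "aa"), make_line(0, 10, "bb"), make_line(0, 20, "cc")])
right = Block(100, 0, 120, 30, [make_line(100, 0, "dd"), make_line(100, 10, "ee"), make_line(100, 20, "ff")])
page = [left, right]
```

With this layout, dragging from the bottom third of the left column to the top third of the right column selects the last line of the left column and the first line of the right column, which is the two-column case described in the comment. Because the three stages are separate comparisons, a click in the gutter level with the middle line still resolves to the end of that line rather than to a neighbour.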
[Bug 33288]
Created attachment 40124
patch without extraneous whitespace changes

Oops! OK, here's the patch without the whitespace changes.
[Bug 33288]
Created attachment 40061
improved patch

Had another look and tidied the code a bit, removing repeated page orientation checks and a redundant test for overlap in rule (2). This is noticeably faster at rendering the bus map (down to ~14.8s).
[Bug 33288]
Created attachment 40056
patch to fix performance regression

Here's what I've got so far. On my very slow VM, this renders the Paris bus map reported on the mailing list in ~15.2s, compared to ~60s without the patch. YX-sorting alone got the time down to ~16.2s. Rendering of other documents is as fast as ever.

Almost all of the time spent rendering the bus map is spent before the sort, so there must still be some quadratic algorithms in there unrelated to reading order.

There is one obvious fix on my list that I didn't implement (track the first unvisited block, start loops there), but I don't think this will make much difference for the effort it requires.

I'll be offline until Monday 8 Nov, but I'd be grateful if some more eyes could look at this to make sure I haven't regressed anything.
[Bug 33288]
It should be possible to improve this substantially. When I wrote the patch I was being very conservative with the existing poppler data structures, so essentially that method is traversing an unordered list. If the block list were in isBeforeByRule1 order, most of those comparisons would go away. I can't remember if this would break clients wanting access to the text in physical order; it's been a while since I looked at the code and I'm reading this on a phone. I can take a deeper look tomorrow.
[Bug 33288] Re: Evince doesn't handle columns properly
Mark: what's going on there is that the bounding boxes of the (1) (2) (3) ... text regions fit tightly around the text you can see, and when the mouse is between them, it's not over any text region at all. Hence, the best guess of what you were trying to select is the nearest text region (by Manhattan distance), which will be the one above or below the current line, because these columns are very short and widely spaced.

When the mouse is close to what you intended to select, and not in the middle of blank space, this usually does the right thing. Acrobat makes the same guess.
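To make the "nearest text region by Manhattan distance" guess concrete, here is a minimal sketch. The `Rect` type, the function names and the coordinates are invented for illustration and are not taken from poppler or from the document Mark reported; y is assumed to grow down the page.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def manhattan_distance(r: Rect, x: float, y: float) -> float:
    """Manhattan distance from the point (x, y) to rectangle r; 0 if the point is inside."""
    dx = max(r.x_min - x, 0.0, x - r.x_max)
    dy = max(r.y_min - y, 0.0, y - r.y_max)
    return dx + dy


def nearest_region(regions: Sequence[Rect], x: float, y: float) -> int:
    """Index of the region closest to (x, y); ties go to the earliest region."""
    return min(range(len(regions)), key=lambda i: manhattan_distance(regions[i], x, y))


# Made-up layout: a wide line of text, with two short, widely spaced
# regions such as "(1)" and "(2)" on the row just below it.
regions = [
    Rect(0, 85, 150, 95),     # 0: wide line above
    Rect(0, 100, 20, 110),    # 1: "(1)"
    Rect(200, 100, 220, 110), # 2: "(2)"
]
```

With the mouse at (100, 105), level with "(1)" and "(2)" but in the blank space between them, the wide line above is only 10 units away while "(1)" is 80 units away, so the line above wins. At (25, 105), right next to "(1)", the guess lands on "(1)" as intended.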
[Bug 33288] Re: Evince doesn't handle columns properly
João - actually that's quite interesting - this side effect in Okular is something I hadn't seen. Going back through my mails I see Carlos told me: "Okular doesn't use TextOutputDev for selecting, but it does for extracting the text, so it will be affected anyway." This is what you're seeing - Okular's selection algorithm wedded to my text extraction order.

Is it at least WYSIWYG, in that the text selected (and only the text selected) appears in the extract? If that's the case, then at least it's not broken anything; it may even be a slight improvement, as the old 'preserve layout' extracts were a mess.
[Bug 33288] Re: Evince doesn't handle columns properly
Carlos has just pushed the patches upstream, but is leaving the bug open for future work (so the status of the upstream bug 3188 won't appear to change). Hopefully this improves the chances of them making it into Lucid.
[Bug 33288] Re: Evince doesn't handle columns properly
@tpgraveen "any known regressions". What we're testing here is a heuristic for what users want; there is no right answer. So yes, in some cases the selection is "worse" (a specific example is bullet lists that have come from TeX: they leave a very wide gap before the list text, and poppler thinks this is a column break).

From what I've seen, for most single-column documents it's the same or slightly worse, but for most multi-column docs it's a massive improvement. It's also much more consistent than before; you don't get the 'jumpy-selection' effect. So, far from perfect but 'mostly better'.

I'm not aware of any remaining regressions from memory errors, crashes, etc.
[Bug 33288] Re: Evince doesn't handle columns properly
Yes, I have had no time to work on/review this for a while now, for a number of reasons. My calendar clears up a bit after Easter, and I'm hoping to pick it up again then.

However, it's worth pointing out that the patch as it stands is about as far as I want to push the current API (which is based on selecting a 'region', and then under the hood I made it select a run of text). What I'd started on when I ran out of time was a different approach:

* support tagged PDF first, with 'guessed' text order as a fallback (which will give much better results on tagged PDF, obviously). This would be ridiculously hard in the current TextDevice; the structures don't match at all.

* use a caret-less version of the AT-SPI AccessibleText interface (http://library.gnome.org/devel/at-spi-cspi/unstable/at-spi-cspi-AccessibleText-Interface.html). This simplifies a lot of the code I wrote, since you don't have to /repeatedly/ figure out the text order at each level of the block hierarchy. This should also make it easier to hook in accessibility efforts on top of poppler.

* implement some automated testing of text extraction via the command line; changing how reading order is guessed always introduces regressions, so any time I do anything right now I need to check it out in a dozen or so documents. It's needlessly painful. This adds a command line tool to exercise the AccessibleText-style interface, so I don't need to patch evince to make progress, and don't break the existing tools.

* All of this means a new Device class in poppler, since as well as completely replacing the internals, the interface is completely different, and hence evince needs to change as well to integrate it. That's not going to happen quickly, I suspect.

I mention all this to make it clear that I'm not actively developing the old patch any more; I've moved on to looking at a long-term solution. I'm of the opinion that the old patch is a big improvement, but I'm not getting code review feedback upstream either (I know it doesn't help when I can only work on this from time to time).
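The automated check mentioned in the list above (re-extract a set of documents and compare against known-good text) could look something like the sketch below. It is not the command line tool described in the message, which is not shown in this thread; as a stand-in it calls poppler's existing `pdftotext` utility, writing to stdout via `-`. The function names, the dictionary of cases and the whitespace-insensitive comparison are assumptions made for the example.

```python
import subprocess
from typing import Callable, Dict, List


def pdftotext_extract(pdf_path: str) -> str:
    """Extract text with poppler's pdftotext, sending the output to stdout ('-')."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_regressions(cases: Dict[str, str],
                     extract: Callable[[str], str] = pdftotext_extract) -> List[str]:
    """Return the PDFs whose extracted word sequence differs from the expected one.

    Comparing word sequences (rather than raw strings) means a change in
    reading order is caught, while differences in spacing or line breaks
    are ignored. That trade-off is an assumption of this sketch.
    """
    failures = []
    for pdf_path, expected in cases.items():
        if extract(pdf_path).split() != expected.split():
            failures.append(pdf_path)
    return failures
```

In use, `cases` would map each test PDF to a hand-checked text file's contents, and a non-empty return value would flag the documents to inspect by hand. Passing a different `extract` callable lets the same check run against another extraction tool.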
[Bug 33288] Re: Evince doesn't handle columns properly
Wouter: I'll see what I can do, but it might be a while. I've never built a PPA either (and had to look that up). I recently gave up on using Gnome Dev Kit (Foresight Linux) for an Ubuntu 9 VM, so who knows, there may be tools on that to help me.

However, I don't have much time to work on this, and what I have might be better spent nailing issues I already know about... which is a never-ending task. Now that I have (some) RTL support in there, I need to worry about page rotation with RTL, bidi text, etc.
[Bug 33288] Re: Evince doesn't handle columns properly
I've uploaded an updated patch series to https://bugs.freedesktop.org/show_bug.cgi?id=3188 , with corrections to selection and reading order. Those of you who can apply these and rebuild evince might want to give this a go? Comments over there please!

For me it fixes up selection for most (but not all) of the documents on the various dupes of this bug, including the one at comment #7. If you're going to tell me about a document it doesn't work on, don't paste a link; upload the PDF, please (half of the reported-buggy docs are dead links now).

Caveats: doesn't cover RTL or documents with rotated text.
[Bug 346403] Re: In Evince, copy/paste from .pdf files reorders data
*** This bug is a duplicate of bug 33288 ***
    https://bugs.launchpad.net/bugs/33288

David - I believe this bug is the one you mentioned on bug 33288. Yes, this is not quite a duplicate, but hopefully fixing that will help here too.

The bug in 33288 is caused because the heuristic used to detect reading order is poor (among other things). This bug, however, appears to be caused by rounding issues. The current 'copy & paste' code tries to preserve a tabular layout it infers from the original PDF, and will move text around to do that, in particular to avoid overlapping blocks of text. Note that it does this on all text, whether or not we'd see it as a table - PDF only contains coordinates of chunks of text, not their interrelationships.

In this case, it appears that there is whitespace in the text to the right of each number, and that each number, with its whitespace, forms a separate block of text. It seems likely to me that the right edge of the whitespace is coincident with the left edge of the last number. If poppler sees an overlap here, e.g. because of a rounding error, it would cause the bug - but only because it then tries to preserve the layout. So this is a peculiarity of the way rectangular selection works in poppler, and is possibly fixable.

However, if 33288 is fixed, the selection code will work entirely differently. It *may* fix this bug, but it may make it worse. If these non-columns are detected as being columns (because the whitespace between them is wide enough), then the fixed poppler will select down the column first. If the whitespace is narrow enough, though, it will treat these as lines of text and you will get the result you expected. Numbers will not be shifted up and down, because poppler will no longer be trying to preserve a tabular layout.
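The suspected failure mode, two blocks whose edges should coincide but appear to overlap after rounding, can be shown in isolation. The sketch below is not poppler's overlap test, and the comment above only suspects rounding as the cause; the function names and the `eps` tolerance are hypothetical.

```python
def overlaps_strict(a_min: float, a_max: float, b_min: float, b_max: float) -> bool:
    """True if the intervals [a_min, a_max] and [b_min, b_max] share any interior."""
    return a_min < b_max and b_min < a_max


def overlaps_with_tolerance(a_min: float, a_max: float, b_min: float, b_max: float,
                            eps: float = 1e-6) -> bool:
    """Same test, but overlaps narrower than eps are treated as touching edges."""
    return a_min < b_max - eps and b_min < a_max - eps


# Block A's right edge is computed as 0.1 + 0.2, which in binary floating
# point is slightly more than 0.3. Block B starts at exactly 0.3.
a = (0.0, 0.1 + 0.2)
b = (0.3, 1.0)
```

Here `overlaps_strict(*a, *b)` reports an overlap even though the blocks were meant to abut, while `overlaps_with_tolerance(*a, *b)` does not. Layout-preserving code that shifts text to avoid overlaps would only be triggered in the first case.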
[Bug 33288] Re: Evince doesn't handle columns properly
David #19: you say "it perhaps recognizes column stuff from the display layout instead of the internal representation." In PDF, the internal representation *is* just the display layout.

Internally, poppler tries to divide this text into blocks (roughly paragraphs), which are then grouped into columns based on spacing, and independently into 'flows' (roughly, sequences of similar blocks in reading order), based on a bunch of heuristics. This is already tricky, but is made more complicated by text rotation and different writing systems (vertical, right to left, etc).

Acrobat and Apple's Preview use different heuristics, so they group text differently, and make a mess of things on different documents - but they still make a mess of things.

Just explaining what's going on here; this isn't to say that text selection can't be improved. I'm slowly putting together a patch based on the reading order sort described in http://pubs.iupr.org/#2003-breuel-sdiut , which seems to be fixing some of the problems with the attachment in #7. However, as I said to Andres, I have no idea when or if my patches would be accepted.
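For readers unfamiliar with the reading order sort referred to above, here is a rough sketch of the idea as I understand it from Breuel's paper: two pairwise "comes before" rules define a partial order, which is then topologically sorted. This is not the patch and not poppler code. The `Box` type, the function names and the exact edge conditions (for example, what counts as "between" in rule 2) are assumptions, and y is assumed to grow down the page.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def x_overlaps(a: Box, b: Box) -> bool:
    return a.x_min < b.x_max and b.x_min < a.x_max


def separates(c: Box, a: Box, b: Box) -> bool:
    """True if c sits within the vertical extent of a and b and overlaps both horizontally."""
    if c.y_max < min(a.y_min, b.y_min) or c.y_min > max(a.y_max, b.y_max):
        return False
    return x_overlaps(c, a) and x_overlaps(c, b)


def comes_before(a: Box, b: Box, boxes: List[Box]) -> bool:
    # Rule 1: x-ranges overlap and a is above b.
    if x_overlaps(a, b) and a.y_min < b.y_min:
        return True
    # Rule 2: a is entirely left of b and no other box separates them.
    if a.x_max <= b.x_min:
        return not any(separates(c, a, b) for c in boxes if c is not a and c is not b)
    return False


def reading_order(boxes: List[Box]) -> List[Box]:
    """Topologically sort boxes by the two rules (naive, roughly cubic in len(boxes))."""
    n = len(boxes)
    succ = [[j for j in range(n) if j != i and comes_before(boxes[i], boxes[j], boxes)]
            for i in range(n)]
    indegree = [0] * n
    for i in range(n):
        for j in succ[i]:
            indegree[j] += 1
    ordered, done = [], [False] * n
    for _ in range(n):
        ready = [i for i in range(n) if not done[i] and indegree[i] == 0]
        if not ready:
            raise ValueError("ordering rules produced a cycle for this layout")
        i = ready[0]  # ties are broken by input order
        done[i] = True
        ordered.append(boxes[i])
        for j in succ[i]:
            indegree[j] -= 1
    return ordered


# Made-up page: two columns, a full-width heading, then two more columns.
page = [
    Box("R2", 110, 130, 200, 200),
    Box("H", 0, 110, 200, 120),
    Box("L1", 0, 20, 90, 100),
    Box("R1", 110, 20, 200, 100),
    Box("L2", 0, 130, 90, 200),
]
```

On this page the full-width heading separates the upper columns from the lower ones, so the sort reads down the upper left column, then the upper right, then the heading, then the lower pair, rather than running straight down the left-hand side of the page. The pairwise comparison over every box is also why a naive version of this approach slows down on pages with very many text blocks.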