[Bug 33288]

2014-02-21 Thread Brian Ewins
(In reply to comment #70)

(I originally replied on launchpad, which is supposed to copy it through
to here, but it hasn't.)

Carlos: it isn't a regression that lines outside a rectangle formed by
the start and endpoints are included, it's the intent.

Consider selecting in a document with two columns, starting in the 1st
column 2/3 down the page, ending in the 2nd column 1/3 down the page. In
this case, the correct selection consists entirely of lines that lie
outside the rectangle formed by the start and endpoints (ie, the bottom
1/3 of the 1st column and the top 1/3 of the 2nd column).

You get situations like this even for single column text; just choose
start and end points vertically above each other.

The motivation for this patch was that text selection by rectangles is
fundamentally wrong. The correct approach is to reconstruct the reading
order of text; then from two points on the page, find the nearest
insertion points (where an edit cursor would go); swap the insertion
points if necessary; then return the characters between them. The
difficulties lie in inferring the reading order, and determining what
'nearest insertion point' means.
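
A minimal sketch of that approach, assuming the reading order has already been reconstructed and is simply handed in as a list. The `Glyph` type, the function names and the squared-distance metric are all illustrative assumptions, not poppler code; the metric in particular is only a placeholder for whatever 'nearest' ends up meaning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float  # centre of the glyph, page coordinates
    y: float  # y grows down the page


def nearest_insertion_point(glyphs, px, py):
    """Return an index 0..len(glyphs) into the reading-order list.

    Placeholder metric: plain squared distance to the glyph centre.
    """
    i = min(range(len(glyphs)),
            key=lambda k: (glyphs[k].x - px) ** 2 + (glyphs[k].y - py) ** 2)
    # Left of the centre means "before this glyph", otherwise "after it".
    return i if px < glyphs[i].x else i + 1


def select(glyphs, start, end):
    a = nearest_insertion_point(glyphs, *start)
    b = nearest_insertion_point(glyphs, *end)
    if a > b:  # the drag went backwards in reading order
        a, b = b, a
    return "".join(g.char for g in glyphs[a:b])


# Two columns of three one-glyph "lines", already in reading order.
page = [Glyph("a", 0, 0), Glyph("b", 0, 1), Glyph("c", 0, 2),
        Glyph("d", 10, 0), Glyph("e", 10, 1), Glyph("f", 10, 2)]

# Start low in column 1, end high in column 2.
print(select(page, (-0.4, 1.6), (10.4, 0.4)))  # prints "cd"
```

The rectangle between the two points only covers "b" and "e", yet the selection is "cd": the end of column 1 followed by the start of column 2, which is the two-column case described above in miniature.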

Clicking inside a word, the nearest insertion point is obvious; it's the
nearest character boundary. Click in a blank area, and it's less clear.
In Breuel's algorithms that I used for determining reading order, there
is something that helps here. There, line width is determined by
expanding the line left and right to fit the column it contains. So the
line 'box' contains the initial indent if it is the first line of a
paragraph, or the trailing space in the last line; or the ragged space
for left- or right- justified text.

Poppler doesn't have columns as such, but blocks instead, and as I
recall the line boxes are the tight bounding box of the words contained
in the line. So we can try to determine insertion point by looking for
the nearest block (horizontally and vertically), then the nearest line
(vertically ONLY, so that we ignore indents/ragged space), then nearest
character (horizontally). I mean these to be three different
comparisons, discarding block, line and character candidates at each
stage, not some single distance you sum up. The upshot would be that
clicking in blank areas of a line that lie within its block's bounding
box - or even nearby - will choose that line, not the one above or
below.
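
The three-stage comparison can be sketched roughly as follows. This is an illustration under assumptions, not what the patches do: `Block`, `Line`, `gap` and `insertion_point` are made-up names, boxes are plain (x0, y0, x1, y1) tuples, and stage 1 adds the horizontal and vertical gaps together as one possible reading of "horizontally and vertically".

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x0, y0, x1, y1; y grows downward


def gap(lo, hi, p):
    """Distance from p to the interval [lo, hi]; zero when p is inside it."""
    return max(lo - p, 0.0, p - hi)


@dataclass
class Line:
    box: Box                 # tight box around the words in the line
    boundaries: List[float]  # x position of every character boundary


@dataclass
class Block:
    box: Box
    lines: List[Line]


def insertion_point(blocks, px, py):
    # Stage 1: nearest block, horizontally and vertically.
    block = min(blocks, key=lambda b: gap(b.box[0], b.box[2], px)
                + gap(b.box[1], b.box[3], py))
    # Stage 2: nearest line, vertically only (indents are ignored).
    line = min(block.lines, key=lambda l: gap(l.box[1], l.box[3], py))
    # Stage 3: nearest character boundary, horizontally only.
    index = min(range(len(line.boundaries)),
                key=lambda i: abs(line.boundaries[i] - px))
    return block, line, index


# One paragraph: indented first line, full middle line, short last line.
para = Block((0, 0, 100, 30), [
    Line((20, 0, 100, 10), [20, 60, 100]),
    Line((0, 10, 100, 20), [0, 50, 100]),
    Line((0, 20, 40, 30), [0, 20, 40]),
])

_, line, index = insertion_point([para], 5, 4)  # click inside the indent
print(para.lines.index(line), index)  # prints "0 0"
```

A click at (5, 4) sits in the indent of the first line. Measured against the tight line boxes in both directions it would be closer to the second line (6 units below versus 15 units to the right), but the vertical-only comparison keeps it on the first line, at the boundary before its first character.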

(It's been ages since I looked at the poppler code, I can't remember if
this heuristic is what the patches do already)

-- 
You received this bug notification because you are a member of Ubuntu
Desktop Bugs, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/33288

Title:
  Evince doesn't handle columns properly

To manage notifications about this bug go to:
https://bugs.launchpad.net/poppler/+bug/33288/+subscriptions

-- 
desktop-bugs mailing list
desktop-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/desktop-bugs


[Bug 33288] Re: Evince doesn't handle columns properly

2013-12-01 Thread Brian Ewins
(oops, replied on launchpad, not sure if Carlos reads there. Repeating
for fdo)

Carlos: it isn't a regression that lines outside a rectangle formed by
the start and endpoints are included, it's the intent.

Consider selecting in a document with two columns, starting in the 1st
column 2/3 down the page, ending in the 2nd column 1/3 down the page. In
this case, the correct selection consists entirely of lines that lie
outside the rectangle formed by the start and endpoints (ie, the bottom
1/3 of the 1st column and the top 1/3 of the 2nd column).

The motivation for this patch was that text selection by rectangles is
fundamentally wrong. The correct approach is to reconstruct the reading
order of text; then from two points on the page, find the nearest
insertion points (where an edit cursor would go); swap the insertion
points if necessary; then return the characters between them. The
difficulties lie in inferring the reading order, and determining what
'nearest insertion point' means.

Clicking inside a word, the nearest insertion point is obvious; it's the
nearest character boundary. Click in a blank area, and it's less clear.
In Breuel's algorithms that I used for determining reading order, there
is something that helps here. There, line width is determined by
expanding the line left and right to fit the column it contains. So the
line 'box' contains the initial indent if it is the first line of a
paragraph, or the trailing space in the last line; or the ragged space
for left- or right- justified text.

Poppler doesn't have columns as such, but blocks instead, and as I
recall the line boxes are the tight bounding box of the words contained
in the line. So we can try to determine insertion point by looking for
the nearest block (horizontally and vertically), then the nearest line
(vertically ONLY, so that we ignore indents/ragged space), then nearest
character (horizontally). I mean these to be three different
comparisons, discarding block, line and character candidates at each
stage, not some single distance you sum up. The upshot would be that
clicking in blank areas of a line that lie within its block's bounding
box - or even nearby - will choose that line, not the one above or
below.

(It's been ages since I looked at the poppler code, I can't remember if
this heuristic is what the patches do already)



[Bug 33288]

2012-09-10 Thread Brian Ewins
Created attachment 40124
patch without extraneous whitespace changes

Oops! Ok, here's the patch without the whitespace changes.



[Bug 33288]

2012-09-10 Thread Brian Ewins
Created attachment 40061
improved patch

Had another look and tidied the code a bit, removing repeated page
orientation checks and a redundant test for overlap in rule(2). This is
noticeably faster at rendering the bus map (down to ~14.8s).



[Bug 33288]

2012-09-10 Thread Brian Ewins
Created attachment 40056
patch to fix performance regression

Here's what I've got so far. On my very slow VM, this renders the Paris
bus map reported on the mailing list in ~15.2s, compared to ~60s without
the patch. YX-sorting alone got the time down to ~16.2s. Rendering on
other documents is as fast as ever.

Almost all of the time rendering the bus map is prior to the sort, so
there must still be some quadratic algorithms in there unrelated to
reading order. There is one obvious fix on my list that I didn't
implement (track the first unvisited block, start loops there) but I
don't think this will make much difference for the effort it requires.
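
For readers who want a picture of the two ideas mentioned here, a rough sketch follows. It is not taken from the patch: blocks are reduced to (x, y) origins, and `yx_sorted`, `visit_all` and the `pick_next` callback are hypothetical names used only to show a YX-sort and a "first unvisited" cursor.

```python
def yx_sorted(blocks):
    """Sort (x, y) block origins top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


def visit_all(blocks, pick_next):
    """Visit every block once, in whatever order pick_next chooses.

    pick_next(blocks, visited, first) is a hypothetical callback that must
    return the index of an unvisited block; it never needs to look at
    indices below `first`, because those are all visited already.
    """
    visited = [False] * len(blocks)
    order = []
    first = 0  # index of the first unvisited block
    while first < len(blocks):
        i = pick_next(blocks, visited, first)
        visited[i] = True
        order.append(i)
        # Advance the cursor past everything that has been visited.
        while first < len(blocks) and visited[first]:
            first += 1
    return order


def topmost_unvisited(blocks, visited, first):
    # On a YX-sorted list the first unvisited block is also the topmost one.
    return first


blocks = yx_sorted([(50, 20), (0, 0), (50, 0), (0, 20)])
print(blocks)                                # [(0, 0), (50, 0), (0, 20), (50, 20)]
print(visit_all(blocks, topmost_unvisited))  # [0, 1, 2, 3]
```

The cursor only saves the rescans of the already-visited prefix, which fits the guess above that it would not buy much on its own.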

I'll be offline until Monday 8 Nov, but I'd be grateful if some more
eyes could look at this to make sure I haven't regressed anything.



[Bug 33288]

2012-09-10 Thread Brian Ewins
It should be possible to improve this substantially. When I wrote the
patch I was being very conservative with the existing poppler data
structures, so essentially that method is traversing an unordered list.
If the block list was in isBeforeByRule1 order most of those comparisons
would go away. I can't remember if this would break clients wanting
access to the text in physical order; it's been a while since I looked at
the code and I'm reading this on a phone. Can take a deeper look
tomorrow.



[Bug 33288] Re: Evince doesn't handle columns properly

2010-05-07 Thread Brian Ewins
Mark: what's going on there is that the bounding boxes of the (1) (2)
(3) ... text regions are tightly around the text you can see, and when
the mouse is between them, it's not over any text region at all. Hence,
the best guess of what you were trying to select is the nearest text
region (by Manhattan distance), which will be the one above or below the
current line, because these columns are very short and widely spaced.
When the mouse is close to what you intended to select, and not in the
middle of blank space, this usually does the right thing. Acrobat makes
the same guess.
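
To make the Manhattan-distance guess concrete, here is a small sketch. The box layout and the function names are invented for illustration; regions are (x0, y0, x1, y1) tuples with y growing downward.

```python
def manhattan_to_box(px, py, box):
    """Manhattan distance from a point to a box; zero inside the box."""
    x0, y0, x1, y1 = box
    dx = max(x0 - px, 0.0, px - x1)
    dy = max(y0 - py, 0.0, py - y1)
    return dx + dy


def nearest_region(px, py, boxes):
    """Index of the box closest to the point."""
    return min(range(len(boxes)),
               key=lambda i: manhattan_to_box(px, py, boxes[i]))


regions = [
    (0, -14, 120, -4),  # 0: a full-width line of text above
    (0, 0, 20, 10),     # 1: short item on the left
    (100, 0, 120, 10),  # 2: short item on the right
]

print(nearest_region(60, 2, regions))  # prints 0
print(nearest_region(25, 5, regions))  # prints 1
```

With the mouse midway between the two short items, each of them is 40 units away while the line above is only 6 away, so the line above wins. Close to the left item (5 units away) the guess is the item itself.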



[Bug 33288] Re: Evince doesn't handle columns properly

2010-04-21 Thread Brian Ewins
João - actually that's quite interesting - this side effect in Okular is
something I hadn't seen. Going back through my mails I see Carlos told
me:

"Okular doesn't use TextOutputDev for selecting, but it does for
extracting the text, so it will be affected anyway."

This is what you're seeing - Okular's selection algorithm wedded to my
text extraction order. Is it at least WYSIWYG, in that the text selected
(and only the text selected) appears in the extract?

If that's the case, then at least it's not broken anything; it may even
be a slight improvement, the old 'preserve layout' extracts were a mess.


[Bug 33288] Re: Evince doesn't handle columns properly

2010-04-19 Thread Brian Ewins
Carlos has just pushed the patches upstream, but is leaving the bug open
for future work (so the status of the upstream bug 3188 won't appear to
change). Hopefully this improves the chances of them making it into
Lucid.



[Bug 33288] Re: Evince doesn't handle columns properly

2010-03-17 Thread Brian Ewins
@tpgraveen "any known regressions". What we're testing here is a
heuristic for what users want, there is no right answer. So yes, in some
cases the selection is "worse" (a specific example is dealing with
bullet lists that have come from TeX, they leave a very wide gap before
the list text and poppler thinks this is a column break). From what I've
seen, for most single-column documents it's the same or slightly worse,
but for most multi-column docs, it's a massive improvement. It's also
much more consistent than before, you don't get the 'jumpy-selection'
effect. So, far from perfect but 'mostly better'. I'm not aware of any
remaining regressions from memory errors, crashes etc.



[Bug 33288] Re: Evince doesn't handle columns properly

2010-03-17 Thread Brian Ewins
Yes, I have had no time to work on/review this for a while now, for a
number of reasons. My calendar clears up a bit after Easter, hoping to
pick it up again then. However, it's worth pointing out that the patch
as it stands is about as far as I want to push the current api (which
is based on selecting a 'region', and then under the hood I made it
select a run of text). What I'd started on when I ran out of time was a
different approach:

* support tagged PDF first, with 'guessed' text order as a fallback
(which will give much better results on tagged PDF, obviously). This
would be ridiculously hard in the current TextDevice; the structures
don't match at all.

* use a caret-less version of the AT-SPI AccessibleText interface
(http://library.gnome.org/devel/at-spi-cspi/unstable/at-spi-cspi-AccessibleText-Interface.html).
This simplifies a lot of the code I
wrote, since you don't have to /repeatedly/ figure out the text order at
each level of the block hierarchy. This should also make it easier to
hook in accessibility efforts on top of poppler.

* implement some automated testing of text extraction via the command
line; changing how reading order is guessed always introduces
regressions, so any time I do anything right now I need to check it out
in a dozen or so documents. It's needlessly painful. This added a command
line tool to exercise the AccessibleText-style interface, so I don't
need to patch evince to make progress, and don't break the existing
tools.

* All of this means a new Device class in poppler, since as well as
completely replacing the internals, the interface is completely
different, and hence evince needs to change as well to integrate it.
That's not going to happen quickly, I suspect.
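
The automated-testing point in the list above could look something like this sketch: run an extractor over a set of documents and diff the result against text that was checked by hand. The `extract` callable stands in for whatever command line tool is used; it, the document names and `check_extraction` are hypothetical.

```python
import difflib


def check_extraction(extract, golden):
    """Run `extract` on every document and diff against the expected text.

    extract: callable taking a document name and returning its text.
    golden:  dict mapping document name -> expected text.
    Returns a dict of document name -> unified diff for every mismatch.
    """
    failures = {}
    for name, expected in golden.items():
        actual = extract(name)
        if actual != expected:
            diff = difflib.unified_diff(expected.splitlines(),
                                        actual.splitlines(),
                                        "expected", "actual", lineterm="")
            failures[name] = "\n".join(diff)
    return failures


# Stand-in extractor: a dict lookup instead of a real command line tool.
fake_output = {"two-column.pdf": "left one\nleft two\nright one\n",
               "bullets.pdf": "* item\ntext\n"}
golden = {"two-column.pdf": "left one\nleft two\nright one\n",
          "bullets.pdf": "* item text\n"}

failures = check_extraction(fake_output.get, golden)
print(sorted(failures))  # prints ['bullets.pdf']
```

Keeping the expected text per document means a change to the reading-order guess shows up as a diff instead of needing a dozen documents opened by hand.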

I mention all this to make it clear that I'm not actively developing the
old patch any more, I've moved on to looking at a long-term solution.
I'm of the opinion that the old patch is a big improvement, but I'm not
getting code review feedback upstream either (I know it doesn't help
when I can only work on this from time to time)



[Bug 33288] Re: Evince doesn't handle columns properly

2009-11-23 Thread Brian Ewins
Wouter: I'll see what I can do, but it might be a while. I've never
built a PPA either (and had to look that up). I recently gave up on
using Gnome Dev Kit (Foresight Linux) for an Ubuntu 9 VM, so who knows,
there may be tools on that to help me. However, I don't have much time
to work on this and what I have might be better spent nailing issues I
already know about...which is a neverending task. Now that I have (some)
RTL support in there I need to worry about page rotation with RTL, bidi
text, etc.



[Bug 33288] Re: Evince doesn't handle columns properly

2009-11-13 Thread Brian Ewins
I've uploaded an updated patch series to
https://bugs.freedesktop.org/show_bug.cgi?id=3188 , with corrections to
selection and reading order. Those of you who can apply these and
rebuild evince might want to give this a go? Comments over there please!
For me it fixes up selection for most (but not all) of the documents on
the various dupes of this bug, including the one at comment #7. If
you're going to tell me about a document it doesn't work on, don't paste
a link, upload the pdf, please (half of the reported-buggy docs are dead
links now).

Caveats: doesn't cover RTL or documents with rotated text.



[Bug 346403] Re: In Evince, copy/paste from .pdf files reorders data

2009-11-11 Thread Brian Ewins
*** This bug is a duplicate of bug 33288 ***
https://bugs.launchpad.net/bugs/33288

David - I believe this bug is the one you mentioned on bug 33288. Yes,
this is not quite a duplicate, but hopefully fixing that will help here
too.

The bug in 33288 is caused because the heuristic used to detect reading
order is poor (among other things). This bug, however, appears to be
caused by rounding issues. The current 'copy & paste' code tries to
preserve a tabular layout it infers from the original pdf, and will move
text around to do that, in particular to avoid overlapping blocks of
text. Note that it does this on all text, whether or not we'd see it as
a table - PDF only contains coordinates of chunks of text, not their
interrelationships.

In this case, it appears that there is whitespace in the text to the
right of each number, and that each number, with its whitespace, forms a
separate block of text. It seems likely to me that the right edge of the
whitespace is coincident with the left edge of the last number. If
poppler sees an overlap here, eg because of a rounding error, it would
cause the bug - but only because it then tries to preserve the layout.
So this is a peculiarity of the way rectangular selection works in
poppler, and is possibly fixable.
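
A small illustration of how a rounding error can turn two touching edges into an overlap. This is a sketch only: the `overlaps` helper and its `eps` tolerance are assumptions for the example, and nothing here says poppler tests overlap this way.

```python
def overlaps(a, b, eps=0.0):
    """True if x-intervals a and b, each (x0, x1), overlap by more than eps."""
    return min(a[1], b[1]) - max(a[0], b[0]) > eps


# The whitespace block should end exactly where the next number starts,
# but its right edge was computed as 0.1 + 0.2 instead of 0.3.
whitespace = (0.0, 0.1 + 0.2)
number = (0.3, 0.5)

print(overlaps(whitespace, number))            # prints True
print(overlaps(whitespace, number, eps=1e-6))  # prints False
```

In double precision 0.1 + 0.2 is slightly larger than 0.3, so the strict test reports an overlap between edges that are meant to coincide; allowing a small tolerance makes it go away.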

However, if 33288 is fixed, the selection code will work entirely
differently. It *may* fix this bug, but it may make it worse. If these
non-columns are detected as being columns (because the whitespace
between them is wide enough), then fixed poppler will select down the
column first. If the whitespace is narrow enough, though, it will treat
these as lines of text and you will get the result you expected. Numbers
will not be shifted up and down, because poppler will no longer be
trying to preserve a tabular layout.
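
The effect of the whitespace width can be sketched like this. The `max_gap` threshold, the cell layout and both function names are assumptions made for the example; the real column detection has more heuristics than a single gap test.

```python
def split_on_gaps(spans, max_gap):
    """Group (x0, x1) spans; a gap wider than max_gap starts a new column."""
    groups = []
    for span in sorted(spans):
        if groups and span[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(span)
        else:
            groups.append([span])
    return groups


def reading_order(cells, max_gap):
    """cells are (x0, x1, row, text); returns the texts in reading order."""
    columns = split_on_gaps({(c[0], c[1]) for c in cells}, max_gap)

    def column_of(cell):
        return next(i for i, col in enumerate(columns)
                    if (cell[0], cell[1]) in col)

    # Down each detected column first, then on to the next column.
    ordered = sorted(cells, key=lambda c: (column_of(c), c[2], c[0]))
    return [c[3] for c in ordered]


# A 2x2 grid of numbers with 20 units of whitespace between them.
cells = [(0, 10, 0, "1"), (30, 40, 0, "2"),
         (0, 10, 1, "3"), (30, 40, 1, "4")]

print(reading_order(cells, max_gap=5))   # prints ['1', '3', '2', '4']
print(reading_order(cells, max_gap=25))  # prints ['1', '2', '3', '4']
```

With a small threshold the 20-unit gap counts as a column break and the numbers come out column by column; with a larger one the same cells are read as ordinary lines of text.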



[Bug 33288] Re: Evince doesn't handle columns properly

2009-11-11 Thread Brian Ewins
David #19: you say "it perhaps recognizes column stuff from the display
layout instead of the internal representation."

In PDF, the internal representation *is* just the display layout.
Internally, poppler tries to divide this text into blocks (roughly
paragraphs) which are then grouped into columns based on spacing, and
independently into 'flows' (roughly, sequences of similar blocks in
reading order), based on a bunch of heuristics. This is already tricky,
but is made more complicated by text rotation, and different writing
systems (vertical, right to left, etc). Acrobat and Apple's Preview use
different heuristics, so they group text differently, and make a mess of
things on different documents - but they still make a mess of things.

Just explaining what's going on here; this isn't to say that text
selection can't be improved. I'm slowly putting together a patch based
on the reading order sort described in
http://pubs.iupr.org/#2003-breuel-sdiut , which seems to be fixing some
of the problems with the
attachment in #7. However as I said to Andres I have no idea when or if
my patches would be accepted.
