Hi,

On 10.08.2015 at 13:22, Gilad Denneboom wrote:
Hi Andreas,

Of course the output itself is different, but I would expect that the
underlying text each tool processes would be the same, and it's not. Have a
look at the first line in the PrintTextLocations output file:
String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
It is repeated, with exactly the same information, 12 times throughout the
output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.

Why would the same information be processed 12 times in a single run?
The PDF contains a lot of redundant information, e.g. the header is repeated several times (I didn't count, but I'd guess 12 times). PDFTextStripper eliminates overlapping text/characters, whereas PrintTextLocations doesn't.
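
To illustrate what "eliminating overlapping text" means, here is a minimal, self-contained sketch. It is not PDFBox's actual implementation: the Fragment class, the dropOverlappingDuplicates helper, the 0.5 tolerance and the sample characters are all made up for illustration, and only the coordinates 472.89,54.0 come from the output quoted above. The idea is simply that a fragment is dropped if the same text was already seen at almost the same position. In PDFBox itself, this behaviour is, as far as I know, controlled on PDFTextStripper via setSuppressDuplicateOverlappingText.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapDedupSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps the first occurrence of each fragment and drops later fragments
     * that carry the same text within `tolerance` units of a kept one.
     */
    static List<Fragment> dropOverlappingDuplicates(List<Fragment> input, double tolerance) {
        List<Fragment> kept = new ArrayList<>();
        for (Fragment candidate : input) {
            boolean duplicate = false;
            for (Fragment seen : kept) {
                if (seen.text.equals(candidate.text)
                        && Math.abs(seen.x - candidate.x) < tolerance
                        && Math.abs(seen.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Fragment> raw = new ArrayList<>();
        // The same (invented) character drawn 12 times at the same position,
        // mimicking a header that is repeated in the page content.
        for (int i = 0; i < 12; i++) {
            raw.add(new Fragment("H", 472.89, 54.0));
        }
        // A different character further along the same line.
        raw.add(new Fragment("e", 475.67, 54.0));

        List<Fragment> cleaned = dropOverlappingDuplicates(raw, 0.5);
        System.out.println(raw.size() + " fragments in, " + cleaned.size() + " out");
        // prints "13 fragments in, 2 out"
    }
}
```

A tool that prints every fragment as it arrives, without a filter of this kind, would report all 13 entries, which matches the sort of difference you are seeing between the two outputs.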

BR
Andreas

Gilad

On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <[email protected]>
wrote:

Hi Gilad,

Sorry for the late answer.

I'm not sure what you're expecting. You are using two totally different
approaches to process a PDF. PrintTextLocations provides a lot of
additional information for every piece of text, which may vary from one
character up to whole words or lines of text. Consequently, the output has
to be totally different and, of course, much bigger than the output of a
simple text extraction.

BR
Andreas

On 10 August 2015 at 10:05, Gilad Denneboom <[email protected]> wrote:


No one has any ideas?

On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom <[email protected]> wrote:

Hi everyone,

I'm looking for advice on a problem I'm encountering where the output of
PDFTextStripper and PrintTextLocations is dramatically different when
processing the same file.
For some reason, the output of PrintTextLocations is 12 times longer than
that of PDFTextStripper, i.e. the entire text is printed out 12 times
instead of just once.

I'm attaching the file in question, as well as the output produced using
both methods via Google Drive... Hopefully it will come through.

I'd appreciate any ideas as to what might be causing this issue (I'm
guessing there's something wrong with the structure of the file), and of
course any possible solutions.

Thanks in advance, Gilad.

PS. I'm using PDFBox 1.8.10.
output problem.zip
<https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]




