Hi,

At this time, the problem I see and want to solve (PDFBOX-2246) exists regardless of whether we use a properties file or initialize directly in the code.

Tilman


On 29.07.2014 at 19:41, John Hewson wrote:
On 29 Jul 2014, at 03:44, Andreas Lehmkühler <andr...@lehmi.de> wrote:

Hi,

it's not a black and white issue (comments inline)

John Hewson <j...@jahewson.com> wrote on 29 July 2014 at 07:44:


Yes, really I should have said subclasses of PDFStreamEngine - that's where
the .properties file originates. I'd propose replacing the properties
mechanism with a simple method containing the mapping which can be overridden
in subclasses. Ultimately, users expect to be able to subclass the behaviour
of a class by just subclassing the class.
PDFStreamEngine doesn't configure any operator set itself. The subclasses are
supposed to configure their own set of operators depending on the particular
use case. E.g. to extend the text extraction one has to subclass PDFTextStripper
and so on.
It’s PDFStreamEngine which implements the .properties mechanism though, via the
PDFStreamEngine(Properties properties) constructor.
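
A minimal sketch of the proposal above, in which the operator mapping lives in an overridable method rather than in a .properties file. All names here (Handler, SketchEngine, ColorAwareEngine, createOperators) are hypothetical and are not PDFBox API; the sketch only illustrates the subclassing pattern under discussion.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical handler type; not the PDFBox OperatorProcessor class. */
interface Handler {
    void handle(List<String> operands);
}

/** Hypothetical engine whose operator table is built by an overridable method. */
class SketchEngine {
    final List<String> log = new ArrayList<>();
    private Map<String, Handler> operators; // built lazily, on first use

    /** Subclasses override this and call super to extend the mapping. */
    protected Map<String, Handler> createOperators() {
        Map<String, Handler> map = new LinkedHashMap<>();
        map.put("BT", operands -> log.add("begin text"));
        map.put("ET", operands -> log.add("end text"));
        return map;
    }

    /** Dispatches one operator; returns false if no handler is registered. */
    boolean process(String operator, List<String> operands) {
        if (operators == null) {
            operators = createOperators();
        }
        Handler handler = operators.get(operator);
        if (handler == null) {
            return false;
        }
        handler.handle(operands);
        return true;
    }
}

/** Adds a colour operator purely by subclassing, with no external file. */
class ColorAwareEngine extends SketchEngine {
    @Override
    protected Map<String, Handler> createOperators() {
        Map<String, Handler> map = super.createOperators();
        map.put("rg", operands -> log.add("fill colour " + String.join(" ", operands)));
        return map;
    }
}
```

With this shape the mapping is visible to the compiler and to an IDE, and a subclass changes it by overriding one method.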

E.g. to extend the text extraction one has to subclass PDFTextStripper and so
on.
That’s true, but it’s only half the story. Don’t forget that the .properties
files need to be copied and pasted elsewhere and modified, along with
overriding which .properties file is passed in the constructor, if you want to
truly override the class’ behaviour.

We've seen a number of incidents of confusion on the mailing list due to the
current design.
IMHO, most of the confusion is based on the lack of knowledge of the pdf spec.
One can't understand how pdfbox works under the hood by simply looking at the
code. One has to understand the pdf spec as well, at least the base concepts.
I’m specifically talking about confusion surrounding how to override operators
and .properties files; this has come up before. This entire thread has been
caused by PDFBox’s design and *not* the PDF spec.

I'd say that to the modern Java developer having non-code runtime binding has
become an anti-pattern, resulting in brittle code which can't easily be
navigated in an IDE and which resists automated analysis and exhibits runtime
failures despite compiling ok. This is one of those cases where the collective
wisdom has just evolved over the years.
It depends on the given use case. All solutions have advantages and
disadvantages. E.g. if someone wants to configure the PDFTextStripper without
recompiling the code, it is quite handy to keep the configuration in a text
file.
Has anybody *ever* wanted to change the operators which PDFTextStripper is
processing without recompiling the code? These are internal implementation
details that shouldn’t be exposed in the first place - it’s not a
“configuration” at all, especially as 99% of possible changes would just break
PDFTextStripper.

In this case I'm neither pro nor con a text based config, but I tend to agree
with John to have the different configurations in some method within the
subclasses of PDFStreamEngine.
As above, this isn’t “configuration” at all, it lacks even a basic use case. I
don’t see any pros which aren’t fabricated for the sake of argument, but the
cons are causing us significant problems right here, right now.

BR
Andreas Lehmkühler

-- John

On 28 Jul 2014, at 13:42, Tilman Hausherr <thaush...@t-online.de> wrote:

I disagree - one doesn't *have* to pass a property file to PDFTextStripper
and PageDrawer. The properties file for PDFTextStripper is optional. The
property parameter was already there before it became an apache project.


Tilman



On 28.07.2014 at 22:08, John Hewson wrote:
We need to get rid of these .properties files, they’re causing endless
confusion, not to mention that they hide runtime dependencies in text
files.

We should make it so that overriding a TextStripper, PageDrawer, etc.
doesn’t require external .properties files. Currently Preflight works in
this manner and it’s much clearer.

I guess this is a legacy of the “old” ways of Java XML everything.

-- John

On 27 Jul 2014, at 10:09, -A <aa...@hrtmn.net> wrote:

Thank you, that works as promised and removes the warning. I'm still hoping
to find a resource that better explains the pieces of PDFBox and how they
work together. Unfortunately most posts on the internet cover only the how
and not the why.

Appreciate it!

-Aaron


On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr <thaush...@t-online.de>
wrote:

Hi,

That didn't happen to me, but maybe it did happen to you with another file.

Another solution would be to pass your own properties file, and it should
have this content:

=======================
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This table maps PDF stream operators to concrete OperatorProcessor
# subclasses that are used by the PDFStreamEngine class to interpret the
# PDF document. The classes configured here allow the PDFTextStripper
# subclass of PDFStreamEngine to extract text content of the document.

BT = org.apache.pdfbox.util.operator.BeginText
cm = org.apache.pdfbox.util.operator.Concatenate
Do = org.apache.pdfbox.util.operator.Invoke
ET = org.apache.pdfbox.util.operator.EndText
gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
q  = org.apache.pdfbox.util.operator.GSave
Q  = org.apache.pdfbox.util.operator.GRestore
T* = org.apache.pdfbox.util.operator.NextLine
Tc = org.apache.pdfbox.util.operator.SetCharSpacing
Td = org.apache.pdfbox.util.operator.MoveText
TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
Tf = org.apache.pdfbox.util.operator.SetTextFont
Tj = org.apache.pdfbox.util.operator.ShowText
TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
TL = org.apache.pdfbox.util.operator.SetTextLeading
Tm = org.apache.pdfbox.util.operator.SetMatrix
Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
Ts = org.apache.pdfbox.util.operator.SetTextRise
Tw = org.apache.pdfbox.util.operator.SetWordSpacing
Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
w  = org.apache.pdfbox.util.operator.SetLineWidth
\' = org.apache.pdfbox.util.operator.MoveAndShow
\" = org.apache.pdfbox.util.operator.SetMoveAndShow

CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
SC=org.apache.pdfbox.util.operator.SetStrokingColor
sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
SCN=org.apache.pdfbox.util.operator.SetStrokingColor
scn=org.apache.pdfbox.util.operator.SetNonStrokingColor

# The following operators are not relevant to text extraction,
# so we can silently ignore them.

b
B
b*
B*
BDC
BI
BMC
BX
c
d
d0
d1
DP
EI
EMC
EX
f
F
f*
h
i
ID
j
J
l
m
M
MP
n
re
ri
s
S
sh
v
W
W*
y

=======================
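
A rough sketch of how a table like the one above reads when loaded with java.util.Properties: each key is an operator, each value is a class name, and a key listed on its own gets an empty value. The class OperatorTableDemo is hypothetical, and the "(ignored)" label only reflects the comment in the file itself; the sketch shows the parsing, not what PDFBox does with each entry.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper; only java.util.Properties is real API here. */
class OperatorTableDemo {
    /** Loads an operator table from text in java.util.Properties syntax. */
    static Properties load(String text) {
        Properties table = new Properties();
        try {
            table.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "BT = org.apache.pdfbox.util.operator.BeginText",
            "rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor",
            "\\' = org.apache.pdfbox.util.operator.MoveAndShow",
            "b",
            "");
        Properties table = load(text);
        for (String operator : table.stringPropertyNames()) {
            String className = table.getProperty(operator);
            // A key with no value is loaded as an empty string.
            System.out.println(operator + " -> "
                + (className.isEmpty() ? "(ignored)" : className));
        }
    }
}
```

Note that the backslash in an entry such as \' is an escape in the properties syntax, so the loaded key is the single quote character itself.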

Tilman

On 27.07.2014 at 15:54, -A wrote:

Tilman,
That is somewhat embarrassing. At one point I brought this to the mailing
list (because of the following warning) and was told to remove that line
because the TextStripper wasn't actually a PageDrawer. The functionality
still worked after that, however.

Is there a way to do this without the warning, perhaps something within
PageDrawer?


Thank you,
-Aaron


WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
   at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
   at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
   at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
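
The trace suggests that an operator class from the PageDrawer table casts its context to PageDrawer, which fails when the context is a text stripper. A minimal sketch of that situation follows; every class name in it is a hypothetical stand-in rather than a PDFBox class, and the guarded variant is only one possible way to avoid the exception.

```java
/** Hypothetical stand-ins for the engine hierarchy; not PDFBox classes. */
class SketchContext {}

class SketchDrawer extends SketchContext {
    int rectangles;
}

class SketchStripper extends SketchContext {}

class RectangleOperator {
    /** Unchecked cast: throws ClassCastException unless the context is a drawer. */
    static void processUnchecked(SketchContext context) {
        SketchDrawer drawer = (SketchDrawer) context;
        drawer.rectangles++;
    }

    /** Guarded cast: reports false instead of throwing for other contexts. */
    static boolean processGuarded(SketchContext context) {
        if (!(context instanceof SketchDrawer)) {
            return false;
        }
        ((SketchDrawer) context).rectangles++;
        return true;
    }
}
```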




On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr <thaush...@t-online.de>
wrote:

It is even easier than I thought - replace super() with this:
super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));

Tilman

On 27.07.2014 at 13:03, Tilman Hausherr wrote:

After having written the text below, I tested by including the "rg"
operator in the properties list and now it worked. I also tested deleting
your println and instead adding this if the text is red:

      System.out.print (textPos.getCharacter());

and so I got this output:

21_Key .1295 R~Wall Prof LinP 0.003             0.004     0.000 true

which is exactly what is red in the PDF.

Another (probably better) way to do it would be not to derive from
PDFTextStripper but from PDFStreamEngine, and to construct it with

ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")


see also http://stackoverflow.com/a/9157714/535646

Tilman


On 27.07.2014 at 12:14, Tilman Hausherr wrote:

Hi,
Do you still have the code that worked?

I'm not the text extraction specialist here, but what I did was to look
in the uncompressed source of the PDF. The stream has code like this:

0 0 0 rg
0 0.5019 0 rg
1 0 0 rg

The first line sets the colour to black, the second to green, the third to
red. And from what I saw, it can't work at all, because the "rg" operator
isn't processed when extracting text, because PDFTextStripper.properties
doesn't contain the "rg" operator. (The operator is in another list, which is
used when rendering.)

So that is what puzzles me. I think it can't work at all. But you said
it did work at one time.
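
To illustrate the point, here is a small sketch of a stream interpreter that only acts on operators present in its table. FillColorTracker is hypothetical and deliberately simplified (numeric operands only, black as the assumed default); it is not how PDFBox parses content streams.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified interpreter; not PDFBox code. */
class FillColorTracker {
    /** Returns the fill colour after the stream as "r g b"; assumed default is black. */
    static String lastFillColor(String stream, Set<String> enabledOperators) {
        String color = "0 0 0";
        List<String> operands = new ArrayList<>();
        for (String token : stream.trim().split("\\s+")) {
            if (token.matches("-?\\d*\\.?\\d+")) {
                operands.add(token); // numbers are collected as operands
                continue;
            }
            // Anything else is an operator; it only has an effect if enabled.
            if (token.equals("rg") && enabledOperators.contains("rg")
                    && operands.size() == 3) {
                color = String.join(" ", operands);
            }
            operands.clear();
        }
        return color;
    }

    public static void main(String[] args) {
        String stream = "0 0 0 rg\n0 0.5019 0 rg\n1 0 0 rg\n";
        System.out.println(lastFillColor(stream, Collections.singleton("rg")));
        System.out.println(lastFillColor(stream, Collections.<String>emptySet()));
    }
}
```

When "rg" is missing from the enabled set, the tracked colour never leaves its default, which matches the behaviour described above.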

Tilman


On 27.07.2014 at 07:43, Tilman Hausherr wrote:

Hi,
Please upload the PDF somewhere and post the URL; PDF files are removed
from the mailing list.

Tilman

On 27.07.2014 at 02:35, -A wrote:

Hello again. I've been trying to figure out this issue that has come
up for me and in my research I found someone posting on StackOverflow
(http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
a similar issue where they could not read any colors from a PDF. The user
posted the code and someone else took it, ran it, and reported that it
worked. The user's approach was different than mine, but alas.

I'm not sure at this point what is going on. I have stepped through
each individual character and checked the PDGraphicsState object, and even
when I am looking at an open file with visibly red text (attached) the
debugger only reports DeviceGray. If I print out the ColorSpace name from
the PDGraphicsState this is what is printed - for every character.

I would appreciate if someone could perhaps run the attached text
stripper with the attached PDF file and report back if it actually prints
true instead of false, as it does for me. Since I saw this occurrence
elsewhere I'd like to rule that out - in case an IDE setting of some sort
may be causing this?

It should be noted that I began using PDFBox with 1.8.5 and had this
code working fine. Still with 1.8.5 yesterday it was failing. Upgrading to
1.8.6 yielded the same results.

If this is an actual issue I do not mind attempting to solve it if
someone may have a general idea where to point me, so as to prevent needless
meddling with graphics state objects. Or, if this should be reported, I can
do that as well.

Thanks!

-Aaron




*Previous Message:*
I've attached an updated stripper file with the only addition being a
main function to test the class specifically.

When run with the PDF I have also attached it indeed does not
recognize the red text.

At this point it seems that this issue is solely dependent on PDFBox.
I'll stay tuned for some insight hopefully. If any other information is
needed, let me know!
