We need to get rid of these .properties files. They cause endless
confusion, and they hide runtime dependencies in text files.
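
To illustrate what I mean by hidden runtime dependencies, here is a minimal sketch that uses only the JDK, not PDFBox. The class name HiddenDependencyDemo, the helper unresolvedOperators and the two-line operator table are made up for the example. The point is that a table of class names is resolved reflectively, so a misspelled or missing class is only noticed when the table is loaded at runtime, never by the compiler.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class HiddenDependencyDemo {

    /**
     * Loads an "operator = class name" table and returns the operators
     * whose configured class cannot be found on the classpath.
     * (Hypothetical helper, not part of PDFBox.)
     */
    static List<String> unresolvedOperators(String propertiesText) throws IOException {
        Properties table = new Properties();
        table.load(new StringReader(propertiesText));

        List<String> missing = new ArrayList<String>();
        for (String operator : table.stringPropertyNames()) {
            String className = table.getProperty(operator).trim();
            try {
                // The dependency is only checked here, at runtime.
                Class.forName(className);
            } catch (ClassNotFoundException e) {
                missing.add(operator);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Made-up table: the second class name does not exist.
        String text = "BT = java.lang.StringBuilder\n"
                    + "rg = com.example.DoesNotExist\n";
        System.out.println(unresolvedOperators(text)); // prints [rg]
    }
}
```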

We should make it so that overriding a TextStripper, PageDrawer, etc. doesn’t
require external .properties files. Preflight already works this way,
and it’s much clearer.
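
A rough sketch of the direction I have in mind, in plain Java with invented names: OperatorHandler, Engine, ColorAwareEngine and register are hypothetical and are not the PDFBox API. The subclass registers its operators in its constructor, so the dependencies are visible to the compiler and adding or overriding an operator is an ordinary method call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OperatorRegistryDemo {

    /** Hypothetical handler interface; stands in for an operator processor. */
    interface OperatorHandler {
        void process(List<Float> operands, List<String> log);
    }

    /** Hypothetical engine that keeps its operator table in code. */
    static class Engine {
        private final Map<String, OperatorHandler> handlers =
                new HashMap<String, OperatorHandler>();
        final List<String> log = new ArrayList<String>();

        void register(String operator, OperatorHandler handler) {
            handlers.put(operator, handler);
        }

        /** Returns true if a handler ran, false if the operator was ignored. */
        boolean processOperator(String operator, List<Float> operands) {
            OperatorHandler handler = handlers.get(operator);
            if (handler == null) {
                return false; // unregistered operators are silently skipped
            }
            handler.process(operands, log);
            return true;
        }
    }

    /** An engine that additionally tracks the non-stroking RGB colour. */
    static class ColorAwareEngine extends Engine {
        ColorAwareEngine() {
            register("rg", new OperatorHandler() {
                @Override
                public void process(List<Float> operands, List<String> log) {
                    log.add("rg " + operands);
                }
            });
        }
    }

    public static void main(String[] args) {
        Engine engine = new ColorAwareEngine();
        System.out.println(engine.processOperator("rg", Arrays.asList(1f, 0f, 0f)));
        System.out.println(engine.processOperator("re", Arrays.asList(0f, 0f, 10f, 10f)));
        System.out.println(engine.log);
    }
}
```

Whether the real classes should expose registration in the constructor or through some other hook is an open design question; the sketch only shows that no text file is needed.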

I guess this is a legacy of the “old” Java way of doing everything in XML.

-- John

On 27 Jul 2014, at 10:09, -A <aa...@hrtmn.net> wrote:

> Thank you, that works as promised and removes the warning. I'm still hoping
> to find a resource that better explains the pieces of PDFBox and how they
> work together. Unfortunately, most posts on the internet cover only the how
> and not the why.
> 
> Appreciate it!
> 
> -Aaron
> 
> 
> On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr <thaush...@t-online.de>
> wrote:
> 
>> Hi,
>> 
>> That didn't happen to me, but maybe it did happen to you with another file.
>> 
>> Another solution would be to pass your own properties file, and it should
>> have this content:
>> 
>> =======================
>> # Licensed to the Apache Software Foundation (ASF) under one or more
>> # contributor license agreements.  See the NOTICE file distributed with
>> # this work for additional information regarding copyright ownership.
>> # The ASF licenses this file to You under the Apache License, Version 2.0
>> # (the "License"); you may not use this file except in compliance with
>> # the License.  You may obtain a copy of the License at
>> #
>> #      http://www.apache.org/licenses/LICENSE-2.0
>> #
>> # Unless required by applicable law or agreed to in writing, software
>> # distributed under the License is distributed on an "AS IS" BASIS,
>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> # See the License for the specific language governing permissions and
>> # limitations under the License.
>> 
>> # This table maps PDF stream operators to concrete OperatorProcessor
>> # subclasses that are used by the PDFStreamEngine class to interpret the
>> # PDF document. The classes configured here allow the PDFTextStripper
>> # subclass of PDFStreamEngine to extract text content of the document.
>> 
>> BT = org.apache.pdfbox.util.operator.BeginText
>> cm = org.apache.pdfbox.util.operator.Concatenate
>> Do = org.apache.pdfbox.util.operator.Invoke
>> ET = org.apache.pdfbox.util.operator.EndText
>> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
>> q  = org.apache.pdfbox.util.operator.GSave
>> Q  = org.apache.pdfbox.util.operator.GRestore
>> T* = org.apache.pdfbox.util.operator.NextLine
>> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
>> Td = org.apache.pdfbox.util.operator.MoveText
>> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
>> Tf = org.apache.pdfbox.util.operator.SetTextFont
>> Tj = org.apache.pdfbox.util.operator.ShowText
>> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
>> TL = org.apache.pdfbox.util.operator.SetTextLeading
>> Tm = org.apache.pdfbox.util.operator.SetMatrix
>> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
>> Ts = org.apache.pdfbox.util.operator.SetTextRise
>> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
>> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
>> w  = org.apache.pdfbox.util.operator.SetLineWidth
>> \' = org.apache.pdfbox.util.operator.MoveAndShow
>> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>> 
>> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
>> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
>> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
>> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
>> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
>> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
>> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
>> SC=org.apache.pdfbox.util.operator.SetStrokingColor
>> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
>> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
>> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>> 
>> # The following operators are not relevant to text extraction,
>> # so we can silently ignore them.
>> 
>> b
>> B
>> b*
>> B*
>> BDC
>> BI
>> BMC
>> BX
>> c
>> d
>> d0
>> d1
>> DP
>> EI
>> EMC
>> EX
>> f
>> F
>> f*
>> h
>> i
>> ID
>> j
>> J
>> l
>> m
>> M
>> MP
>> n
>> re
>> ri
>> s
>> S
>> sh
>> v
>> W
>> W*
>> y
>> 
>> =======================
>> 
>> Tilman
>> 
>> On 27.07.2014 15:54, -A wrote:
>> 
>>> Tilman,
>>> 
>>> That is somewhat embarrassing. At one point I brought this to the mailing
>>> list (because of the following warning) and was told to remove that line
>>> because the TextStripper wasn't actually a PageDrawer. The functionality
>>> still worked after that, however.
>>> 
>>> Is there a way to do this without the warning, perhaps something within
>>> PageDrawer?
>>> 
>>> 
>>> Thank you,
>>> -Aaron
>>> 
>>> 
>>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>>     at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>     at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>>     at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr <thaush...@t-online.de>
>>> wrote:
>>> 
>>>> It is even easier than I thought - replace super() with this:
>>>> 
>>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>> 
>>>> Tilman
>>>> 
>>>> On 27.07.2014 13:03, Tilman Hausherr wrote:
>>>> 
>>>>> After having written the text below, I tested by including the "rg"
>>>>> operator in the properties list and now it worked. I also tested deleting
>>>>> your println and instead adding this if the text is red:
>>>>> 
>>>>>     System.out.print (textPos.getCharacter());
>>>>> 
>>>>> and so I got this output:
>>>>> 
>>>>> 21_Key .1295 R~Wall Prof LinP 0.003             0.004     0.000 true
>>>>> 
>>>>> which is exactly what is red in the PDF.
>>>>> 
>>>>> Another way (probably better) to do it would be to derive not from
>>>>> PDFTextStripper but from PDFStreamEngine, and construct it with
>>>>> 
>>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>> 
>>>>> 
>>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>> 
>>>>> Tilman
>>>>> 
>>>>> 
>>>>> On 27.07.2014 12:14, Tilman Hausherr wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Do you still have the code that worked?
>>>>>> 
>>>>>> I'm not the text extraction specialist here, but what I did was to look
>>>>>> in the uncompressed source of the PDF. The stream has code like this:
>>>>>> 
>>>>>> 0 0 0 rg
>>>>>> 0 0.5019 0 rg
>>>>>> 1 0 0 rg
>>>>>> 
>>>>>> The first line sets the color to black, the second to green, the third
>>>>>> to red. From what I saw, it can't work at all: the "rg" operator isn't
>>>>>> processed when extracting text, because PDFTextStripper.properties
>>>>>> doesn't contain it. (The operator is in another list, which is used
>>>>>> when rendering.)
>>>>>> 
>>>>>> So that is what puzzles me. I think it can't work at all. But you said
>>>>>> it did work at one time.
>>>>>> 
>>>>>> Tilman
>>>>>> 
>>>>>> 
>>>>>> On 27.07.2014 07:43, Tilman Hausherr wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Please upload the PDF somewhere and post the URL; PDF files are
>>>>>>> removed from the mailing list.
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
>>>>>>> On 27.07.2014 02:35, -A wrote:
>>>>>>> 
>>>>>>>> Hello again. I've been trying to figure out this issue that has come
>>>>>>>> up for me, and in my research I found someone posting on StackOverflow
>>>>>>>> (http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>>>>> about a similar issue where they could not read any colors from a PDF.
>>>>>>>> The user posted the code and someone else took it, ran it, and reported
>>>>>>>> that it worked. The user's approach was different from mine, but alas.
>>>>>>>> 
>>>>>>>> I'm not sure at this point what is going on. I have stepped through
>>>>>>>> each individual character and checked the PDGraphicsState object, and
>>>>>>>> even when I am looking at an open file with visibly red text (attached)
>>>>>>>> the debugger only reports DeviceGray. If I print out the ColorSpace
>>>>>>>> name from the PDGraphicsState, this is what is printed - for every
>>>>>>>> character.
>>>>>>>> 
>>>>>>>> I would appreciate it if someone could run the attached text stripper
>>>>>>>> with the attached PDF file and report back whether it actually prints
>>>>>>>> true rather than the false I get. Since I saw this occurrence
>>>>>>>> elsewhere I'd like to rule that out - in case an IDE setting of some
>>>>>>>> sort may be causing this?
>>>>>>>> 
>>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and had this
>>>>>>>> code working fine. Still with 1.8.5, yesterday it was failing.
>>>>>>>> Upgrading to 1.8.6 yielded the same results.
>>>>>>>> 
>>>>>>>> If this is an actual issue I do not mind attempting to solve it, if
>>>>>>>> someone has a general idea where to point me so as to prevent needless
>>>>>>>> meddling with graphics state objects. Or, if this should be reported, I
>>>>>>>> can do that as well.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> -Aaron
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> *Previous Message:*
>>>>>>>> 
>>>>>>>> I've attached an updated stripper file with the only addition being a
>>>>>>>> main function to test the class specifically.
>>>>>>>> 
>>>>>>>> When run with the PDF I have also attached, it indeed does not
>>>>>>>> recognize the red text.
>>>>>>>> 
>>>>>>>> At this point it seems that this issue is solely dependent on PDFBox.
>>>>>>>> I'll stay tuned for some insight hopefully. If any other information
>>>>>>>> is needed, let me know!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>> 
