Re: PDFMergerUtility creating PDFs with incorrect version

2024-06-12 Thread Tilman Hausherr
Please upload 2 files to a sharehoster and post the code you used; I'll 
then investigate. It should take the highest version; here's an excerpt 
of our source code:


    // use the highest version number for the resulting pdf
    float destVersion = destination.getVersion();
    float srcVersion = source.getVersion();

    if (destVersion < srcVersion)
    {
        destination.setVersion(srcVersion);
    }
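
A minimal standalone sketch of the rule in the excerpt above, assuming versions are plain floats. The class and method names (MergedVersion, highestVersion) are hypothetical and not part of PDFBox; the sketch only shows that merging 1.5 inputs into a 1.4 destination should yield 1.5 under this rule.

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox: mirrors the "highest version wins" rule quoted above.
final class MergedVersion {
    static float highestVersion(float destVersion, List<Float> sourceVersions) {
        float result = destVersion;
        for (float srcVersion : sourceVersions) {
            // Same comparison as in the excerpt: only ever raise the version, never lower it.
            if (result < srcVersion) {
                result = srcVersion;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Destination starts at 1.4, both inputs are 1.5.
        System.out.println(highestVersion(1.4f, List.of(1.5f, 1.5f)));
    }
}
```

Under this rule alone the result can never exceed the highest input version, so an output of 1.6 from 1.5 inputs would have to come from somewhere else in the save path.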

Tilman

On 12.06.2024 12:34, Suryavanshi, Sajal wrote:

Hi,

We tried the solution below:
use CompressParameters.NO_COMPRESSION as second parameter. For the first 
parameter, use IOUtils.createMemoryOnlyStreamCache()

But this now generates the PDF in version 1.4; we expect the output to 
stay at version 1.5, which is the original version of the input PDF.
Could you please advise?

Thanks,
Sajal

From: Suryavanshi, Sajal
Sent: Thursday, June 6, 2024 5:28 PM
To: users@pdfbox.apache.org
Cc: Pradhan, Kartik 
Subject: PDFMergerUtility creating PDFs with incorrect version

Need help with the issue below:

PDFBox: version 3.0.2
Input PDFs: version 1.5

When we use PDFMergerUtility.mergeDocuments to merge the input PDFs, it 
creates output PDFs with version 1.6.
We also tried PDFMergerUtility.appendDocument and set the version of the 
destination document to 1.5 with PDDocument.setVersion; the output PDF is 
still version 1.6.
PDDocument.getVersion reports the correct version 1.5, but the PDF version 
under the document properties still shows 1.6, which is why all other 
tools treat the file as version 1.6.

Thanks in advance.

Thanks,
Sajal

The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you. Message Encrypted via TLS 
connection




-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



Re: PDFMergerUtility creating PDFs with incorrect version

2024-06-10 Thread Tilman Hausherr

Hi,

Please try this: use CompressParameters.NO_COMPRESSION as second 
parameter. For the first parameter, use 
IOUtils.createMemoryOnlyStreamCache()


Tilman




Fwd: Re: PDFBox Loader Issue

2024-06-07 Thread Tilman Hausherr
Ah, that's commons logging that is missing. You have to add that one as well.

Alternatively use pdfbox-app, that one has everything. 

https://commons.apache.org/proper/commons-logging/download_logging.cgi
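
A small diagnostic sketch for this kind of ClassNotFoundException, assuming it is run with the same classpath as the failing program. The class name ClasspathCheck is hypothetical; org.apache.pdfbox.Loader is the class from the original report, and org.apache.commons.logging.LogFactory is assumed here as a representative commons-logging class.

```java
// Hypothetical diagnostic, not part of PDFBox: reports which classes the running JVM can load.
final class ClasspathCheck {
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.pdfbox.Loader",
            "org.apache.commons.logging.LogFactory"
        };
        // The effective runtime classpath, which may differ from the IDE's build path.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

If either line prints MISSING, the corresponding jar is not on the runtime classpath, even if the project compiles.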

Tilman 





-- Original Message --
From: AllenM10 
Subject: Re: PDFBox Loader Issue
Date: 07.06.2024, 18:08
To: users@pdfbox.apache.org

Tilman,

I refreshed my .jar files, removed them from my project and readded them
from the files on my device, and redownloaded them from the website and
readded them in that order. Unfortunately, following each of these actions,
I am faced with the same issue. Please let me know if I have misunderstood
the meaning of "reload" in this case -- is it a special function?
For full transparency, I am attaching a full video of the process as I
understand it now. If I am doing anything wrong it should be far easier to
identify it like this.
[
https://drive.google.com/file/d/1enO5zC-HXMxdhF7nTGplRfdUaONLD7I8/view?usp=sharing
]

Andreas,

As far as I know, Eclipse is compiling fine, and I am able to edit my code.
I am able to run my class and the elements of my program which do not use
PDFBox work as they always have. I am able to run my src as a Java
Application without error, until my code attempts to use the external
library.

Regards,
Allen Marshall


On Thu, Jun 6, 2024 at 1:29 AM Andreas Lehmkühler 
wrote:

>
>
> On 05.06.24 at 15:21, AllenM10 wrote:
> > Hi Andreas,
> >
> > Indeed, it should work that simply! However I have attempted to follow
> > those steps now about five times over using as many and more tutorials,
> and
> > after completely uninstalling and reinstalling my IDE. I am not
> attempting
> > to use Maven or Gradle. The loader (which I've confirmed is located in
> one
> > of the jar files I am utilizing) is still giving me issues with a
> > ClassNotFoundException. I am making sure to follow online sources
> speaking
> > about version 3.0 and above.
> Are you able to edit your source, and does Eclipse compile it for
> you? The project configuration is OK if that works.
>
> To run your class, mark it on the left hand side in the package
> explorer, open the context menu and choose "Run As -> Java Application"
>
> That's it.
>
> Andreas
>
> >
> > I'll look into more tutorials related to Eclipse and unrelated to PDFBox
> to
> > see if I can learn anything useful. Thank you for your reply!
> >
> > On Wed, Jun 5, 2024, 02:17 Andreas Lehmkühler 
> > wrote:
> >
> >> Hi,
> >>
> >> All IDEs I know work similarly.
> >>
> >> Create a new project using File -> New -> Java Project, follow the
> >> instructions
> >>
> >> Add all needed jars to your environment using Project -> Properties ->
> >> Java Build Path -> Libraries
> >>
> >> That's it for a simple project without using any build tool like maven
> >> or gradle.
> >>
> >> There are tons of tutorials on how to setup/configure an eclipse
> >> project. It isn't hard to find them using your favorite search engine.
> >>
> >>
> >>
> >> On 03.06.24 at 19:44, AllenM10 wrote:
> >>> Tilman,
> >>>
> >>> Thank you very much for your response! I happened upon both of those
> >>> links while trying to solve the problem on my own, but ran through them
> >>> again now since I've just recently freshly installed my Eclipse to make
> >>> sure it wasn't an issue with my program.
> >>>
> >>> I tried moving my .jars from Modulepath into Classpath and got a
> >>> slightly different error message. This time it's a NoClassDefFoundError
> >>> caused by a ClassNotFoundException. I've attached a new screenshot with
> >>> that error shown.
> >>>
> >>> On the second link, I attempted Solution 2's "Example working JavaFX
> >>> application with maven," and am still getting the same error.
> >>>
> >>> I strongly suspect this is something to do with the pom.xml file I see
> >>> floating around. Only one tutorial I happened across suggested
> adjusting
> >>> it slightly for a PDFBox install. Is that a step I might be missing? I
> >>> hadn't thought much of it since it isn't mentioned anywhere else.
> >>>
> >>> Thank you for your time.
> >>>
> >>> Regards,
> >>> Allen Marshall
> >>>
> >>>
> >>> On Mon, Jun 3, 2024 at 9:53 AM Tilman Hausherr  >>> <mailto:thaush...@t-online.de>> wrot

Fwd: Re: PDFBox Loader Issue

2024-06-05 Thread Tilman Hausherr


Please try to reload the pdfbox jar file

Tilman 





-- Original Message --
From: AllenM10 
Subject: Re: PDFBox Loader Issue
Date: 05.06.2024, 15:22
To: users@pdfbox.apache.org

Hi Andreas,

Indeed, it should work that simply! However I have attempted to follow
those steps now about five times over using as many and more tutorials, and
after completely uninstalling and reinstalling my IDE. I am not attempting
to use Maven or Gradle. The loader (which I've confirmed is located in one
of the jar files I am utilizing) is still giving me issues with a
ClassNotFoundException. I am making sure to follow online sources speaking
about version 3.0 and above.

I'll look into more tutorials related to Eclipse and unrelated to PDFBox to
see if I can learn anything useful. Thank you for your reply!

On Wed, Jun 5, 2024, 02:17 Andreas Lehmkühler 
wrote:

> Hi,
>
> All IDEs I know work similarly.
>
> Create a new project using File -> New -> Java Project, follow the
> instructions
>
> Add all needed jars to your environment using Project -> Properties ->
> Java Build Path -> Libraries
>
> That's it for a simple project without using any build tool like maven
> or gradle.
>
> There are tons of tutorials on how to setup/configure an eclipse
> project. It isn't hard to find them using your favorite search engine.
>
>
>
> On 03.06.24 at 19:44, AllenM10 wrote:
> > Tilman,
> >
> > Thank you very much for your response! I happened upon both of those
> > links while trying to solve the problem on my own, but ran through them
> > again now since I've just recently freshly installed my Eclipse to make
> > sure it wasn't an issue with my program.
> >
> > I tried moving my .jars from Modulepath into Classpath and got a
> > slightly different error message. This time it's a NoClassDefFoundError
> > caused by a ClassNotFoundException. I've attached a new screenshot with
> > that error shown.
> >
> > On the second link, I attempted Solution 2's "Example working JavaFX
> > application with maven," and am still getting the same error.
> >
> > I strongly suspect this is something to do with the pom.xml file I see
> > floating around. Only one tutorial I happened across suggested adjusting
> > it slightly for a PDFBox install. Is that a step I might be missing? I
> > hadn't thought much of it since it isn't mentioned anywhere else.
> >
> > Thank you for your time.
> >
> > Regards,
> > Allen Marshall
> >
> >
> > On Mon, Jun 3, 2024 at 9:53 AM Tilman Hausherr  > <mailto:thaush...@t-online.de>> wrote:
> >
> > Hi,
> >
> > I don't use eclipse myself. There have been questions about it from
> > time
> > to time, I assume that this IDE isn't intuitive enough to "just
> work".
> > Here is an answer about this:
> >
> https://stackoverflow.com/questions/17452442/how-do-i-use-pdfbox-with-eclipse-does-it-package-in-jar-files
> <
> https://stackoverflow.com/questions/17452442/how-do-i-use-pdfbox-with-eclipse-does-it-package-in-jar-files
> >
> >
> > Is this "personal register" related to the specific project? Or is it
> > just a register of libraries you like? Or is it the name of your
> > project?
> >
> > This question
> >
> https://stackoverflow.com/questions/77175834/i-cant-seem-to-get-pdfbox-3-0-0-jar-in-my-classpath-library-to-import-to-my-mai
> <
> https://stackoverflow.com/questions/77175834/i-cant-seem-to-get-pdfbox-3-0-0-jar-in-my-classpath-library-to-import-to-my-mai
> >
> > doesn't answer but it shows a "class path" which your screenshot
> > doesn't
> > show so maybe investigate that, i.e. whether the libraries are in
> your
> > classpath at runtime.
> >
> > pdfbox-3.0.2.jar is the correct file (one of them). You likely won't
> > need preflight, pdfdebugger and xmpbox.
> >
> > Tilman
> >
> > On 03.06.2024 15:33, AllenM10 wrote:
> >  > Acknowledged!
> >  >
> >  > I've uploaded the screenshot to Google Drive. It should be
> > viewable at this
> >  > link: [
> >  >
> >
> https://drive.google.com/file/d/1DhlyOqRgSB-Vo6Dldij0qjS2e9KkSSsI/view?usp=sharing
> <
> https://drive.google.com/file/d/1DhlyOqRgSB-Vo6Dldij0qjS2e9KkSSsI/view?usp=sharing
> >
> >  > ]
> >  >
> >  > pdfbox-3.0.2.jar certainly sounds like the file you mean, and I
> > can't find
> >  > a pdfbox.jar file on the website. Perhaps I added them to Eclipse
> 

Re: PDFBox Loader Issue

2024-06-03 Thread Tilman Hausherr
pom.xml is if you are using maven. You don't need it (assuming that 
eclipse can also combine stuff manually). If you do use it, make sure 
that when you're running the program that your .jar files are in the 
classpath.


You can try this manually, i.e.  java -cp <classpath> register

if "register.class" exists after the build.

As I said, I don't know eclipse. There's probably an easier answer. I 
use netbeans which IMHO is easier to use for beginners but that's just 
my opinion, I'm the only one in the team who uses it 


Tilman

On 03.06.2024 19:44, AllenM10 wrote:

Tilman,

Thank you very much for your response! I happened upon both of those 
links while trying to solve the problem on my own, but ran through 
them again now since I've just recently freshly installed my Eclipse 
to make sure it wasn't an issue with my program.


I tried moving my .jars from Modulepath into Classpath and got a 
slightly different error message. This time it's a 
NoClassDefFoundError caused by a ClassNotFoundException. I've attached 
a new screenshot with that error shown.


On the second link, I attempted Solution 2's "Example working JavaFX 
application with maven," and am still getting the same error.


I strongly suspect this is something to do with the pom.xml file I see 
floating around. Only one tutorial I happened across suggested 
adjusting it slightly for a PDFBox install. Is that a step I might be 
missing? I hadn't thought much of it since it isn't mentioned anywhere 
else.


Thank you for your time.

Regards,
Allen Marshall


On Mon, Jun 3, 2024 at 9:53 AM Tilman Hausherr  
wrote:


Hi,

I don't use eclipse myself. There have been questions about it
from time
to time, I assume that this IDE isn't intuitive enough to "just
work".
Here is an answer about this:

https://stackoverflow.com/questions/17452442/how-do-i-use-pdfbox-with-eclipse-does-it-package-in-jar-files

Is this "personal register" related to the specific project? Or is it
just a register of libraries you like? Or is it the name of your
project?

This question

https://stackoverflow.com/questions/77175834/i-cant-seem-to-get-pdfbox-3-0-0-jar-in-my-classpath-library-to-import-to-my-mai
doesn't answer but it shows a "class path" which your screenshot
doesn't
show so maybe investigate that, i.e. whether the libraries are in
your
classpath at runtime.

pdfbox-3.0.2.jar is the correct file (one of them). You likely won't
need preflight, pdfdebugger and xmpbox.

Tilman

On 03.06.2024 15:33, AllenM10 wrote:
> Acknowledged!
>
> I've uploaded the screenshot to Google Drive. It should be
viewable at this
> link: [
>

https://drive.google.com/file/d/1DhlyOqRgSB-Vo6Dldij0qjS2e9KkSSsI/view?usp=sharing
> ]
>
> pdfbox-3.0.2.jar certainly sounds like the file you mean, and I
can't find
> a pdfbox.jar file on the website. Perhaps I added them to Eclipse
> incorrectly?
>
> Regards,
> Allen Marshall
>
>
> On Sun, Jun 2, 2024 at 11:10 PM Tilman
Hausherr
> wrote:
>
>> Hi,
>>
>> It's in the pdfbox jar. Your screenshot was removed because most
>> attachments are removed from the mailing list. You need to have
these
>> inline (thunderbird does this, not sure about others) or upload
them to
>> a sharehoster, e.g.https://imgur.com/  for images.
>>
>> Tilman
>>
>> On 02.06.2024 16:24, AllenM10 wrote:
>>> Good day,
>>>
>>> I have been having trouble attempting to load in text from a
PDF using
>>> the 3.0.2 version of PDFBox. It seems my profile cannot detect
>>> "org.apache.pdfbox.Loader" or I have neglected to install it
(in which
>>> case, I wonder where one might find it?) and is throwing a
>>> ClassNotFoundException. I am attaching a screenshot which
shows both
>>> my Referenced Libraries dropdown, containing the .jar files I
>>> downloaded from [https://pdfbox.apache.org/download.html ], as
well
>>> as the error message in question and the portion of my code which
>>> throws the error.
>>>
>>> Where can I find the loader class, to properly use it? I
apologize for
>>> my barely rudimentary understanding of these systems.
>>>
>>> Regards,
>>> Allen Marshall
>>>
>>>




Re: Request for pdfbox clarifications

2024-06-03 Thread Tilman Hausherr

On 03.06.2024 15:42, karunakaran mirnalini wrote:

Hi, We are using the pdfbox library and have a few questions about writing an AWT 
Book to a PDF document. Please let us know how to proceed.


I haven't understood the question, but maybe 
https://github.com/rototor/pdfbox-graphics2d is what you want.


Tilman





Re: PDFBox Loader Issue

2024-06-03 Thread Tilman Hausherr

Hi,

I don't use eclipse myself. There have been questions about it from time 
to time, I assume that this IDE isn't intuitive enough to "just work". 
Here is an answer about this:

https://stackoverflow.com/questions/17452442/how-do-i-use-pdfbox-with-eclipse-does-it-package-in-jar-files

Is this "personal register" related to the specific project? Or is it 
just a register of libraries you like? Or is it the name of your project?


This question
https://stackoverflow.com/questions/77175834/i-cant-seem-to-get-pdfbox-3-0-0-jar-in-my-classpath-library-to-import-to-my-mai
doesn't answer but it shows a "class path" which your screenshot doesn't 
show so maybe investigate that, i.e. whether the libraries are in your 
classpath at runtime.


pdfbox-3.0.2.jar is the correct file (one of them). You likely won't 
need preflight, pdfdebugger and xmpbox.


Tilman

On 03.06.2024 15:33, AllenM10 wrote:

Acknowledged!

I've uploaded the screenshot to Google Drive. It should be viewable at this
link: [
https://drive.google.com/file/d/1DhlyOqRgSB-Vo6Dldij0qjS2e9KkSSsI/view?usp=sharing
]

pdfbox-3.0.2.jar certainly sounds like the file you mean, and I can't find
a pdfbox.jar file on the website. Perhaps I added them to Eclipse
incorrectly?

Regards,
Allen Marshall


On Sun, Jun 2, 2024 at 11:10 PM Tilman Hausherr
wrote:


Hi,

It's in the pdfbox jar. Your screenshot was removed because most
attachments are removed from the mailing list. You need to have these
inline (thunderbird does this, not sure about others) or upload them to
a sharehoster, e.g.https://imgur.com/  for images.

Tilman

On 02.06.2024 16:24, AllenM10 wrote:

Good day,

I have been having trouble attempting to load in text from a PDF using
the 3.0.2 version of PDFBox. It seems my profile cannot detect
"org.apache.pdfbox.Loader" or I have neglected to install it (in which
case, I wonder where one might find it?) and is throwing a
ClassNotFoundException. I am attaching a screenshot which shows both
my Referenced Libraries dropdown, containing the .jar files I
downloaded from [https://pdfbox.apache.org/download.html  ], as well
as the error message in question and the portion of my code which
throws the error.

Where can I find the loader class, to properly use it? I apologize for
my barely rudimentary understanding of these systems.

Regards,
Allen Marshall





Re: File size issue when we use arial-unicode-ms font with pdfbox

2024-06-03 Thread Tilman Hausherr

There were two answers
https://lists.apache.org/thread/pn1hvwvlszz9h74gy9vv7nx7gt30qlbt

Tilman

On 03.06.2024 14:23, Anil Basavaraju wrote:

Hi Pdfbox team,

Can you please check our request below.

Thank You,
Anil B

From: Anil Basavaraju
Sent: Thursday, May 30, 2024 9:06 PM
To: users@pdfbox.apache.org
Cc: Sudarshan Desai 
Subject: File size issue when we use arial-unicode-ms font with pdfbox

Hi Pdfbox team,

We have a requirement to prefill the pdf with different languages including CJK 
(Chinese, Japanese, Korean) and some special characters.
So we embedded arial-unicode-ms.ttf with pdfbox. The ttf file is about 22.7 MB. 
We are able to prefill the pdf form with the languages and special characters.
But the issue is the size of the pdf file becomes around 27 MB. Before 
prefilling the pdf file size is around 2 MB.
We need to upload this pdf file after prefilling, which is taking more time 
which is not a good user experience.
Is there any way where we can reduce the pdf file size using pdfbox?

Note: If we use any other font/ttf file which is smaller in size, it is not 
supporting CJK.

Thanks,
Anil







Re: PDFBox Loader Issue

2024-06-02 Thread Tilman Hausherr

Hi,

It's in the pdfbox jar. Your screenshot was removed because most 
attachments are removed from the mailing list. You need to have these 
inline (thunderbird does this, not sure about others) or upload them to 
a sharehoster, e.g. https://imgur.com/ for images.


Tilman

On 02.06.2024 16:24, AllenM10 wrote:

Good day,

I have been having trouble attempting to load in text from a PDF using 
the 3.0.2 version of PDFBox. It seems my profile cannot detect 
"org.apache.pdfbox.Loader" or I have neglected to install it (in which 
case, I wonder where one might find it?) and is throwing a 
ClassNotFoundException. I am attaching a screenshot which shows both 
my Referenced Libraries dropdown, containing the .jar files I 
downloaded from [ https://pdfbox.apache.org/download.html ], as well 
as the error message in question and the portion of my code which 
throws the error.


Where can I find the loader class, to properly use it? I apologize for 
my barely rudimentary understanding of these systems.


Regards,
Allen Marshall





Re: File size issue when we use arial-unicode-ms font with pdfbox

2024-05-30 Thread Tilman Hausherr

Hi,

It's not possible and yes, it's a weakness. I think I had some wild hack 
years ago that needed a change in PDFBox itself but I can't find it anymore.


What might work if you can build from source: expand PDTextField so you 
can pass a font (which you create with the subset flag on); pass this 
font to AppearanceGeneratorHelper; there change the line "PDFont font = 
defaultAppearance.getFont();"; when done with the font, call font.subset().


Tilman

On 30.05.2024 17:36, Anil Basavaraju wrote:

Hi Pdfbox team,

We have a requirement to prefill the pdf with different languages including CJK 
(Chinese, Japanese, Korean) and some special characters.
So we embedded arial-unicode-ms.ttf with pdfbox. The ttf file is about 22.7 MB. 
We are able to prefill the pdf form with the languages and special characters.
But the issue is the size of the pdf file becomes around 27 MB. Before 
prefilling the pdf file size is around 2 MB.
We need to upload this pdf file after prefilling, which is taking more time 
which is not a good user experience.
Is there any way where we can reduce the pdf file size using pdfbox?

Note: If we use any other font/ttf file which is smaller in size, it is not 
supporting CJK.

Thanks,
Anil







Re: issue migrating to PDFBox 3.0.2, 'Charsets'

2024-05-29 Thread Tilman Hausherr

On 29.05.2024 13:29, Rich Stafford wrote:

case PATTERN_CROSS_DIAGONAL:
    oOS.write("0 0 m 10 10 l 10 0 m 0 10 l s".getBytes(Charsets.US_ASCII));
    break;


What is the equivalent expression for PDFBox 3.0?



java.nio.charset.StandardCharsets.US_ASCII
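
A minimal sketch of the replacement, using only the JDK. The class name PatternWriter and the helper crossDiagonalBytes are hypothetical; the operator string is the one from the snippet quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper for illustration; only StandardCharsets.US_ASCII is the actual replacement.
final class PatternWriter {
    // Content-stream operators taken from the quoted snippet.
    static final String CROSS_DIAGONAL = "0 0 m 10 10 l 10 0 m 0 10 l s";

    static byte[] crossDiagonalBytes() {
        // JDK constant used in place of the former Charsets.US_ASCII.
        return CROSS_DIAGONAL.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream oOS = new ByteArrayOutputStream();
        byte[] bytes = crossDiagonalBytes();
        oOS.write(bytes, 0, bytes.length);
        // Decode again to show the round trip is lossless for pure ASCII input.
        System.out.println(new String(oOS.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```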

Tilman


Re: Radio Button not set correctly

2024-05-28 Thread Tilman Hausherr

On 28.05.2024 17:27, Martin Resch wrote:

Hi Tilman,

thanks a lot for the analysis!

So am I correct in assuming that you will raise a bug ticket?


https://issues.apache.org/jira/browse/PDFBOX-5831

I'll wait a day or two because I'm not the acroform guy, and then I'll 
fix it myself.



I am not the creator of this PDF. We have to deal with the official PDFs 
provided by the government institution that you mentioned. More PDFs can be 
found here: https://www.bundesfreiwilligendienst.de/service/downloads
I hope they haven't published more PDFs that fishy.


I changed the PDF so that the /Opt entry has "A" and "B" instead of "1" and "2" 
and then it works.

For my knowledge in the meantime for a potential workaround: how did you do 
that?


I edited the PDF with Notepad++. But it should also be possible this way:

field.setExportValues(List.of("A", "B"));

the field must be cast to a PDRadioButton. However I noticed that after 
doing this change, I was no longer able to edit it with Adobe Reader.


Another problem is that I get a sort of error message:

This might be because of UR3 usage rights signature.

Tilman






Best regards
Martin



On 28.05.2024 at 05:24, Tilman Hausherr wrote:

On 27.05.2024 21:01, Tilman Hausherr wrote:

I'll have another look tomorrow when I'm more awake. I just looked and it happens like you wrote. I 
traced through the code and it seemed to work properly, i.e. going through different paths for 
"1" and "2" (looking for dictionary elements 0 and 1) but the result was always 
the same which contradicts the observation, but that is the fascination in debugging.

I had another look. The values from the /Opt entry are 1 and 2, the values at the 
dictionary level are 0 and 1. Our software somehow gets confused when 1 is used because 
it appears in both: when the value "1" is set, then PDButton.updateByOption() 
is called twice (!!!), once with value 1 and once with 2.

I changed the PDF so that the /Opt entry has "A" and "B" instead of "1" and "2" 
and then it works.

So I'd say it's a PDFBox bug. The good thing is that the copyright of that PDF 
would be with a government institution (BAFzA) so we can use it as a test.

Are you the creator of that PDF?

Tilman





Re: Radio Button not set correctly

2024-05-27 Thread Tilman Hausherr

On 27.05.2024 21:01, Tilman Hausherr wrote:
I'll have another look tomorrow when I'm more awake. I just looked and 
it happens like you wrote. I traced through the code and it seemed to 
work properly, i.e. going through different paths for "1" and "2" 
(looking for dictionary elements 0 and 1) but the result was always 
the same which contradicts the observation, but that is the 
fascination in debugging. 


I had another look. The values from the /Opt entry are 1 and 2, the 
values at the dictionary level are 0 and 1. Our software somehow gets 
confused when 1 is used because it appears in both: when the value "1" 
is set, then PDButton.updateByOption() is called twice (!!!), once with 
value 1 and once with 2.


I changed the PDF so that the /Opt entry has "A" and "B" instead of "1" 
and "2" and then it works.
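
A toy model of the clash described above, not PDFBox code: it only checks which strings occur both as /Opt export values and as on-state names at the dictionary level. The class and method names (RadioValueClash, ambiguousValues) are hypothetical, and how PDFBox actually resolves the value is not modelled.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only; names are hypothetical and this is not how PDFBox is implemented.
final class RadioValueClash {
    /** Returns the values that occur both as /Opt export values and as on-state names. */
    static Set<String> ambiguousValues(List<String> exportValues, List<String> onStateNames) {
        Set<String> clash = new LinkedHashSet<>(exportValues);
        clash.retainAll(onStateNames);
        return clash;
    }

    public static void main(String[] args) {
        List<String> onStates = List.of("0", "1");
        // Original PDF: "1" is both an export value and an on-state name.
        System.out.println(ambiguousValues(List.of("1", "2"), onStates));
        // Edited PDF: no overlap, so a lookup by value is unambiguous.
        System.out.println(ambiguousValues(List.of("A", "B"), onStates));
    }
}
```

With the original values the overlap is "1"; with "A" and "B" it is empty, which is consistent with the edited PDF working.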


So I'd say it's a PDFBox bug. The good thing is that the copyright of 
that PDF would be with a government institution (BAFzA) so we can use it 
as a test.


Are you the creator of that PDF?

Tilman





Re: Radio Button not set correctly

2024-05-27 Thread Tilman Hausherr
I'll have another look tomorrow when I'm more awake. I just looked and 
it happens like you wrote. I traced through the code and it seemed to 
work properly, i.e. going through different paths for "1" and "2" 
(looking for dictionary elements 0 and 1) but the result was always the 
same which contradicts the observation, but that is the fascination in 
debugging.


Tilman

On 27.05.2024 15:39, Martin Resch wrote:

Hi,

has anyone had the chance to look into my issue in the meantime?
Appreciate your support and thanks in advance!

Best regards
Martin


On 06.05.24 at 16:00, Martin Resch wrote:

Sorry, please find the PDF here: https://my.hidrive.com/lnk/WbVpc1Bde

I am using PDFBox 3.0.2.

Best Regards
Martin

On 06.05.24 at 15:55, sahy...@fileaffairs.de wrote:

Dear Martin,

On Monday, 06.05.2024 at 15:53 +0200, Martin Resch wrote:

sorry, PDF attached


Could you upload the PDF to a public location, as the mailing list
doesn't support attachments?

Which version of PDFBox are you using?

BR
Maruan



On 06.05.2024 15:49 CEST, Martin Resch wrote:


Hi,

I am reading a PDF, want to set one of the radio buttons included
in the PDF and save the PDF in a new file.

My PDF has a group of two radio buttons included. The field name is
Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]
Valid values are: [1, 2] and Off


This is my code:
String filename = "AU_Erklaerung_final.pdf";
PDDocument pdfDocument = Loader.loadPDF(new File(filename));
PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
acroForm.getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]").setValue("1");
// acroForm.getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]").setValue("2");
// ((PDRadioButton) acroForm.getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]")).setValue(0);
// ((PDRadioButton) acroForm.getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]")).setValue(1);
pdfDocument.save(filename + System.currentTimeMillis() + ".pdf");


I have tried:
pdfDocument.getDocumentCatalog().getAcroForm().getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]").setValue("1");
pdfDocument.getDocumentCatalog().getAcroForm().getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]").setValue("2");
// and
((PDRadioButton) pdfDocument.getDocumentCatalog().getAcroForm().getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]")).setValue(0);
((PDRadioButton) pdfDocument.getDocumentCatalog().getAcroForm().getField("Formular1[0].Seite1[0].TF_P[0].Optionsfeldliste[0]")).setValue(1);

Regardless of which value I set (either via string or int), it is
always the same radio button that is selected in my PDF (the lower one).


Can anybody help? Thanks a lot in advance!

Best regards
Martin




Re: question about pdfbox-app-3.0.2 and PDFToImage

2024-05-27 Thread Tilman Hausherr
It's only "render", not PDFToImage.

https://pdfbox.apache.org/3.0/commandline.html

Tilman 

-- Original Message --
From: Guillaume Favier 
Subject: question about pdfbox-app-3.0.2 and PDFToImage
Date: 27.05.2024, 14:36
To: users@pdfbox.apache.org

Hello,

I find the tool great, but I have trouble figuring out the correct syntax
for the PDFToImage option on the command line.
I have tried the following:
java -jar .\pdfbox-app-3.0.2.jar PDFToImage render -color=GRAY -i=""  -startPage=1 -prefix=txt -quality=1 -cropbox 100 100 100 100
and I keep getting the following message:
"Did you mean: pdfbox fromimage or pdfbox export:images or pdfbox import:fdf?"
I am obviously missing something, but googling has not helped me so far.
Any help would be very much appreciated.

G





Re: PDFBox bug report: PDDocument.load(inputFile) crashes when parsing malformed ItalicAngle

2024-05-25 Thread Tilman Hausherr

Hi,

Why do you think this is a bug, and what would you expect instead? 
Please upload your file to a sharehoster, attachments are blocked.


Tilman

On 25.05.2024 19:47, Lucky Python wrote:

I'd like to report a bug.

[Description]
PDDocument.load(inputFile) crashes when parsing malformed ItalicAngle. 
It is 100% reproducible with the attached PDF file as the 
inputFile parameter.


[PDFBox versions]
Reproduced with both PDFBox 2.0.27 & 2.0.31

[Stack trace]
Exception in thread "main" ..
Caused by: java.io.IOException: Error expected floating point number 
actual='-12.-1'
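
The token in the message has a second minus sign after the decimal point, so it is not a well-formed number. A minimal sketch using only the JDK (not PDFBox's own parser; the class name FloatTokenCheck is made up for illustration) shows that plain Java rejects it as well:

```java
public class FloatTokenCheck {

    /** Returns true if Float.parseFloat accepts the token. */
    static boolean isParsableFloat(String token) {
        try {
            Float.parseFloat(token);
            return true;
        } catch (NumberFormatException e) {
            // Thrown for tokens such as "-12.-1"
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("-12.-1 parsable: " + isParsableFloat("-12.-1"));
        System.out.println("-12.1 parsable: " + isParsableFloat("-12.1"));
    }
}
```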




Re: AW: Question about commit "PDFBOX-5660: add warning / exception, as suggested by mkl in SO 78307200"

2024-05-18 Thread Tilman Hausherr

Hi,

It's fixed now.
Thanks again

Tilman

On 17.05.2024 11:43, pascal.schumac...@t-systems.com wrote:

Yes. The commit was made on Apr 19, that should be after the latest releases. I 
only discovered it after building from trunk.

Thank you very much for the quick response!

Kind regards,
Pascal

-Original Message-
From: Tilman Hausherr 
Sent: Friday, May 17, 2024 11:37
To: users@pdfbox.apache.org
Subject: Re: Question about commit "PDFBOX-5660: add warning / exception, as 
suggested by mkl in SO 78307200"

I've created https://issues.apache.org/jira/browse/PDFBOX-5822 . This isn't in 
the released versions, correct? Then we have to thank you even more for 
discovering this.

Tilman





Re: Question about commit "PDFBOX-5660: add warning / exception, as suggested by mkl in SO 78307200"

2024-05-17 Thread Tilman Hausherr
I've created https://issues.apache.org/jira/browse/PDFBOX-5822 . This 
isn't in the released versions, correct? Then we have to thank you even 
more for discovering this.


Tilman





Re: Question about commit "PDFBOX-5660: add warning / exception, as suggested by mkl in SO 78307200"

2024-05-17 Thread Tilman Hausherr

Hi,

Thanks for finding this. Sadly there were no tests. I'll investigate.


Tilman

On 17.05.2024 10:25, pascal.schumac...@t-systems.com wrote:

Hi,

concerning commit: "PDFBOX-5660: add warning / exception, as suggested by mkl in SO 
78307200" 
(https://github.com/apache/pdfbox/commit/5c0abf94367c12c9ac0b464046784d456ce4caf5)

After this commit this code:

for (int pageNumber = 0; pageNumber < pdDocument.getNumberOfPages(); pageNumber++) {
    PDPage pdPage = pdDocument.getPage(pageNumber);
    ...
    String textForRegion = extractText(pdPage, rect);
}

private static String extractText(PDPage pdPage, Rectangle2D rect) throws IOException {
    String regionName = "rectangle";
    PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
    textStripper.setSortByPosition(true);
    textStripper.addRegion(regionName, rect);
    textStripper.extractRegions(pdPage);
    return textStripper.getTextForRegion(regionName);
}

Which worked with PDF Box 3 and trunk before this change now fails with:

java.lang.IllegalArgumentException: Parameter must be 1-based, but is 0
    at org.apache.pdfbox.text.PDFTextStripper.setStartPage(PDFTextStripper.java:956)
    at org.apache.pdfbox.text.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:117)
    ...

I believe 0 should still be allowed, or am I missing something?
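
For readers unfamiliar with the wording of that exception, here is a sketch of the kind of guard that would produce it and why a value of 0 trips it. This is not the PDFBox source; the class PageRange and both method names are made up for illustration, and only the message text is taken from the stack trace above.

```java
public class PageRange {

    /** Rejects anything below 1, using the same wording as the exception above. */
    static int requireOneBased(int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException(
                    "Parameter must be 1-based, but is " + pageNumber);
        }
        return pageNumber;
    }

    /** Converts a 0-based loop index into a 1-based page number. */
    static int toPageNumber(int zeroBasedIndex) {
        return requireOneBased(zeroBasedIndex + 1);
    }

    public static void main(String[] args) {
        System.out.println("index 0 is page " + toPageNumber(0));
        try {
            requireOneBased(0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```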

Thanks and kind regards,
Pascal

By the way: Thank you very much for providing PDFBox.




Re: Error Log related question

2024-05-15 Thread Tilman Hausherr

Hi,

Isn't it possible to do this with commons-logging or with the actual 
logging that you're using?


If you tell what error messages you get we may be able to tell you what 
the problem is.
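
As a rough idea of what "doing this with the logging" could look like: the sketch below assumes that commons-logging ends up delegating to java.util.logging in your setup and that the library's loggers live under the name "org.apache.pdfbox". Both are assumptions to verify, and CollectingHandler is a made-up class, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class CollectingHandler extends Handler {

    private final List<LogRecord> records = new ArrayList<>();

    @Override
    public synchronized void publish(LogRecord record) {
        // Keep only warnings and errors; ignore finer-grained output.
        if (record != null && record.getLevel().intValue() >= Level.WARNING.intValue()) {
            records.add(record);
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered outside the list.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    /** Returns a copy of everything collected so far. */
    public synchronized List<LogRecord> getRecords() {
        return new ArrayList<>(records);
    }

    public static void main(String[] args) {
        // Assumption: the library's loggers are children of this logger name.
        Logger logger = Logger.getLogger("org.apache.pdfbox");
        CollectingHandler handler = new CollectingHandler();
        logger.addHandler(handler);
        try {
            logger.warning("example warning"); // stands in for a library message
            logger.info("example info");       // ignored by the handler
        } finally {
            logger.removeHandler(handler);
        }
        System.out.println(handler.getRecords().size() + " problem(s) recorded");
    }
}
```

Attaching the handler only around one conversion, as in main, keeps the summary scoped to that conversion.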


Tilman

On 15.05.2024 15:28, Tony Pilote wrote:

Hello,

We have been using PDFBox and have a question regarding error logging. We 
have cases where the conversion is successful but we still see error logs that 
indicate to us that the generated file may have missing parts.

Have you considered, or would you consider (if we were to push a PR), providing 
these errors as a “summary”, or letting us register an error handler so we can 
track and report them rather than only being able to observe them in the logs?

Thank you in advance for your time,

Tony







Re: Question Regarding Removal of PositionWrapper and TextNormalize Classes

2024-05-14 Thread Tilman Hausherr

Hi,

I don't know. You should find out what this does in your code, create a 
test, and find out if / why the test fails when using the 2.0 stripper, 
and whether copying the source code of that class and using it in your own 
postprocessing helps.


Tilman

On 14.05.2024 10:21, Krishna Shankar wrote:

Hi ,
Is there any alternative to TextNormalize?
We have been using it in our code for a long time, and we are upgrading to
PDFBox 2.0.30.

Thanks,
Krishna

On Tue, 14 May 2024 at 12:42 PM, Tilman Hausherr 
wrote:


Hi,

PositionWrapper is still in the code. However it's private. This and the
removal of TextNormalize was done 10 years ago in
https://issues.apache.org/jira/browse/PDFBOX-2384 before 2.0.0, so you're
a bit late.

Nobody can tell you what to do because we don't know why you needed it.

Tilman

On 14.05.2024 07:24, Krishna Shankar wrote:

Dear Apache PDFBox Users,

I hope this email finds you well.

I have been using Apache PDFBox for quite long, and I recently noticed that
the PositionWrapper and TextNormalize classes were removed after version
2.0.0.

I am curious about the rationale behind this decision and would appreciate
any insights or explanations regarding the removal of these classes.
Understanding the reasoning behind such changes would greatly help me in
adapting my code and workflows accordingly.

Please mention if there are any alternatives for that.

Thank you very much for your time and assistance. I look forward to hearing
from you and continuing to be a part of the Apache PDFBox community.

Best regards, Krishna






Re: Question Regarding Removal of PositionWrapper and TextNormalize Classes

2024-05-14 Thread Tilman Hausherr

Hi,

PositionWrapper is still in the code. However it's private. This and the 
removal of TextNormalize was done 10 years ago in
https://issues.apache.org/jira/browse/PDFBOX-2384 before 2.0.0, so you're a bit 
late.

Nobody can tell you what to do because we don't know why you needed it.

Tilman

On 14.05.2024 07:24, Krishna Shankar wrote:

Dear Apache PDFBox Users,

I hope this email finds you well.

I have been using Apache PDFBox for quite long, and I recently noticed that
the PositionWrapper and TextNormalize classes were removed after version
2.0.0.

I am curious about the rationale behind this decision and would appreciate
any insights or explanations regarding the removal of these classes.
Understanding the reasoning behind such changes would greatly help me in
adapting my code and workflows accordingly.

Please mention if there are any alternatives for that.

Thank you very much for your time and assistance. I look forward to hearing
from you and continuing to be a part of the Apache PDFBox community.

Best regards, Krishna







Re: IllegalStateException are thrown by surrogate pair character.

2024-05-07 Thread Tilman Hausherr

Hello Toshiaki,

Thank you, I've committed that code as well.

Tilman


On 07.05.2024 16:15, Toshiaki Ito wrote:

Hi,

Additional suggestions.


throw new IllegalStateException(
"could not find the glyphId for the character: " + codePoint);

This part, before the fix, was outputting the character that caused the error.
After the fix, however, the code point value was output, making it
difficult to understand the cause.
Therefore, we made a change to get the actual character from the code
point and output it.
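
The idea can be tried on its own with the JDK alone. The sketch below follows the same three branches as the patch further down; the class and method names (CodePointText, describe) are made up for illustration, and Character.toChars is used here in place of the int-array String constructor.

```java
public class CodePointText {

    /** Turns a code point into text that can be shown in an error message. */
    static String describe(int codePoint) {
        if (Character.isBmpCodePoint(codePoint)) {
            // Fits into a single char.
            return String.valueOf((char) codePoint);
        }
        if (Character.isValidCodePoint(codePoint)) {
            // Supplementary code point: needs a surrogate pair.
            return new String(Character.toChars(codePoint));
        }
        // Outside the Unicode range.
        return "?";
    }

    public static void main(String[] args) {
        System.out.println(describe(0x3042));   // hiragana A, in the BMP
        System.out.println(describe(0x29E3D));  // supplementary kanji
        System.out.println(describe(0x110000)); // not a valid code point
    }
}
```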

I also created a test (assumed to be added to TestFontEmbedding.java).
LiberationSans-Regular.ttf does not contain Japanese characters, and
we are checking for exceptions and output of expected messages.


"あ" -> Character.isBmpCodePoint() == true
"𩸽" -> Character.isValidCodePoint() == true


 update code  PDAbstractContentStream.java  applyGSUBRules 

 int glyphId = cmapLookup.getGlyphId(codePoint);
 if (glyphId <= 0)
 {
     String source;
     if (Character.isBmpCodePoint(codePoint))
     {
         source = String.valueOf((char) codePoint);
     }
     else if (Character.isValidCodePoint(codePoint))
     {
         source = new String(new int[]{codePoint}, 0, 1);
     }
     else
     {
         source = "?";
     }
     throw new IllegalStateException(
             "could not find the glyphId for the character: " + source);
 }
 originalGlyphIds.add(glyphId);


 Unit Test 

 @Test
 void testSurrogatePairCharacterExceptionIsBmpCodePoint() throws IOException
 {
     final String message = "あ";

     try (PDDocument doc = new PDDocument())
     {
         PDPage page = new PDPage();
         doc.addPage(page);
         PDFont font = PDType0Font.load(doc,
                 this.getClass().getResourceAsStream("/org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"));

         try (PDPageContentStream contents = new PDPageContentStream(doc, page))
         {
             contents.beginText();
             contents.setFont(font, 64);
             contents.newLineAtOffset(100, 700);
             contents.showText(message);
             contents.endText();
         }

         fail();
     }
     catch (IllegalStateException e)
     {
         assertEquals("could not find the glyphId for the character: あ", e.getMessage());
     }
     catch (Exception e)
     {
         fail();
     }
 }

 @Test
 void testSurrogatePairCharacterExceptionIsValidCodePoint() throws IOException
 {
     final String message = "𩸽";
     try (PDDocument doc = new PDDocument())
     {
         PDPage page = new PDPage();
         doc.addPage(page);
         PDFont font = PDType0Font.load(doc,
                 this.getClass().getResourceAsStream("/org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"));

         try (PDPageContentStream contents = new PDPageContentStream(doc, page))
         {
             contents.beginText();
             contents.setFont(font, 64);
             contents.newLineAtOffset(100, 700);
             contents.showText(message);
             contents.endText();
         }

         fail();
     }
     catch (IllegalStateException e)
     {
         assertEquals("could not find the glyphId for the character: 𩸽", e.getMessage());
     }
     catch (Exception e)
     {
         fail();
     }
 }

On Sun, May 5, 2024 at 18:00, Toshiaki Ito wrote:

Hi, Tilman.

I used the snapshot "3.0.3-20240505.072852-59" and got the expected results!
I also tried a few other Kanji characters besides "𩸽" and none of
them had any problems!

I am glad I could contribute :)

On Sun, May 5, 2024 at 16:32, Tilman Hausherr wrote:

Hello Toshiaki,

It's been committed and available as a snapshot:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.3-SNAPSHOT/

I've also added a test for the 2.0 version so that we don't break this in
the future.

Thanks again
Tilman

On 04.05.2024 22:06, Toshiaki Ito wrote:

Hi, Tilman.

Thank you for checking and correcting the attached code.
I look forward to it being committed!

On Sun, May 5, 2024 at 2:05, Tilman Hausherr wrote:

Hello,

I can confirm that your proposed change works, it also passes the
"private" tests that aren't in the repository. Thank you so much for
solving this! I'll commit these soon (probably tomorrow) and will report
it here. Another (smaller) piece of good news is that one of the fonts we
use for tests (ipafont) has the glyph, I have prepared a small test also
based on your code.

Re: Radio Button not set correctly

2024-05-06 Thread Tilman Hausherr

On 06.05.2024 15:53, Martin Resch wrote:

sorry, PDF attached


You need to upload to a sharehoster

Tilman


Re: IllegalStateException are thrown by surrogate pair character.

2024-05-05 Thread Tilman Hausherr

Hello Toshiaki,

It's been committed and available as a snapshot:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.3-SNAPSHOT/

I've also added a test for the 2.0 version so that we don't break this in 
the future.


Thanks again
Tilman

On 04.05.2024 22:06, Toshiaki Ito wrote:

Hi, Tilman.

Thank you for checking and correcting the attached code.
I look forward to it being committed!

On Sun, May 5, 2024 at 2:05, Tilman Hausherr wrote:

Hello,

I can confirm that your proposed change works, it also passes the
"private" tests that aren't in the repository. Thank you so much for
solving this! I'll commit these soon (probably tomorrow) and will report
it here. Another (smaller) piece of good news is that one of the fonts we
use for tests (ipafont) has the glyph, I have prepared a small test also
based on your code.

Tilman

On 04.05.2024 16:39, Tilman Hausherr wrote:

On 04.05.2024 15:21, Toshiaki Ito wrote:

By the way, with PDFBox 2.0.31, the same code produces the expected
output.

Ouch, I can confirm that. I have created a new ticket:

https://issues.apache.org/jira/browse/PDFBOX-5812

Tilman








Re: IllegalStateException are thrown by surrogate pair character.

2024-05-04 Thread Tilman Hausherr

Hello,

I can confirm that your proposed change works, it also passes the 
"private" tests that aren't in the repository. Thank you so much for 
solving this! I'll commit these soon (probably tomorrow) and will report 
it here. Another (smaller) piece of good news is that one of the fonts we 
use for tests (ipafont) has the glyph, I have prepared a small test also 
based on your code.


Tilman

On 04.05.2024 16:39, Tilman Hausherr wrote:

On 04.05.2024 15:21, Toshiaki Ito wrote:
By the way, with PDFBox 2.0.31, the same code produces the expected 
output.


Ouch, I can confirm that. I have created a new ticket:

https://issues.apache.org/jira/browse/PDFBOX-5812

Tilman







Re: IllegalStateException are thrown by surrogate pair character.

2024-05-04 Thread Tilman Hausherr

On 04.05.2024 15:21, Toshiaki Ito wrote:

By the way, with PDFBox 2.0.31, the same code produces the expected output.


Ouch, I can confirm that. I have created a new ticket:

https://issues.apache.org/jira/browse/PDFBOX-5812

Tilman


Re: IllegalStateException are thrown by surrogate pair character.

2024-05-04 Thread Tilman Hausherr

Hi,

Is it this one? 𩸽

According to my understanding of 
https://www.compart.com/en/unicode/U+29E3D you should use \u29E3D  or 𩸽 
directly. However I tried this with your font and with MingLiU and MS 
Mincho and it didn't work either. Is this a very standard glyph? Or 
something unusual? So I don't know if this is a bug on our side, missing 
feature or a different problem.


Tilman

On 04.05.2024 07:21, 伊東寿晃 wrote:

Hi,

In pdfbox 3.0, an IllegalStateException occurs when trying to output
surrogate pair characters.
According to the exception, it seems that one Kanji character is
processed as two chars.

Is this a bug?
Is there any possible workaround on the program side?
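
To make the "one character, two chars" point concrete without PDFBox, here is a minimal sketch using only the JDK (the class name SurrogatePairDemo is made up for illustration). It uses the same string as the test code below.

```java
public class SurrogatePairDemo {

    // U+29E3D written as a UTF-16 surrogate pair.
    static final String MESSAGE = "\uD867\uDE3D";

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars: " + charCount(MESSAGE));
        System.out.println("code points: " + codePointCount(MESSAGE));
        // Iterating by code point never splits a surrogate pair in two.
        MESSAGE.codePoints().forEach(cp ->
                System.out.println("U+" + Integer.toHexString(cp).toUpperCase()));
    }
}
```

Code that walks the string char by char sees two separate surrogates here, whereas code that walks it by code point sees one value.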


 Conditions 
JDK: 21
PDFBox: 3.0.0 / 3.0.1 / 3.0.2
Font: Noto Sans Japanese (https://fonts.google.com/noto/specimen/Noto+Sans+JP)
Font and glyph preview :
https://fonts.google.com/noto/specimen/Noto+Sans+JP?preview.text=%F0%A9%B8%BD

 Test code 
public static void main(String[] args) throws IOException {

    final String fontPath = "NotoSansJP-Regular.ttf";
    final String out = "output.pdf";

    // Atka Mackerel in Japanese kanji. (surrogate pair)
    final String message = "\uD867\uDE3D";

    try (PDDocument doc = new PDDocument()) {
        PDPage page = new PDPage();
        doc.addPage(page);
        PDFont font = PDType0Font.load(doc, new File(fontPath));

        try (PDPageContentStream contents = new PDPageContentStream(doc, page)) {
            contents.beginText();
            contents.setFont(font, 64);
            contents.newLineAtOffset(100, 700);
            contents.showText(message);
            contents.endText();
        }

        doc.save(out);
        System.out.println(out + " created!");
    }
}


 StackTrace 
Exception in thread "main" java.lang.IllegalStateException: could not find the glyphId for the character: ?
    at org.apache.pdfbox.pdmodel.PDAbstractContentStream.applyGSUBRules(PDAbstractContentStream.java:1651)
    at org.apache.pdfbox.pdmodel.PDAbstractContentStream.encodeForGsub(PDAbstractContentStream.java:1632)
    at org.apache.pdfbox.pdmodel.PDAbstractContentStream.showTextInternal(PDAbstractContentStream.java:302)
    at org.apache.pdfbox.pdmodel.PDAbstractContentStream.showText(PDAbstractContentStream.java:266)
    at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:37)
    at org.example.App.main(App.java:30)



My English isn't so good so feel free to ask me if there is anything unclear.

--
Toshiaki Ito
Mail:evolut...@1024kb.cx




Re: possible regression in PDFBox 3.0.2

2024-05-03 Thread Tilman Hausherr

On 03.05.2024 10:40, Kai Keggenhoff wrote:


Since we switched to 3.0.2 (from 3.0.0, we skipped 3.0.1) we 
encountered several PDFs which produce an IOException when saved :



Please try with a 3.0.3 snapshot

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.3-SNAPSHOT/


Tilman


Re: Problem finding an AcroForm field

2024-05-02 Thread Tilman Hausherr

It's a radio button but without the radio flag?!

Tilman

On 02.05.2024 12:42, Ulf Dittmer wrote:

Yes, that's the one for the "pro Stunde" option. But the one for the "pro
Monat" option is missing.

They're both connected, in that checking one manually will uncheck the
other. But setting *any* value programmatically only causes the first one
to be set.

Ulf

On Thu, May 2, 2024 at 12:38 PM sahy...@fileaffairs.de <
sahy...@fileaffairs.de> wrote:


That's chbx_46_Arbeitsentgelt

BR
Maruan

On Thursday, 02.05.2024 at 12:15 +0200, Ulf Dittmer wrote:

Sorry, I didn't realize that. It's a form from a German government
agency,
and can be found at



https://www.arbeitsagentur.de/datei/erklaerung-zum-beschaeftigungsverhaeltnis_ba047549.pdf

Ulf

On Thu, May 2, 2024 at 12:05 PM sahy...@fileaffairs.de <
sahy...@fileaffairs.de> wrote:


Hi,

Can you upload the PDF in question to a public location so that we can take a
look? Attachments won't work for the mailing list.

BR
Maruan

On Thursday, 02.05.2024 at 12:01 +0200, Ulf Dittmer wrote:

Hi-

I'm running the PrintFields example code
(https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/interactive/form/PrintFields.java)
to find all the form field names for a PDF, but it's missing a checkbox
that I'd need to set.

The checkbox in question is on page 5, no. 46 "pro Monat". The "pro
Stunde" checkbox is there, as are the two text fields.

The relevant output of PrintFields is

txtf_46_Entgelt_pro_Stunde
--txtf_46_Entgelt_pro_Stunde.txtf_46_Entgelt_pro_Stunde = ,
  type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField
  alternate name=46 - Höhe und Berechnungsart des Arbeitsentgelts - Entgelt pro Stunde (brutto in Euro), mapping name=null
  flags=8388608, isNoExport=false, isReadOnly=false, isRequired=false
chbx_46_Arbeitsentgelt
--chbx_46_Arbeitsentgelt.chbx_46_Arbeitsentgelt = Off,
  type=org.apache.pdfbox.pdmodel.interactive.form.PDCheckBox
  alternate name=46 - Höhe und Berechnungsart des Arbeitsentgelts, mapping name=null
  flags=0, isNoExport=false, isReadOnly=false, isRequired=false
txtf_46_Entgelt_pro_Monat
--txtf_46_Entgelt_pro_Monat.txtf_46_Entgelt_pro_Monat = ,
  type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField
  alternate name=46 - Höhe und Berechnungsart des Arbeitsentgelts - Entgelt pro Monat (brutto in Euro), mapping name=null
  flags=8388608, isNoExport=false, isReadOnly=false, isRequired=false

Is the PDF broken in some way, or am I missing something?

Let me know if I can supply any further information. I'd be
thankful
for any additional information.

Ulf




Re: How to remove an image resource from a PDF form

2024-04-26 Thread Tilman Hausherr
Re "bigger each time it's opened and then saved", hard to tell without 
the PDF. Maybe the font you're replacing is used elsewhere due to other 
code that used this font.


If you can't share the PDF, look at it with PDFDebugger or with 
NOTEPAD++. In PDFDebugger don't forget to switch between the page view 
and the internal structure view (in the view menu)


Tilman

On 26.04.2024 15:50, Jurgen Doll wrote:

Hi Tilman

My bad, I was working on the image code and just assumed that it was 
the culprit.
It turns out that I must be embedding fonts incorrectly. I use this 
code each time I open the PDF:

    var resources = Objects.requireNonNullElseGet( pdForm.getDefaultResources(), PDResources::new );
    var lucidaFont = PDType0Font.load( pdfDoc, LUCIDA_UNICODE_FILE.get() );
    var cosName = COSName.getPDFName( lucidaFont.getName() );

    resources.put( cosName, lucidaFont );
    pdForm.setDefaultResources( resources );

The above seems to make the PDF get bigger each time it's opened and 
then saved.

So I've added a check first, like this:

    if ( resources.getFont( cosName ) == null )
    {
        resources.put( cosName, lucidaFont );
        pdForm.setDefaultResources( resources );
    }

And now the PDF doesn't seem to be growing any more.

Thanks,
Jurgen



On Fri, 26 Apr 2024 14:58:26 +0200, Tilman Hausherr 
 wrote:



Do you save directly or incrementally?

If directly then the old one should be gone. If not, please share the 
PDF (upload to sharehoster) and tell us why you think it's still there.


Tilman

On 26.04.2024 12:37, Jurgen Doll wrote:

Hi

I would like to know how to remove an image resource from a PDF form.

I use the following code to set an image object on a field:

    private static void setAppearance( PDDocument pdfDoc, PDField field, PDImageXObject image )
    {
        var widget = field.getWidgets().get(0);
        var rect = widget.getRectangle();

        var formObj = new PDFormXObject( new PDStream( pdfDoc ) );
        formObj.setResources( new PDResources() );
        formObj.setBBox( new PDRectangle
        (
            // The TransformationMatrix below enlarges the image slightly,
            // so adjust accordingly here otherwise the image is cropped.
            rect.getLowerLeftX(), rect.getLowerLeftY() - 10,
            rect.getWidth(), rect.getHeight() + 20
        ));
        formObj.setFormType(1);

        var appearanceStream = new PDAppearanceStream( formObj.getCOSObject() );
        var pageContentStream = new PDPageContentStream( pdfDoc, appearanceStream );
        pageContentStream.drawImage( image, getTransformationMatrix( image, widget ) );
        pageContentStream.close();

        var appearance = new PDAppearanceDictionary();
        appearance.setNormalAppearance( appearanceStream );
        appearance.getCOSObject().setDirect( true );
        widget.setAppearance( appearance );
    }

To remove the image I do "widget.setAppearance( null )" or to 
replace the image I use the above method to set the new image as the 
appearance. However I've noticed that this doesn't remove the 
previous image object/resource from the PDF. How do I go about 
removing it ?


Thanks, regards
Jurgen




Re: How to remove an image resource from a PDF form

2024-04-26 Thread Tilman Hausherr

Do you save directly or incrementally?

If directly then the old one should be gone. If not, please share the 
PDF (upload to sharehoster) and tell us why you think it's still there.


Tilman

On 26.04.2024 12:37, Jurgen Doll wrote:

Hi

I would like to know how to remove an image resource from a PDF form.

I use the following code to set an image object on a field:

    private static void setAppearance( PDDocument pdfDoc, PDField field, PDImageXObject image )
    {
        var widget = field.getWidgets().get(0);
        var rect = widget.getRectangle();

        var formObj = new PDFormXObject( new PDStream( pdfDoc ) );
        formObj.setResources( new PDResources() );
        formObj.setBBox( new PDRectangle
        (
            // The TransformationMatrix below enlarges the image slightly,
            // so adjust accordingly here otherwise the image is cropped.
            rect.getLowerLeftX(), rect.getLowerLeftY() - 10,
            rect.getWidth(), rect.getHeight() + 20
        ));
        formObj.setFormType(1);

        var appearanceStream = new PDAppearanceStream( formObj.getCOSObject() );
        var pageContentStream = new PDPageContentStream( pdfDoc, appearanceStream );
        pageContentStream.drawImage( image, getTransformationMatrix( image, widget ) );
        pageContentStream.close();

        var appearance = new PDAppearanceDictionary();
        appearance.setNormalAppearance( appearanceStream );
        appearance.getCOSObject().setDirect( true );
        widget.setAppearance( appearance );
    }

To remove the image I do "widget.setAppearance( null )" or to replace 
the image I use the above method to set the new image as the 
appearance. However I've noticed that this doesn't remove the previous 
image object/resource from the PDF. How do I go about removing it ?


Thanks, regards
Jurgen




Re: Performance advice

2024-04-06 Thread Tilman Hausherr
Is the image already compressed, e.g. PNG, JPEG and b/w TIFF? Then use 
the image directly because PDFBox can use these formats without doing a 
compression, if you use the static methods from PDImageXObject.


Or is the image in memory, or from a different format (e.g. color CCITT, 
GIF)? Then you'd save the compression time by creating a PDF that has 
the image in compressed form.


Tilman

On 19.03.2024 15:46, Nicola Farina wrote:

Hi

I am using PDFBOX 2.0.30.
I need to build a sort of "filled" pdf starting from a template.
At the moment I've chosen to start with a "background" PDF and then
use PDFBOX to write on it
(see the attached examples).
The empty template pdf is basically a background image imported into a PDF.
I now wonder if it could be more efficient to start creating a new,
empty, PDF and then importing the image into it, and then write text
above all.
This application should be as fast as possible.

thanks for any tips/ideas
Nicola





Re: Blank page generation issue - Unknown code in Huffman RLE stream

2024-04-04 Thread Tilman Hausherr
2.0.19 is 4 years old, why are you using it? Please retry with 2.0.31. I 
tried and your file works, despite that it is broken.


Tilman

On 04.04.2024 06:13, Himanshu Jain wrote:

Hello Team,

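We are using PDFBox to generate images of each page of a PDF.
While generating the images we get the warning 
"org.apache.pdfbox.contentstream.PDFStreamEngine:  Unknown code in 
Huffman RLE stream" for some of the pages, and a blank image is 
generated for them.
Could you please help us? The attached PDF reproduces the issue. 
We are using the following PDFBox dependencies:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.19</version>
</dependency>

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.19</version>
</dependency>

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>jbig2-imageio</artifactId>
    <version>3.0.3</version>
</dependency>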

Thanks,
Himanshu

SLB-Private






Re: Text extraction from a certain PDF does not seem to terminate

2024-04-03 Thread Tilman Hausherr
The document has been extracted while I had dinner, so there is no 
endless loop. I've created https://issues.apache.org/jira/browse/PDFBOX-5799


Tilman

On 03.04.2024 18:12, Tilman Hausherr wrote:

Rendering page 230 with PDFBox 2.0: 50 seconds

Rendering page 230 with PDFBox trunk: 2990 seconds

Rendering page 231 with PDFBox trunk: 4798 seconds

while I write this, page 230 has been extracted, it is now working on 
page 231


Tilman





Re: Text extraction from a certain PDF does not seem to terminate

2024-04-03 Thread Tilman Hausherr

Rendering page 230 with PDFBox 2.0: 50 seconds

Rendering page 230 with PDFBox trunk: 2990 seconds

Rendering page 231 with PDFBox trunk: 4798 seconds

while I write this, page 230 has been extracted, it is now working on 
page 231


Tilman





Re: Not able to to watermark on PDFs with PDF version 1.7

2024-04-03 Thread Tilman Hausherr

Hi,

Please use the 5-parameter constructor of PDPageContentStream.
If it still doesn't work, please share the file and the result (upload 
to sharehoster).


There is no version 2.3.1, maybe you meant 2.0.31?

Tilman

On 03.04.2024 09:26, Palaniappan RM wrote:

Hi team,
  I am using *pdfbox* version 2.3.1 to watermark PDFs. It works for 
PDF version 1.4 (adding a logo followed by text), but for PDF version 
1.7 no watermark is getting added, the program ends successfully 
without any error/warning message displayed. Can you please help on this?



Regards,
Palaniappan RM




Re: Type 0 font - Text extraction X PDF Debugger

2024-03-25 Thread Tilman Hausherr

On 25.03.2024 07:48, Andreas Lehmkühler wrote:

Thanks for the URLs. All of them are working with my change.

See https://issues.apache.org/jira/browse/PDFBOX-5790 for further 
details.


@Tilman Please run your tests if possible


No regressions 

Tilman





Andreas

On 24.03.24 at 16:39, Tilman Hausherr wrote:

Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP 

https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D 



The extension p1 / p3 means I split these files and used only one 
page for my own tests.


Tilman


On 24.03.2024 16:19, Andreas Lehmkühler wrote:



On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And 
some of the other tools you tried also mention this)


Just adding true results in text extraction of several files no 
longer being correct, 433525-p1.pdf 
O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf PDFBOX-5540.pdf 
R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf
I've found a solution which works with the provided PDF and with 
PDFBOX-5540.pdf.


@Tilman I guess the other files are from our test corpus? If so, 
where exactly can I find them?


Andreas



Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" 
brings no regressions but your text is not extracted properly.


Maybe it is possible to include yet another rule for your file, but 
there's likely more to do and there is the risk that other files no 
longer extract properly.


Tilman

Re: Type 0 font - Text extraction X PDF Debugger

2024-03-24 Thread Tilman Hausherr

Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D

The extension p1 / p3 means I split these files and used only one page 
for my own tests.


Tilman


On 24.03.2024 16:19, Andreas Lehmkühler wrote:



On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And some 
of the other tools you tried also mention this)


Just adding true results in text extraction of several files no 
longer being correct, 433525-p1.pdf 
O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf PDFBOX-5540.pdf 
R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf
I've found a solution which works with the provided PDF and with 
PDFBOX-5540.pdf.


@Tilman I guess the other files are from our test corpus? If so, 
where exactly can I find them?


Andreas



Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" 
brings no regressions but your text is not extracted properly.


Maybe it is possible to include yet another rule for your file, but 
there's likely more to do and there is the risk that other files no 
longer extract properly.


Tilman

Re: split a password protected file

2024-03-21 Thread Tilman Hausherr

On 21.03.2024 18:59, Robert Rodini wrote:

Does this mean that splitting a password protected PDF effectively disables 
password protection?



On the result files, yes. I've never thought about it. To fix this, we'd 
need the user and the owner password (only one of the two is needed to 
decrypt).


Tilman





Re: split a password protected file

2024-03-20 Thread Tilman Hausherr

On 20.03.2024 16:24, Robert Rodini wrote:

Can PDFSplit split up a password-protected file? It seems that it cannot, but 
there is no error message.
P.S. I am using v. 2.x of PDFBox.  I will upgrade soon.


According to the usage, it should be able to (although it won't encrypt 
when saving):


Usage: java -jar pdfbox-app-x.y.z.jar PDFSplit [options] 

Options:
  -password    : Password to decrypt document
  -split    : split after this many pages (default 1, if 
startPage and endPage are unset)

  -startPage    : start page
  -endPage      : end page
  -outputPrefix  : Filename prefix for split files
      : The PDF document to use

In 3.0:

Usage: pdfbox pdfsplit [-hV] [-password[=]] [-endPage=]
   -i= [-outputPrefix=]
   [-split=] [-startPage=]
  -endPage=   end page.
  -h, --help   Show this help message and exit.
  -i, --input= the PDF file to split
  -outputPrefix=
   the filename prefix for split files.
  -password[=]
   the password to decrypt the document.
  -split=   split after this many pages (default 1, if startPage
   and endPage are unset).
  -startPage=
   start page.
  -V, --version    Print version information and exit.





Re: Flatten using PDFBOX3

2024-03-19 Thread Tilman Hausherr

Hi,

If this happened with 3.0.0 or 3.0.1 please retry with 3.0.2. If not, 
then please find a non confidential file where that happens. Also make 
sure that src and dst are different files.
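
On that last point, here is a minimal sketch of a pre-flight check that uses only the JDK. The class and method names are made up for illustration and are not part of PDFBox. It reports whether dst already exists and resolves to the same file as src, in which case saving would overwrite the file that is still being read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class SameFileGuard
{
    private SameFileGuard()
    {
    }

    // Returns true if dst already exists and refers to the same file as src.
    // This also catches two spellings of one path, e.g. "a.pdf" and "./a.pdf".
    static boolean wouldOverwriteSource(String src, String dst)
    {
        Path srcPath = Paths.get(src);
        Path dstPath = Paths.get(dst);
        try
        {
            return Files.exists(dstPath) && Files.isSameFile(srcPath, dstPath);
        }
        catch (IOException e)
        {
            // e.g. src does not exist; surface it without a checked exception
            throw new UncheckedIOException(e);
        }
    }
}
```

One way to use it would be to call it at the top of flattenPDF and throw an IllegalArgumentException when it returns true.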


Tilman

On 19.03.2024 15:26, Frédéric Ravetier wrote:

Hello,

I am trying to Flatten a PDF using PDFBox3 by doing :

private static void flattenPDF(String src, String dst) throws IOException {
    // try-with-resources closes the document even if flatten() or save() throws
    try (PDDocument doc = Loader.loadPDF(new RandomAccessReadBufferedFile(src))) {
        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        PDAcroForm acroForm = catalog.getAcroForm();
        if (acroForm == null) {
            logger.debug("This document does not contain any form, nothing to do...");
        } else {
            acroForm.setNeedAppearances(false);
            acroForm.flatten(); // Flatten using pdfbox3
        }
        doc.save(dst);
    }
}

It works but it creates in some cases a document that is not readable using
PDFBox 2 where I get this error:
java.io.IOException: COSObject{525, 0} cannot be assigned to offset
1528842, this belongs to COSObject{4196, 0}

at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:736)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:231)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1233)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1133)

With the following code :

System.out.printf("\n*\n* demo.pdf\n*\n");
try (
 InputStream resource =
getClass().getResourceAsStream("/mkl/testarea/pdfbox2/extract/bad-annot-1.pdf")
) {
 //OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "bad-pdf-sign.pdf"));
 PDDocument pdDocument = PDDocument.load(resource);
 System.out.printf("Producer of document : %s\n",
pdDocument.getDocumentInformation().getProducer());
 AccessPermission accessPermission = 
pdDocument.getCurrentAccessPermission();
 if (accessPermission.isReadOnly()) {
 System.out.printf("The document cannot be modified (read-only)");
 }

 if (!accessPermission.canModify()) {
 System.out.printf("Cannot modify the document");
 }

 if (!accessPermission.canModifyAnnotations()) {
 System.out.printf("Cannot modify the annotation");
 }

 if (!accessPermission.canFillInForm()) {
 System.out.printf("Cannot fill in form");
 }

}


Do you have any idea why?

I can not share the document (confidential) :(

Best regards,
Fred







Re: Help with NullPointerException org.apache.io.IOUtils.LOG

2024-03-15 Thread Tilman Hausherr

Searching for the error message I found this in a comment:

https://stackoverflow.com/questions/69151291/java-16-modularisation-illegalaccessexception-java-nio-spring-boot

--add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal.ref=ALL-UNNAMED
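
To see whether such a flag has actually reached the JVM, a small runtime check can help. This is only a sketch, assuming Java 9 or later; the class name NioOpenCheck is made up for illustration. It asks the java.base module whether the java.nio package is opened to the module of the calling class, which in a typical application server deployment is an unnamed module.

```java
import java.nio.ByteBuffer;

// Hypothetical diagnostic helper, requires Java 9+ (java.lang.Module).
final class NioOpenCheck
{
    private NioOpenCheck()
    {
    }

    // Returns true if java.base opens the java.nio package to the module
    // that the given class belongs to (deep reflection is then permitted).
    static boolean isJavaNioOpenTo(Class<?> caller)
    {
        Module javaBase = ByteBuffer.class.getModule();
        return javaBase.isOpen("java.nio", caller.getModule());
    }

    public static void main(String[] args)
    {
        System.out.println("java.nio opened to this code: "
                + isJavaNioOpenTo(NioOpenCheck.class));
    }
}
```

The result depends on the JDK version and on the JVM options in effect, so no fixed output is shown here.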



Tilman

On 15.03.2024 18:48, Matthew Hardy wrote:

Hi Andreas,

I've upgraded to pdfbox 3.0.2, and I'm no longer getting the 
ExceptionInInitializerError when instantiating an empty PDDocument. However, I'm 
now receiving this error message:

ERROR [org.apache.pdfbox.io.IOUtils] (EE-ManagedExecutorService-default-Thread-1) 
Unmapping is not supported.: java.lang.reflect.InaccessibleObjectException: Unable to 
make public jdk.internal.ref.Cleaner java.nio.DirectByteBuffer.cleaner() accessible: 
module java.base does not "opens java.nio" to unnamed module @18f5234c

The PDDocument still instantiates, and I'm able to use it, but I'm concerned 
about this error message.

Matt Hardy
Software Developer
Perform Air International
463 South Hamilton Court
Gilbert, Arizona 85233
Phone: (480) 610-3500
Fax: (480) 610-3501
matt.ha...@performair.com
www.PerformAir.com

-Original Message-
From: Andreas Lehmkühler  
Sent: Tuesday, March 12, 2024 9:50 AM

To:users@pdfbox.apache.org
Subject: Re: Help with NullPointerException org.apache.io.IOUtils.LOG

Hi Matthew,

this is a known issue with 3.0.1, see [1] for further details.

The upcoming version 3.0.2 includes a fix. Unless something unforeseen happens, 
the new version will be available in about 2 days from now.

Andreas

[1]https://issues.apache.org/jira/browse/PDFBOX-5758


On 12.03.24 at 17:40, Matthew Hardy wrote:

Hello,

We've recently upgraded to pdfbox 3.0.1. When attempting to instantiate an 
empty PDDocument, we receive the following error.

Caused by: java.lang.NullPointerException: Cannot invoke 
"org.apache.commons.logging.Log.error(Object, java.lang.Throwable)" because 
"org.apache.pdfbox.io.IOUtils.LOG" is null
  at deployment.aeroxchange-edi.ear//org.apache.pdfbox.io.IOUtils.unmapper(IOUtils.java:278)
  at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
  at deployment.aeroxchange-edi.ear//org.apache.pdfbox.io.IOUtils.(IOUtils.java:64)

This is a Jakarta EE 10 EJB maven project, running on Java 17 in Wildfly 
30.0.1.Final. commons-logging 1.2 has been added as a dependency.

Any help would be greatly appreciated!

Matt Hardy
Software Developer
Perform Air International
463 South Hamilton Court
Gilbert, Arizona 85233
Phone: (480) 610-3500
Fax: (480) 610-3501
matt.ha...@performair.com
www.PerformAir.com






Re: AFMParser optimization

2024-03-15 Thread Tilman Hausherr

Hi,

Thank you, done.

Tilman

On 15.03.2024 14:49, Guillaume Maillrd wrote:

Hi,

During a profiling session of my application, I found something that 
could interest you.


To speed up the AFMParser (50% gain), the "equals" checks in 
parseCharMetric should be written in this order (the five most 
frequently used commands first):


if (nextCommand.equals(CHARMETRICS_C)) {
...
} else if (nextCommand.equals(CHARMETRICS_WX)) {
...
} else if (nextCommand.equals(CHARMETRICS_N)) {
...
} else if (nextCommand.equals(CHARMETRICS_B)) {
...
} else if (nextCommand.equals(CHARMETRICS_L)) {
...
} ...

On my setup, it removes 80k calls to "equals".
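
To illustrate why the order matters, here is a small self-contained sketch. It is not the AFMParser code: the token stream and the key lists are assumptions chosen for the demonstration. It counts how many equals() calls a linear if/else-if chain needs, depending on the order in which the keys are tested.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only; models an if/else-if chain as a list of keys tested in order.
final class DispatchOrderDemo
{
    private DispatchOrderDemo()
    {
    }

    // Counts the equals() calls needed to match every token when the keys
    // are tested one after another in the given order.
    static long countComparisons(List<String> keyOrder, List<String> tokens)
    {
        long comparisons = 0;
        for (String token : tokens)
        {
            for (String key : keyOrder)
            {
                comparisons++;
                if (key.equals(token))
                {
                    break;
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args)
    {
        // Assumed token stream: two glyph lines, each carrying C, WX, N and B once.
        List<String> tokens = Arrays.asList("C", "WX", "N", "B", "C", "WX", "N", "B");
        // Frequent keys tested first versus tested after rarely used ones.
        List<String> frequentFirst = Arrays.asList("C", "WX", "N", "B", "L", "CH", "W0X");
        List<String> frequentLast = Arrays.asList("CH", "W0X", "L", "C", "WX", "N", "B");

        System.out.println(countComparisons(frequentFirst, tokens)); // 20
        System.out.println(countComparisons(frequentLast, tokens));  // 44
    }
}
```

With thousands of glyph lines per font file, the difference between the two orders adds up quickly.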

Regards,

Guillaume







Re: Type 0 font - Text extraction X PDF Debugger

2024-03-15 Thread Tilman Hausherr
Yes, identity does work for that file. However, using that logic fails to 
provide the correct results for other files with an unusable /ToUnicode 
stream.


Yes, there can be larger blocks.

My suspicion is that the tools that use "identity" for your file will 
fail for some of the other files, unless we discover yet another tweak of 
that workaround algorithm that works with all of them.


Tilman

On 15.03.2024 14:28, Luiz Marcelo Modesto wrote:

Thank you Tilman!

I'll try to read ISO 32000-2:2020 again to look for some kind of precedence
rules regarding the way of decoding string codes to Unicode chars.

My impression is that there are some choices but I don't remember if there
is something assertive or not. Maybe it could be just an implementation
choice.

I'll try to debug the extraction text tool to verify why using the
predefined Identity CMap works.

If I've looked at the correct CMap file
(fontbox/target/classes/org/apache/fontbox/cmap/Identity-H) it also has a
lot of begincidrange/endcidrange blocks. It doesn't have any
beginbfchar/endbfchar block.

All the blocks have their length limited to 256 codes, but it seems PDFBox
can support larger blocks. But, maybe the set "<0100>  256" could be
a problem...

PS.: The use of "true" was just a fast and dirty way to do a fast test, as
the beginbfchar/endbfchar block suggested to me an identity mapping.




On Fri, 15 Mar 2024 at 01:35, Tilman Hausherr wrote:


You are correct that it's the "fb" parts that are missing. (And some of
the other tools you tried also mention this)

Just adding true results in text extraction of several files no longer
being correct, 433525-p1.pdf O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf
PDFBOX-5540.pdf R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf

Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
no regressions but your text is not extracted properly.

Maybe it is possible to include yet another rule for your file, but
there's likely more to do and there is the risk that other files no
longer extract properly.

Tilman


Re: Bugfix for FileSystemFontProvider

2024-03-15 Thread Tilman Hausherr

Hi,

Yeah, "never happens" is a red flag. That part has been changed to use 
CRC32:

https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java?revision=1916176&view=markup#l923

https://issues.apache.org/jira/browse/PDFBOX-5727

Tilman

On 15.03.2024 13:45, Guillaume Maillrd wrote:

Hi,

In version 2.0.30, a typo in computeHash from FileSystemFontProvider 
makes every hash come back as "" (an empty string).

This breaks the cache logic, resulting in a very slow loadDiskCache.

Please replace "SHA512" with "SHA-512" or backport the v3 code to use 
CRC32.
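
For reference, a CRC32 checksum needs no algorithm-name lookup at all, so there is no string to mistype. The following is a minimal sketch using only java.util.zip.CRC32; the class and method names are made up, and it is not the code from FileSystemFontProvider.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative helper, not the PDFBox implementation.
final class FontFileChecksum
{
    private FontFileChecksum()
    {
    }

    // Returns the CRC32 of the given bytes as an 8-digit lowercase hex string.
    static String crc32Hex(byte[] data)
    {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return String.format("%08x", crc.getValue());
    }

    public static void main(String[] args)
    {
        byte[] sample = "123456789".getBytes(StandardCharsets.US_ASCII);
        // "123456789" is the standard CRC-32 check input; its checksum is cbf43926
        System.out.println(crc32Hex(sample));
    }
}
```

Unlike MessageDigest.getInstance(...), the CRC32 constructor cannot fail with a NoSuchAlgorithmException, so there is no "never happens" branch to get wrong.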

The "// never happens" comment looks funny.

Best regards,

Guillaume






Re: Type 0 font - Text extraction X PDF Debugger

2024-03-14 Thread Tilman Hausherr
You are correct that it's the "fb" parts that are missing. (And some of 
the other tools you tried also mention this)


Just adding true results in text extraction of several files no longer 
being correct, 433525-p1.pdf O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf 
PDFBOX-5540.pdf R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf


Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings 
no regressions but your text is not extracted properly.


Maybe it is possible to include yet another rule for your file, but 
there's likely more to do and there is the risk that other files no 
longer extract properly.


Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:

It seems that PDFBOX-5540 resolves a special case based on some dictionary
properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain 1 or more blocks of beginbfchar/endbfchar.

The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are really empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100>  256
endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems a kind
of Identity mapping too, except for the 0x00...), isn't it?
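
As a rough illustration of what such a cidrange entry expresses, here is a sketch in plain Java. It is not PDFBox's CMap code, all names are made up, and only the first entry of the block above is modelled because the second one is incomplete as quoted. A range maps consecutive character codes to consecutive CIDs; note that the result is a CID, not a Unicode value.

```java
// Illustrative only; PDFBox's own CMap classes are organised differently.
final class CidRangeDemo
{
    private CidRangeDemo()
    {
    }

    // One begincidrange entry: codes start..end map to firstCid, firstCid + 1, ...
    static final class CidRange
    {
        final int start;
        final int end;
        final int firstCid;

        CidRange(int start, int end, int firstCid)
        {
            this.start = start;
            this.end = end;
            this.firstCid = firstCid;
        }
    }

    // Returns the CID for a character code, or -1 if no range covers the code.
    static int toCid(CidRange[] ranges, int code)
    {
        for (CidRange r : ranges)
        {
            if (code >= r.start && code <= r.end)
            {
                return r.firstCid + (code - r.start);
            }
        }
        return -1;
    }

    public static void main(String[] args)
    {
        // First entry of the block above: <0001> <00FF> 1
        CidRange[] ranges = { new CidRange(0x0001, 0x00FF, 1) };
        System.out.println(toCid(ranges, 0x0042)); // 66, i.e. CID 0x42
        System.out.println(toCid(ranges, 0x0000)); // -1, code 0x00 is not covered
    }
}
```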

It's only shorter than the one I could have if I write several blocks of
beginbfchar/endbfchar.

If I make this "dumb" modification (adding "true" to conditions) just for a
rapid test

if (cmapName.contains("Identity") //
|| ordering.contains("Identity") //
|| COSName.IDENTITY_H.equals(encoding) //
|| COSName.IDENTITY_V.equals(encoding) || true)
{
COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
if (true || encodingDict == null || !encodingDict.containsKey(COSName.
DIFFERENCES))
{
// assume that if encoding is identity, then the reverse is also true
cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
LOG.warn("Using predefined identity CMap instead");
}
}

I've got "BCD" string like all the others

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD

Maybe the extract text tool should be using begincidrange/endcidrange
information...

What do you think?

PS.: I've read some pieces from ISO 32000-2:2020 but it is quite long.
Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:


Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!


On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr wrote:


Hi,

The problem is in the ToUnicode stream, there's a log message "Invalid
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
PDFBox is trying a fallback solution which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

  Thank you very much for your attention!

  You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
"Extra infos.pdf" file shows some output from PDF Debugger and others.

  I'm sorry, I sent the pdf file as an attachment in my first message,
but I didn't know that it wouldn't work.



On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de> wrote:


Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,

  I'm not sure if this is the same as FAQ "How come I am getting
gibberish(G38G43G36G51G5) when extracting text?"...

  I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
(build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).

  I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)

BT
/G1F7 6.0 Tf
94.871 773.806 Td
<004200430044> Tj
ET

  becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.

  Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.

  The renders that allow me to copy the text give me "BCD" text.

  It seems that PDFBox extraction tool follows the item "9.10.2
Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
the others choose a different way.

   Could y

Re: Type 0 font - Text extraction X PDF Debugger

2024-03-14 Thread Tilman Hausherr

Hi,

The problem is in the ToUnicode stream, there's a log message "Invalid 
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings. 
PDFBox is trying a fallback solution which turns out to be wrong. This 
is related to PDFBOX-5540 and earlier related issues.


Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

 Thank you very much for your attention!

 You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
"Extra infos.pdf" file shows some output from PDF Debugger and others.

 I'm sorry, I sent the pdf file as an attachment in my first message,
but I didn't know that it wouldn't work.



On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr wrote:


Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,

 I'm not sure if this is the same as FAQ "How come I am getting
gibberish(G38G43G36G51G5) when extracting text?"...

 I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
(build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).

 I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)

   BT
   /G1F7 6.0 Tf
   94.871 773.806 Td
   <004200430044> Tj
   ET

 becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe
Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.

 Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.

 The renders that allow me to copy the text give me "BCD" text.

 It seems that PDFBox extraction tool follows the item "9.10.2
Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
the others choose a different way.

  Could you help me to understand if there is a problem with the
PDF file, with the renders or with the extract text tool?

Thank you!






Re: Type 0 font - Text extraction X PDF Debugger

2024-03-14 Thread Tilman Hausherr

Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,

    I'm not sure if this is the same as FAQ "How come I am getting 
gibberish(G38G43G36G51G5) when extracting text?"...


    I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment 
(build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).


    I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)

  BT
  /G1F7 6.0 Tf
  94.871 773.806 Td
  <004200430044> Tj
  ET

    becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe 
Reader, Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
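
One plausible reading of the "BCD" result, sketched here under the assumption that a viewer treats the two-byte codes as an identity mapping to Unicode (the class and method names are made up, and this is not how any particular viewer is known to work): the hex string splits into the codes 0x0042, 0x0043 and 0x0044, which are the code points of B, C and D.

```java
// Illustrative only: identity interpretation of a two-byte PDF hex string.
final class HexStringDemo
{
    private HexStringDemo()
    {
    }

    // Splits a hex string such as "004200430044" into 2-byte character codes.
    static int[] twoByteCodes(String hex)
    {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++)
        {
            codes[i] = Integer.parseInt(hex.substring(i * 4, i * 4 + 4), 16);
        }
        return codes;
    }

    // Identity assumption: each code is used directly as a Unicode code point.
    static String identityToUnicode(int[] codes)
    {
        StringBuilder sb = new StringBuilder();
        for (int code : codes)
        {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // The operand of the Tj operator above, without the angle brackets
        System.out.println(identityToUnicode(twoByteCodes("004200430044"))); // BCD
    }
}
```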


    Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.

    The renders that allow me to copy the text give me "BCD" text.

    It seems that PDFBox extraction tool follows the item "9.10.2 
Mapping character codes to Unicode values" (ISO 32000-2:2020) but all 
the others choose a different way.
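
To make the two readings contrasted above concrete, here is a minimal sketch in plain Java (no PDFBox calls). It splits the hex string operand into 2-byte codes and then either reads each code directly as a Unicode value or looks it up in a ToUnicode-style table. The class name, method names and the table contents are hypothetical, invented for illustration; nothing here was read from p4_fix.pdf, and it is only an assumption that a mapping table explains the difference.

```java
import java.util.Map;

final class CodeMappingSketch {
    // Split a hex string operand such as <004200430044> into 2-byte character codes.
    static int[] twoByteCodes(String hex) {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    // Reading 1: treat each code as if it were already a Unicode value.
    static String direct(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Reading 2: look each code up in a ToUnicode-style table; unknown codes become "?".
    static String viaTable(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "?"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = twoByteCodes("004200430044");
        // HYPOTHETICAL table, made up for this sketch.
        Map<Integer, String> table = Map.of(0x42, "a", 0x43, "b", 0x44, "c");
        System.out.println(direct(codes));          // prints BCD
        System.out.println(viaTable(codes, table)); // prints abc
    }
}
```

Which reading a given tool applies depends on the font dictionary in the actual file, so the file itself still has to be inspected.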


 Could you help me understand whether the problem lies with the 
PDF file, with the renderers, or with the text extraction tool?


Thank you!






Re: pdfbox 3.0.2 release ?

2024-03-08 Thread Tilman Hausherr

Around the end of next week if there are no last minute surprises.

Tilman

On 08.03.2024 16:07, Frédéric Ravetier wrote:

Hello,

Do you have an idea of when 3.0.2 will be released?

Have a good day,
Fred







Re: OutOfMemoryException in FileSystemFontProvider (pdfbox v2.0.30)

2024-03-07 Thread Tilman Hausherr

Hello Kim,
You're welcome, please open a ticket and include your proposed solution. 
I have approved your registration. (I initially denied it because your 
text had no details whatsoever.)

Tilman

On 07.03.2024 14:08, Kim Hagedorn wrote:

Hello
  
  
I originally wanted to submit a defect to the PdfBox issue tracker but was redirected to this list, so here we go…
  
We experienced an OutOfMemoryError when calling

PDAcroForm.getDefaultResources().getFont(COSName); with COSName{Helv}
  
at this location:
  
main

   at java.lang.OutOfMemoryError.()V (OutOfMemoryError.java:48)
   at java.util.Arrays.copyOf([BI)[B (Arrays.java:3537)
   at java.io.ByteArrayOutputStream.ensureCapacity(I)V (ByteArrayOutputStream.java:100)
   at java.io.ByteArrayOutputStream.write([BII)V (ByteArrayOutputStream.java:130)
   at org.apache.pdfbox.io.IOUtils.copy(Ljava/io/InputStream;Ljava/io/OutputStream;)J (IOUtils.java:70)
   at org.apache.pdfbox.io.IOUtils.toByteArray(Ljava/io/InputStream;)[B (IOUtils.java:52)
   at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.addTrueTypeFontImpl(Lorg/apache/fontbox/ttf/TrueTypeFont;Ljava/io/File;)V (FileSystemFontProvider.java:773)
   at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$1400(Lorg/apache/pdfbox/pdmodel/font/FileSystemFontProvider;Lorg/apache/fontbox/ttf/TrueTypeFont;Ljava/io/File;)V (FileSystemFontProvider.java:60)
   at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$1.process(Lorg/apache/fontbox/ttf/TrueTypeFont;)V (FileSystemFontProvider.java:686)
   at org.apache.fontbox.ttf.TrueTypeCollection.processAllFonts(Lorg/apache/fontbox/ttf/TrueTypeCollection$TrueTypeFontProcessor;)V (TrueTypeCollection.java:106)
   at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.addTrueTypeCollection(Ljava/io/File;)V (FileSystemFontProvider.java:681)
   at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.scanFonts(Ljava/util/List;)V (FileSystemFontProvider.java:398)
   at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.(Lorg/apache/pdfbox/pdmodel/font/FontCache;)V (FileSystemFontProvider.java:372)
   at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.()V (FontMapperImpl.java:141)
   at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider()Lorg/apache/pdfbox/pdmodel/font/FontProvider; (FontMapperImpl.java:160)
   at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(Lorg/apache/pdfbox/pdmodel/font/FontFormat;Ljava/lang/String;)Lorg/apache/fontbox/FontBoxFont; (FontMapperImpl.java:430)
   at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(Ljava/lang/String;)Lorg/apache/fontbox/FontBoxFont; (FontMapperImpl.java:393)
   at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(Ljava/lang/String;Lorg/apache/pdfbox/pdmodel/font/PDFontDescriptor;)Lorg/apache/pdfbox/pdmodel/font/FontMapping; (FontMapperImpl.java:367)
   at de. <...> .getFontBoxFont(Ljava/lang/String;Lorg/apache/pdfbox/pdmodel/font/PDFontDescriptor;)Lorg/apache/pdfbox/pdmodel/font/FontMapping; (PdfFontManager.java:152)
   at org.apache.pdfbox.pdmodel.font.PDType1Font.(Ljava/lang/String;)V (PDType1Font.java:146)
   at org.apache.pdfbox.pdmodel.font.PDType1Font.()V (PDType1Font.java:79)
   at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(Lorg/apache/pdfbox/cos/COSDictionary;Lorg/apache/pdfbox/pdmodel/ResourceCache;)Lorg/apache/pdfbox/pdmodel/font/PDFont; (PDFontFactory.java:76)
   at org.apache.pdfbox.pdmodel.PDResources.getFont(Lorg/apache/pdfbox/cos/COSName;)Lorg/apache/pdfbox/pdmodel/font/PDFont; (PDResources.java:171)
   at de. <...> .initializeFonts()V (PdfFontHelper.java:66)
  
  
The reason seemed to be that PdfBox initializes a FontCache when getFont is called and this scans _all_ fonts on the system. This also loads some large system fonts (AppleColorEmoji is 189,9MB). Each font gets copied into a single large byte array at the location below and this causes an OutOfMemoryError at this point in the code.
  
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider#addTrueTypeFontImpl:773

InputStream is = ttf.getOriginalData();
byte[] ba = IOUtils.toByteArray(is);
is.close();
String hash = computeHash(ba);
  
I think this would be easily fixed by using a DigestInputStream instead of a byte array to compute hashes at this location. I have tested this locally and it seemed to work. I could send a patch file or submit a pull request, if it helps.
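
A minimal sketch of the DigestInputStream idea, using only JDK classes: the stream is read through a small fixed buffer while the digest is updated, so the font is never held in memory as one large array. The class name, the 8 KB buffer size and the choice of SHA-256 are assumptions made for this illustration; this is not the PDFBox implementation, and the algorithm that computeHash actually uses may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class StreamingHashSketch {
    // Hash a stream chunk by chunk instead of copying it into a byte[] first.
    static String computeHash(InputStream in) {
        try {
            // "SHA-256" is an assumption for this sketch.
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (DigestInputStream dis = new DigestInputStream(in, md)) {
                byte[] buffer = new byte[8192];
                while (dis.read(buffer) != -1) {
                    // reading is enough: the digest is updated as a side effect
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this shape the peak extra memory is the buffer size, regardless of how large the font file is.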
  
  
Best regards
  
  
Kim Hagedorn
  



Re: Feature request for filtering TextPosition in PDFTextStripperByArea and PDFTextStripper

2024-03-05 Thread Tilman Hausherr
I think I did something similar in 2018 that you might use, see the 
FilteredTextStripper class in ExtractText.java . That one only extracts 
text with angle 0.



/**
 * TextStripper that only processes glyphs that have angle 0.
 */
class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        int angle = ExtractText.getAngle(text);
        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}



    static int getAngle(TextPosition text)
    {
        // should this become a part of TextPosition?
        Matrix m = text.getTextMatrix().clone();
        m.concatenate(text.getFont().getFontMatrix());
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }
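
To see why atan2(shearY, scaleY) gives the rotation angle, here is a small sketch that applies the same formula to the JDK's java.awt.geom.AffineTransform instead of the PDFBox Matrix. Using AffineTransform as a stand-in is an assumption made so the example runs without PDFBox; the class and method names are hypothetical.

```java
import java.awt.geom.AffineTransform;

final class GlyphAngleSketch {
    // For a rotation by t, shearY is sin(t) and scaleY is cos(t),
    // so atan2(shearY, scaleY) recovers t. Uniform scaling does not change the ratio.
    static int angleOf(AffineTransform m) {
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }

    public static void main(String[] args) {
        AffineTransform upright = new AffineTransform();
        AffineTransform watermark = AffineTransform.getRotateInstance(Math.toRadians(45));
        watermark.scale(6, 6); // font size does not affect the angle
        System.out.println(angleOf(upright));   // prints 0
        System.out.println(angleOf(watermark)); // prints 45
    }
}
```

A filter can then keep only the text whose angle is 0, as the FilteredTextStripper above does.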


Tilman


On 05.03.2024 11:52, Hengyu Weng wrote:

Sometimes the watermark will overlap with normal text which we want to
extract, so it would be great if it is possible to insert a filter and skip
some useless TextPositons (e.g. the text of the watermark may have a
rotation). I think the 'writePage' method in 'PDFTextStripper' is an
appropriate place to add this filter, but I found it is difficult to
override this method as it refers to a lot of private members, and
PDFTextStripper extends LegacyPDFStreamEngine, which is a non-public class,
which makes me unable to copy and modify it.

Currently I'm embedding the source code of pdfbox to allow me to modify the
above classes, I believe it would be definitely better if you can
officially add an insert point or some hooks to them.

Thank you.







Re: Adding Annotations to Signed PDF Causes Signatures To Appear Invalid

2024-02-27 Thread Tilman Hausherr

Hi,

You're using an ordinary save(). The signature will no longer work 
because the signed file segment has changed. You need to use 
saveIncremental(). Use the method that takes a list of COSDictionaries. 
And remove the showPageNo() part, I assume Adobe will not like that 
because you're changing the looks of the page. Another problem is that 
the DocMDP value is 3, which means:


2 Permitted changes shall be filling in forms, instantiating page 
templates, and signing; other changes shall invalidate the signature.
3 Permitted changes shall be the same as for 2, as well as annotation 
creation, deletion, and modification; other changes shall invalidate the 
signature.


Thus add annotations only, don't change the page contents. Assuming that 
the annotations array already existed, you need to include the page 
COSDictionary object and each annotation COSDictionary object to the 
list mentioned earlier. (possibly the annotation appearance too).
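
The two permission rules quoted above can be written down as a tiny lookup, shown here as a sketch in plain Java. The class and method names are hypothetical. Level 1, which is not quoted above, permits no changes at all according to the PDF specification.

```java
final class DocMdpSketch {
    // Level 3 is the only level that allows annotation creation, deletion and modification.
    static boolean annotationChangesPermitted(int docMdpLevel) {
        return docMdpLevel == 3;
    }

    // Levels 2 and 3 allow filling in forms, instantiating page templates, and signing.
    static boolean formFillAndSigningPermitted(int docMdpLevel) {
        return docMdpLevel == 2 || docMdpLevel == 3;
    }

    public static void main(String[] args) {
        System.out.println(annotationChangesPermitted(3)); // prints true
        System.out.println(annotationChangesPermitted(2)); // prints false
    }
}
```

Changing the page contents is not covered by any of these levels, which is why it has to be left out.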


Tilman


On 27.02.2024 11:38, Predrag Stojković wrote:

Hello all,

I'm using Apache PDFBox 3.0.1, and I tried to add annotations to an existing 
PDF document, using basically the same code as supplied in example class 
org.apache.pdfbox.examples.pdmodel.AddAnnotations.

There are only few differences to the original code, i.e. I'm loading an 
existing document using
PDDocument document = Loader.loadPDF(new RandomAccessReadBufferedFile(args[0]))
instead of making a new one, and I'm skipping the lines in example where 
contents of new file are created, because my file already has some content.

This existing document has electronic signature on it.
After running the example code, and adding annotations to the document, when I 
open that document in Adobe Acrobat Reader (should be the latest version, 
23.8.20555.0), and then try to validate signatures, it first shows an error 
message:
There was an error creating a temporary file.

Then, when I look at the Signature Panel, it shows the following:
Signature is invalid:
There are errors in the formatting or information contained in this signature 
(support information: SigDict /Contents illegal data)

My code, as well as the PDF document can be found on address:
https://files.fm/u/bsm5vzg9wk

This PDF was signed most likely by custom code based on an old version of 
iText, but I have also tried on another document which was produced by Adobe 
Acrobat and signed by GetAccept service (I haven't provided this document, but 
I probably could if needed), and the behaviour is the same after adding 
annotations.

Can you please check what is happening here?

Best regards,
Predrag Stojković








Re: Issue with PDFBox 3.0.0 - Unable to Extract and Add Pages

2024-02-27 Thread Tilman Hausherr

Hi,

It's like Fabian said.

Btw neither the code here nor the different(!) code in 
https://stackoverflow.com/questions/78065676/ would enable anybody to 
reproduce such a bug because it's incomplete.


Until we get this fixed, please stay with 2.0.* (2.0.30 is the current 
version), and also update your jdk, 1.8.0_91 is from 2016. The current 
version is 1.8.0_402.

You can also try a snapshot here from time to time:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/
Tilman

On 27.02.2024 08:55, Amber Prakash Verma wrote:

Dear PDFBox Team,

I hope this email finds you well. I am writing to report an issue I encountered 
while using PDFBox version 3.0.0. It appears that there is a problem when 
attempting to extract pages from one PDF and add them to another PDF.
While using the same code and PDFBox version 2.0.29, it is perfectly working 
and output PDF contains no blank pages.





Re: AW: Importing landscape format and portrait format oriented pages into the same PDF causes PDF corruption

2024-02-23 Thread Tilman Hausherr

On 21.02.2024 16:07, Fabian Zünd SI-Solutions Gmbh wrote:

Hello, I managed to try it all out with the most current build 
pdfbox-app-3.0.2-20240221.085334-88.jar

The issue persists.

Maybe I'm doing the copying of the page completely wrong?


Hi,

You did nothing wrong. Sadly, this is the problem that I mentioned in my 
last mail. I've created https://issues.apache.org/jira/browse/PDFBOX-5775


Tilman


Re: How to find coordonnates of word and apply a mask

2024-02-12 Thread Tilman Hausherr
In PDF y=0 is bottom, in java it is top. See also the javadoc of the 
text.getXXX methods. It's a bit tricky, you need to do some trial and error.
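
A minimal sketch of the flip, in plain Java: it converts a box measured from the top of the page into the bottom-up coordinates that addRect expects. It assumes that the reported Y is the top edge of the box and that the page is unrotated with its origin at (0, 0); if Y is actually the text baseline, the "- height" term has to be dropped. The class name, method name and the page height of 842 are hypothetical.

```java
import java.awt.geom.Rectangle2D;

final class FlipYSketch {
    // topDown: box with y measured downwards from the top of the page.
    // Returns the same box with y as the lower-left corner measured upwards from the bottom.
    static Rectangle2D.Float toPdfSpace(Rectangle2D.Float topDown, float pageHeight) {
        float lowerLeftY = pageHeight - topDown.y - topDown.height;
        return new Rectangle2D.Float(topDown.x, lowerLeftY, topDown.width, topDown.height);
    }

    public static void main(String[] args) {
        // 842 is an assumed page height (A4 in points); use the real media box height instead.
        Rectangle2D.Float box = toPdfSpace(new Rectangle2D.Float(10f, 100f, 50f, 20f), 842f);
        System.out.println(box.y); // prints 722.0
    }
}
```

The x, width and height values are unchanged by the flip, which matches the observation that X looks right and only Y is off.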


Tilman

On 12.02.2024 21:07, Frédéric Ravetier wrote:

You said Y coordinate is not the same on pdf, this is probably my problem
on my pdf. How to get the right Y for the pdf?

On my test the x seems OK but not the Y.



On Mon, 12 Feb 2024, 19:30, Frédéric Ravetier wrote:


My goal is to draw on the same or a copy PDF a rectangle over the text,
for example to hide it or to draw a border around the text to show to the
user something about this text.





Re: How to find coordonnates of word and apply a mask

2024-02-12 Thread Tilman Hausherr
It depends what you want to get. See the DrawPrintTextLocations.java 
example which shows several strategies to get the bounding boxes of 
individual glyphs and draw them on the screen (not in a PDF, so the Y 
coordinate is different). You would have to adjust the 
"Rectangle2D.Float" code to whatever you prefer, or adjust 
DrawPrintTextLocations to collect words like the mkl code does.


Tilman

On 12.02.2024 18:48, Frédéric Ravetier wrote:

Hello,

I'd like to find some specific words in a PDF and draw a rectangle over
these words.
I'm using PDFBox 3.0.1

I found this to locate the words :
https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/java/mkl/testarea/pdfbox2/extract/ExtractWordCoordinates.java
As you can see in the println, :
System.out.println(builder.toString() + " [(X=" + boundingBox.getX() +
",Y=" + boundingBox.getY()
  + ") height=" + boundingBox.getHeight() + " width=" +
boundingBox.getWidth() + "]");

I get :
MYSTRING [(X=29.862407684326172,Y=383.78765869140625)
height=7.098414897918701 width=50.3477668762207 ]

In my prototype I print this information and copy and paste x, y, height,
width into a hardcoded block of code:

PDPage page = document.getPage(0);
PDPageContentStream contentStream = new PDPageContentStream(document,
page, PDPageContentStream.AppendMode.APPEND, false);
contentStream.setNonStrokingColor(Color.RED);
contentStream.addRect(29.862407684326172f, 383.78765869140625f,
50.3477668762207f, 7.098414897918701f);
contentStream.fill();
contentStream.close();
document.save(new FileOutputStream(src_file_path.replace(".pdf", "-rect.pdf")));


But it does not match the text on the PDF.
I tried to replace the height by the font size but it was not really better.

Where is my mistake ?

Best regards,
Fred



Re: 遇到一个无法解决的bug (Encountered a bug that I cannot solve)

2024-02-05 Thread Tilman Hausherr

Hello,

Please explain your problem in English and mention what PDFBox version 
you are using. Apparently it's about text extraction, read this first:


https://pdfbox.apache.org/3.0/faq.html#how-come-i-am-getting-gibberish(g38g43g36g51g5)-when-extracting-text%3F

Try extracting your text with Adobe Reader. Does it work? If not, then 
we won't be able to either.


If there is an exception, please include the stack trace.

Also post a link (don't attach) to the PDF involved and explain what you 
expected and what you got instead.


Tilman


On 05.02.2024 09:56, 软件开发岗位夏志强 wrote:

public List<HashMap<String, Object>> readPdfString(File file, int pageNum) {
    List<HashMap<String, Object>> result = Collections.synchronizedList(new ArrayList<HashMap<String, Object>>());
    PDDocument doc = null;
    PDDocument originalDocument = null;
    try {
        // Create a new PDF document
        originalDocument = PDDocument.load(file);
        doc = new PDDocument();
        // Iterate over the pages of the original document and copy them to the new document
        for (PDPage page : originalDocument.getPages()) {
            doc.addPage(page);
        }
        doc.save(file);
        doc = PDDocument.load(file);
        /**
         * 0 means read the data of all pages; greater than 0 means read the data of the given page number
         */
        pageNum = pageNum == 0 ? doc.getNumberOfPages() : pageNum;

        CountDownLatch latch = new CountDownLatch(pageNum);

        for (int i = 1; i <= pageNum; i++) {
            int finalI = i;
            PDDocument finalDoc = doc;
            executorService.execute(() -> {
                int attempts = 0;
                try {
                    HashMap<String, Object> map = new HashMap<>();
                    PDFTextStripper textStripper = new PDFTextStripper();
                    textStripper.setSortByPosition(true); // whether to sort by text position
                    textStripper.setStartPage(finalI); // set the start page
                    textStripper.setEndPage(finalI); // set the end page
                    // Extract text from the PDF document
                    String text = textStripper.getText(finalDoc);
                    int maxAttempts = 5; // maximum number of attempts
                    // "兹证明" means "This is to certify"
                    while (attempts < maxAttempts&&!text.contains("兹证明")&==1) {
                        text = textStripper.getText(finalDoc);
                        if (text.contains("兹证明")) {
                            break;
                        }
                        attempts++;
                    }
                    map.put("id", finalI);
                    map.put("data", text);
                    result.add(map);
                } catch (Exception e) {
                    log.error("Failed to read PDF content", e);
                    attempts++;
                } finally {
                    latch.countDown();
                }
            });
        }
        latch.await();
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    } finally {
        IoUtil.close(doc);
        IoUtil.close(originalDocument);
    }
    return result;
}
IDEA reports an error at org.apache.fontbox.ttf.TTFParser 151 parse, and the Chinese text comes out garbled.







Re: JUnit5 Compile Dependency

2024-02-02 Thread Tilman Hausherr

Hi,

Sorry about that, this has already been reported and 3.0.2 won't have 
this problem.


https://issues.apache.org/jira/browse/PDFBOX-5722

Tilman

On 02.02.2024 15:26, Willy Mwangi wrote:

Hello there,

We have experienced a bug with version 3.0.1 of PDFBOX whereby it comes
with a compile dependency of JUnit5 compared to the previous version 3.0.0
which had JUnit5 scoped for tests only. This leads to the failure of
running JUnit4 tests unless you explicitly exclude JUnit5.

Kind regards,
--

*Willy Mwangi*

*Developer Integrations, Java*







Re: Loading a PDF using InputStream

2024-02-01 Thread Tilman Hausherr

P.S.: thank you for having investigated and reported this!

Tilman




Re: Loading a PDF using InputStream

2024-02-01 Thread Tilman Hausherr
Oh. I had looked at the trunk and not at 3.0. That was likely a mistake 
in refactoring. Fixed in


 https://issues.apache.org/jira/browse/PDFBOX-5757

and you can get a snapshot here
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

Tilman


On 01.02.2024 15:25, Lars Juel Jensen wrote:

That is weird.. The source file I am looking at for version 3.0.1 does not
pass it:
-->
https://github.com/apache/pdfbox/blob/3.0.1/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java#L91




Re: Modifying the order of AcroForm Fields and/or associated Widget Annotations...

2024-01-31 Thread Tilman Hausherr

On 31.01.2024 16:50, Dwayne Parks wrote:
I'll post them on a shared file site and provide the links here, if it 
would be helpful.  Do you have any recommendations for such a site?  
Thanks!


In the past I used filedropper.com but it doesn't seem to work anymore.

Try google drive if you have a google account. Make sure one doesn't 
need to be logged in, i.e. try the public link in a private browser session.


Tilman





Re: Loading a PDF using InputStream

2024-01-31 Thread Tilman Hausherr

On 31.01.2024 16:19, Lars Juel Jensen wrote:

Well that's my problem.. It works with PDFBox2 with reasonable sized files.
When it comes to the big ones it crashes.. So reading the migration guide
for PDFBox3.0 I thought I saw some light in the tunnel as it says I can
create my own reader and stream cache. I see that I can provide my own
RandomAccessReader when I call Loader.loadPDF, but the loadPDF method that
takes a StreamCacheCreate function does not work as promised as the
StreamCacheCreateFunction is not passed from PDFParser to COSParser in the
PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess
this is a bug?


I don't know if there is a bug, but it is passed:

    public PDFParser(RandomAccessRead source, String decryptionPassword, InputStream keyStore,
            String alias, StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
    {
        super(source, decryptionPassword, keyStore, alias, streamCacheCreateFunction);
    }

and here's COSParser:

    public COSParser(RandomAccessRead source, String password, InputStream keyStore,
            String keyAlias, StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
    {
        super(source);
        this.password = password;
        this.keyAlias = keyAlias;
        fileLen = source.length();
        keyStoreInputStream = keyStore;
        init(streamCacheCreateFunction);
    }

If you think 3.0.1 has a bigger memory footprint than 3.0.0, can you 
create a scenario to reproduce this? Preferably without using a container.


Tilman



On Wed, Jan 31, 2024 at 3:46 PM Tilman Hausherr 
wrote:


On 31.01.2024 14:48, Lars Juel Jensen wrote:

This creates another problem for me. I am running PDFBox in a kubernetes
cluster on premises with limited resources. I can not setup persistent
volume claims nor ephemeral volumes, and I can not change how my pods are
started. I have limited resources and an emptyDir that is mounted on /tmp
where the temporary files go. The emptyDir is mapped to a portion of the
kubernetes node's memory, and this memory is shared with many other
services. All in all - I need to keep a very low memory and tempFile
footprint, hence the InputStream. Using RandomAccessReadBuffer with an
InputStream loads the entire PDF into memory, and I can encounter PDF
documents that can be over 1GB in size. So loading everything into memory
is not an option.

You can try to create your own class extending RandomAccessRead.

If your /tmp is mapped on main memory, then it doesn't make sense to use
a temp file at all, you're just wasting time.

Btw PDFBox 2 was also loading the whole PDF file into memory (or into a
scratch file) and had an even bigger footprint because it was also
parsing the complete PDF. So if your project was working with PDFBox 2
then it should work with PDFBox 3.

Tilman




On Wed, Jan 31, 2024 at 10:10 AM Tilman Hausherr 
wrote:


On 31.01.2024 09:50, Lars Juel Jensen wrote:

In PDFBox2 I could do:

PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())

But there is no equivalent to this in PDFBox3. How do I read a PDF from an
inputstream?


Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
        IOUtils.createTempFileOnlyStreamCache());



-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org








Re: Filling a form advice

2024-01-31 Thread Tilman Hausherr

Hello Nicola,

Please upload your PDF to a sharehoster, attachments are removed.

showTextWithPositioning is for horizontal positioning of individual 
glyphs (or vertical positioning, if it is a vertical font); it is the 
"way to specify a string with some, I don't know, offset between the chars". 
It might be tricky if you are using a proportional font. Please explain 
"but the output was not the one I need" - what happened / what did you 
expect to happen?
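
For reference, here is a minimal sketch of how the array for one character per box could be built. It assumes a monospaced font (600 is the glyph width of Courier in 1/1000 text units) and that the numbers are interpreted as with the PDF TJ operator: thousandths of a text-space unit, subtracted from the pen position, so a negative value moves to the right. The class name, the method name and the pitch/font size values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BoxedTextSketch {

    /**
     * Builds the alternating String/Float array, one glyph per box.
     * glyphWidth is in 1/1000 text units; pitch (distance between the
     * left edges of two neighbouring boxes) and fontSize are in points.
     */
    static Object[] boxedText(String text, float glyphWidth, float pitch, float fontSize) {
        // Natural advance is glyphWidth; the wanted advance is pitch * 1000 / fontSize.
        // The difference is what has to be "subtracted", hence usually negative.
        float adjustment = glyphWidth - pitch * 1000f / fontSize;
        List<Object> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            parts.add(String.valueOf(text.charAt(i)));
            if (i < text.length() - 1) {
                parts.add(adjustment); // autoboxed to Float
            }
        }
        return parts.toArray();
    }

    public static void main(String[] args) {
        Object[] array = boxedText("12345", 600f, 15f, 10f);
        System.out.println(Arrays.toString(array));
        // With PDFBox on the classpath, inside beginText()/endText():
        // cs.showTextWithPositioning(array);
    }
}
```

With a proportional font the adjustment would have to be computed per glyph from the font's own width for that character instead of a single constant, which is the tricky part mentioned above. The position of the first glyph inside its box still has to be set with newLineAtOffset.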


I'm also wondering whether the PDF is an AcroForm document, which might 
make things easier.


Tilman

On 31.01.2024 15:07, Nicola Farina wrote:

Hi

I need to produce a kind of form filled document like the attached one.
My application receives a print request payload with all the fields of
a kind of "payment postal order".
Then a PDF document, containing the layout, is loaded.
Then I fill it using PDFBox
primitives, basically a sequence of:

newLineAtOffset
followed by
showText

In the attached example, though, there is a new requirement.
There are some areas (I have encircled in red to better identify them)
which I need to place a string whose characters must be inside boxes.
I tried to use

showTextWithPositioning

preparing an array with each character followed by a number
representing the interleaving space, but the output was not the one I
need.

Do I need to manually position each character and then move the cursor
explicitly?
Is there no way to specify a string with some, I don't know, offset
between the chars?

thanks!
Bye
Nicola






Re: pdfbox 3.x, is it recommended to include jai-imageio when I am already using twelvemonkeys?

2024-01-31 Thread Tilman Hausherr
You should use all of these (including jai-imageio-core, which is 
required for JPEG 2000) except the one for TIFF. That one isn't needed, 
but you can include it if you are creating TIFF files. It is not needed for 
decoding CCITT content in PDF files. (However, our CCITT encoder / 
decoder is copied from the twelvemonkeys project.)

AFAIK the twelvemonkeys plugin puts itself in front of the other plugins.

Tilman

On 31.01.2024 14:52, C PF wrote:

I am already using twelvemonkeys tiff and jpeg along with pdfbox 3.0.1

 
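 <dependency>
     <groupId>com.twelvemonkeys.imageio</groupId>
     <artifactId>imageio-jpeg</artifactId>
     <version>3.10.1</version>
 </dependency>

 <dependency>
     <groupId>com.twelvemonkeys.imageio</groupId>
     <artifactId>imageio-tiff</artifactId>
     <version>3.10.1</version>
 </dependency>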
 

In that case, is it still recommended to include jai-imageio dependencies?
to be exact:

 
 <dependency>
     <groupId>com.github.jai-imageio</groupId>
     <artifactId>jai-imageio-jpeg2000</artifactId>
     <version>1.4.0</version>
 </dependency>
 

I am not sure if including all 3 of them as my project's dependency will
increase my compatibility with different pdf files?

Or are they going to somehow conflict with each other and make the final
results less deterministic?



Re: Loading a PDF using InputStream

2024-01-31 Thread Tilman Hausherr

On 31.01.2024 14:48, Lars Juel Jensen wrote:

This creates another problem for me. I am running PDFBox in a kubernetes
cluster on premises with limited resources. I can not setup persistent
volume claims nor ephemeral volumes, and I can not change how my pods are
started. I have limited resources and an emptyDir that is mounted on /tmp
where the temporary files go. The emptyDir is mapped to a portion of the
kubernetes node's memory, and this memory is shared with many other
services. All in all - I need to keep a very low memory and tempFile
footprint, hence the InputStream. Using RandomAccessReadBuffer with an
InputStream loads the entire PDF into memory, and I can encounter PDF
documents that can be over 1GB in size. So loading everything into memory
is not an option.


You can try to create your own class extending RandomAccessRead.

If your /tmp is mapped on main memory, then it doesn't make sense to use 
a temp file at all, you're just wasting time.


Btw PDFBox 2 was also loading the whole PDF file into memory (or into a 
scratch file) and had an even bigger footprint because it was also 
parsing the complete PDF. So if your project was working with PDFBox 2 
then it should work with PDFBox 3.


Tilman





On Wed, Jan 31, 2024 at 10:10 AM Tilman Hausherr 
wrote:


On 31.01.2024 09:50, Lars Juel Jensen wrote:

In PDFBox2 I could do:

PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())

But there is no equivalent to this in PDFBox3. How do I read a PDF from an
inputstream?


Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
        IOUtils.createTempFileOnlyStreamCache());







Re: Loading a PDF using InputStream

2024-01-31 Thread Tilman Hausherr

On 31.01.2024 09:50, Lars Juel Jensen wrote:

In PDFBox2 I could do:

PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())

But there is no equivalent to this in PDFBox3. How do I read a PDF from an
inputstream?



Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
        IOUtils.createTempFileOnlyStreamCache());


Re: Fwd: Help with Incorrect Identity-H Mapping

2024-01-30 Thread Tilman Hausherr
I'm using netbeans, so I can't help there much. Here's IntelliJ's help 
page which you may already have seen:

https://www.jetbrains.com/help/idea/encoding.html

To be 100% sure of what's in your file, open it with NOTEPAD++ and use 
the hex plugin, or a hex editor. NOTEPAD++ also offers the feature to 
add a BOM to the file (if it isn't there).
Assuming that this is correct, are you using maven? If yes, then 
the maven-compiler-plugin should look somewhat like this:


    
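    <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
            true
            <source>17</source>
            <target>17</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>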
    
    

If you're not using maven, then look here:
https://stackoverflow.com/questions/43405266/ant-with-intellij-idea-encoding-problems
https://stackoverflow.com/questions/48206942/intellij-uses-wrong-encoding

If you still can't get it to run and nobody else answers here, ask on 
stackoverflow...


Tilman

On 31.01.2024 03:15, Gino G wrote:

Thanks for the reply Tilman.

Yes, you are right, I am indeed getting 8, and not 4.

However, I've been trying to change the encoding for almost two hours now,
with no effect.
Would you happen to know any resources that can help me get this to work?

For more reference, I'm using IntelliJ and all files in my project display
"UTF-8".
I'm using the Javac compiler using version 17 without command line
parameters.
However, I've tried setting things like: encoding=UTF-8, etc. with no
success.

If this solved my issue, that would be amazing, but unfortunately I can't
get it to work.

On 2024/01/30 16:01:56 Tilman Hausherr wrote:

Also try changing the line

 cs.showText("äöüß");

to

 String s = "äöüß";
 System.out.println(s.length());
 cs.showText(s);

the output on the console should be 4. I suspect your output will be 8
if my theory is correct.

Tilman








Re: Fwd: Help with Incorrect Identity-H Mapping

2024-01-30 Thread Tilman Hausherr

Also try changing the line

   cs.showText("äöüß");

to

   String s = "äöüß";
   System.out.println(s.length());
   cs.showText(s);

the output on the console should be 4. I suspect your output will be 8 
if my theory is correct.
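
The effect can be reproduced without PDFBox. The following sketch takes the UTF-8 bytes of the string and decodes them with a single-byte charset, which is roughly what happens when a UTF-8 source file is compiled with a single-byte encoding setting. ISO-8859-1 is used here only because it is guaranteed to be available; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    /** Re-reads the UTF-8 bytes of s as if they were ISO-8859-1. */
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes for "äöüß" keep this file independent of its own encoding.
        String s = "\u00e4\u00f6\u00fc\u00df";
        System.out.println(s.length());            // 4
        System.out.println(misdecode(s).length()); // 8
    }
}
```

Each of the four characters needs two bytes in UTF-8, and a single-byte charset turns every byte into its own character, hence 8.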


Tilman



Re: Fwd: Help with Incorrect Identity-H Mapping

2024-01-30 Thread Tilman Hausherr

Hello Gino,

Please tell us whether it happens with every font or only with that one. 
And check whether the encoding of the source file is the same as the one passed to 
the javac compiler. I suspect your file is UTF-8 but the Java compiler 
expects a single-byte encoding.


It works for me, I just tested it:

    public static void main(String[] args) throws IOException
    {
        try (PDDocument doc = new PDDocument())
        {
            PDFont font = PDType0Font.load(doc, new FileInputStream("/OpenSans-Regular.ttf"), false);

            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page))
            {
                cs.setFont(font, 20);
                cs.beginText();
                cs.newLineAtOffset(50, 650);
                cs.showText("äöüß");
                cs.endText();
            }
            doc.save("/gino.pdf");
        }
    }

And this is the content stream:

/F1 20 Tf
BT
  50 650 Td
  (\000\246\000\270\000\276\000\241) Tj
ET
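
To compare the two content streams, the octal escapes can be turned into 2-byte codes by hand or with a small helper. The sketch below is not PDFBox code; it handles only \ddd octal escapes and plain characters, which is all that appears in the two strings of this thread, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TjStringCodes {

    /** Turns the body of a PDF literal string into byte values (octal escapes and plain chars only). */
    static List<Integer> toBytes(String literal) {
        List<Integer> bytes = new ArrayList<>();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c != '\\') {
                bytes.add((int) c);
                i++;
                continue;
            }
            i++; // skip the backslash
            int value = 0;
            int digits = 0;
            // an octal escape has at most three digits
            while (i < literal.length() && digits < 3
                    && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                value = value * 8 + (literal.charAt(i) - '0');
                i++;
                digits++;
            }
            if (digits == 0) {
                throw new IllegalArgumentException("only octal escapes are supported");
            }
            bytes.add(value & 0xFF);
        }
        return bytes;
    }

    /** Pairs the bytes into 2-byte codes, as used with Identity-H. */
    static List<Integer> toCodes(String literal) {
        List<Integer> bytes = toBytes(literal);
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 1 < bytes.size(); i += 2) {
            codes.add(bytes.get(i) << 8 | bytes.get(i + 1));
        }
        return codes;
    }

    public static void main(String[] args) {
        // the string from the working test above
        System.out.println(toCodes("\\000\\246\\000\\270\\000\\276\\000\\241"));
        // prints [166, 184, 190, 161]

        // the string from the reported PDF
        System.out.println(toCodes("\\000\\205\\000f\\000\\205\\000x\\000\\205\\000~\\000\\205\\0019"));
        // prints [133, 102, 133, 120, 133, 126, 133, 313]
    }
}
```

Four codes in the working file versus eight in the reported one is consistent with the string having 8 characters instead of 4 by the time it reaches showText.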

Tilman

On 30.01.2024 15:52, Gino G wrote:

Hello there,

I'm encountering an error in how certain characters are encoded using 
PDFBox. The issue exists in all versions of PDFBox, but I'm currently 
using 3.0.1.


contentStream.showText("äöüß");

The string "äöüß" is used as a test for Unicode characters that PDFBox 
needs to render.


var resource = Processor.class.getResource("/OpenSans-Regular.ttf");
var file = Paths.get(resource.toURI()).toFile();
var targetStream = new FileInputStream(file);
var out = PDType0Font.load(PageAssembler.getDocument(), targetStream, false);
contentStream.setFont(out, 20);


To do so, I'm importing a font that I know has the glyphs for all four 
special characters (OpenSans downloaded from Google Fonts).
However, this issue can be reproduced using any other 
Unicode-supported font.


Executing the code, PDFBox renders the following character 
sequence: Ã¤Ã¶Ã¼ÃŸ.

Clearly an encoding issue.

Using the PDF Debugger, it shows the text rendered as:

/F1 20 Tf
BT
  (\000\205\000f\000\205\000x\000\205\000~\000\205\0019) Tj
ET

Now, as far as I understand from what I've learned while debugging 
this issue, \205 is the octal value that uses the glyph at position 
133 (decimal for \205) of the font with the id F1.
Again, looking at the F1 section in the PDF Debugger, the character 
listed under the code / CID / GID 133 is indeed Ã, the first 
"incorrect" character of the sequence, which is supposed to be "ä"

"ä", however, would be 166, not 133. How does PDFBox get this wrong?

As an aside, if I use showText and use toUnicode(166), PDFBox 
correctly renders "ä" in the desired font!
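
The relation between these codes and the characters can be checked against the bfrange entries of the ToUnicode CMap quoted at the end of this message. Each entry "<first> <last> <dst>" maps a code c in that range to dst + (c - first). The sketch below copies only a few of the entries and is not how PDFBox parses CMaps; the class and method names are made up for illustration.

```java
final class BfRangeDemo {

    // { firstCode, lastCode, firstUnicode } rows copied from the ToUnicode CMap in this message
    static final int[][] RANGES = {
        {0x0003, 0x0061, 0x0020},
        {0x0062, 0x00C1, 0x00A0},
        {0x00C2, 0x00F2, 0x0100},
        {0x0125, 0x0140, 0x0164},
    };

    /** Returns the Unicode code point for a code, or -1 if no copied range covers it. */
    static int toUnicode(int code) {
        for (int[] r : RANGES) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 166 -> U+00E4, 133 -> U+00C3
        System.out.printf("166 -> U+%04X%n", toUnicode(166));
        System.out.printf("133 -> U+%04X%n", toUnicode(133));
    }
}
```

So code 166 is indeed "ä" (U+00E4) and code 133 is "Ã" (U+00C3): the font and its ToUnicode map are consistent, and the wrong codes come from the wrong characters being passed in.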


Looking at the "ToUnicode" part of the F1 font, the following string 
is displayed.


Could someone please help me figure out what is going on? And 
hopefully even help me fix this issue? For more help, I have attached 
the PDF document.


Best,
Gino

ToUnicode:

/CIDInit /ProcSet findresource begin
12 dict begin

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def

/CMapName /Adobe-Identity-UCS def
/CMapType 2 def

1 begincodespacerange
<> 
endcodespacerange

100 beginbfrange
<0001> <0001> <>
<0002> <0002> <000D>
<0003> <0061> <0020>
<0062> <00C1> <00A0>
<00C2> <00F2> <0100>
<00F3> <00FF> <0132>
<0100> <0122> <013F>
<0123> <0124> <021A>
<0125> <0140> <0164>
<0141> <0141> <0192>
<0142> <0147> <01FA>
<0148> <0149> <0218>
<014A> <014B> <02C6>
<014C> <014C> <02C9>
<014D> <0152> <02D8>
<0153> <0159> <0384>
<015A> <015A> <038C>
<015B> <016E> <038E>
<016F> <019A> <03A3>
<019B> <01A6> <0401>
<01A7> <01E8> <040E>
<01E9> <01F4> <0451>
<01F5> <01F6> <045E>
<01F7> <01F8> <0490>
<01F9> <01FE> <1E80>
<01FF> <01FF> <1EF2>
<0200> <0200> <1EF3>
<0201> <0203> <2013>
<0204> <020B> <2017>
<020C> <020E> <2020>
<020F> <020F> <2026>
<0210> <0210> <2030>
<0211> <0212> <2032>
<0213> <0214> <2039>
<0215> <0215> <203C>
<0216> <0216> <2044>
<0217> <0217> <207F>
<0218> <0219> <20A3>
<021A> <021A> <20A7>
<021B> <021B> <20AC>
<021C> <021C> <2105>
<021D> <021D> <2113>
<021E> <021E> <2116>
<021F> <021F> <2122>
<0220> <0220> <2126>
<0221> <0221> <212E>
<0222> <0225> <215B>
<0226> <0226> <2202>
<0227> <0227> <2206>
<0228> <0228> <220F>
<0229> <022A> <2211>
<022B> <022B> <221A>
<022C> <022C> <221E>
<022D> <022D> <222B>
<022E> <022E> <2248>
<022F> <022F> <2260>
<0230> <0231> <2264>
<0232> <0232> <25CA>
<0235> <0235> <0326>
<0237> <0238> <2074>
<0239> <023A> <2077>
<023B> <0246> <2000>
<0247> <0247> 
<0248> <0249> 
<024A> <024A> <01F0>
<024B> <024B> <02BC>
<024C> <024D> <03D1>
<024E> <024E> <03D6>
<024F> <0250> <1E3E>
<0251> <0252> <1E00>
<0253> <0253> <02F3>
<0254> <0255> <01A0>
<0256> <0257> <01AF>
<0259> <0259> <0400>
<025A> <025A> <040D>
<025B> <025B> <0450>
<025C> <025C> <045D>
<025D> <027F> <0460>
<0280> <0287> <0488>
<0288> <02F5> <0492>
<02F6> <02FF> <0500>
<0300> <0309> <050A>
<030A> <035B> <1EA0>
<035C> <0361> <1EF4>
<0362> <0362> <20AB>
<036D> <036E> <0162>

Splitting PDF while keeping document structural information

2024-01-29 Thread Tilman Hausherr
The trunk now supports splitting the structure tree. Please test it and 
report any problems.

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/4.0.0-SNAPSHOT/

If you're a JIRA user, you can also make your comments here:
https://issues.apache.org/jira/browse/PDFBOX-2725

I'll port it to the other versions after some time.

If you don't have Adobe Acrobat Professional, you can use PDF-XChange 
Editor to verify that all elements are there.
In PDF-XChange, click on the left toolbar on the 4th to last icon that 
looks like a key tag, and on the second to last icon that checks the PDF 
for accessibility problems.

Ideally, the old and the new file should have the same problems.

Tilman





Re: The annotations generated by PDFBOX cannot be displayed in the browser, but they can be displayed in adobe pdf reader

2024-01-25 Thread Tilman Hausherr

Hi,

Please include more of your code. It does not show how this 
PDAnnotationFreeText is created, and whether you called 
*constructAppearances()* on it. Also upload your PDF to a sharehoster, 
and mention what PDFBox version you're using.

Tilman

On 26.01.2024 07:39, Tam chilun wrote:

Dear developer

I use getAnnotations().add(anno) to add annotations, but they are not 
displayed in my browser. Is another method needed, or is this not supported yet?

 Addannotations annoadder = new Addannotations();
 PDAnnotationFreeText anno = annoadder.setanno(param);


 PDPage page = document.getPage(Integer.parseInt(param[0])-1);
 page.getAnnotations().add(anno);

best



Re: potential issue in fontbox component CmapSubtable

2024-01-17 Thread Tilman Hausherr

Hi,

I hope I'm not wrong on this, but if the second element is true 
(glyphIdToCharacterCode == null) then the third one wouldn't be 
evaluated, because there's no need. (short circuit evaluation)
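
A small stand-alone check of this, with a static array standing in for the field in CmapSubtable (the class name and the test values are made up for illustration):

```java
final class ShortCircuitDemo {

    // Stand-in for the field; null simulates the case discussed in this thread.
    static int[] glyphIdToCharacterCode = null;

    static int getCharCode(int gid) {
        // || stops at the first operand that is true, so .length is never
        // read when the null check before it has already succeeded.
        if (gid < 0 || glyphIdToCharacterCode == null || gid >= glyphIdToCharacterCode.length) {
            return -1;
        }
        return glyphIdToCharacterCode[gid];
    }

    public static void main(String[] args) {
        System.out.println(getCharCode(5)); // -1, no NullPointerException
        glyphIdToCharacterCode = new int[] {65, 66, 67};
        System.out.println(getCharCode(1)); // 66
        System.out.println(getCharCode(3)); // -1, out of range
    }
}
```

In a single thread the condition as written cannot throw on a null array, which is why the version question below matters.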


Look at https://issues.apache.org/jira/browse/PDFBOX-5465 , the stack 
trace looks just like yours.


Could it be you're not really using 2.0.28 but 2.0.26 or earlier?

Tilman

On 17.01.2024 16:28, Michal Stefan wrote:

Hello,

we are using pdfbox version 2.0.28 (awesome library, thanks for that!) 
and recently we observed an issue (attached txt). Unfortunately I do 
not have the pdf (as this issue happens before the document was saved, 
so we do not have the document). However looking at the 
CmapSubtable.java even in the latest code (as well as 2.0.28 version), 
it seems like the condition is not safe:


private int getCharCode(int gid)
{
    if (gid < 0 || glyphIdToCharacterCode == null || gid >= glyphIdToCharacterCode.length)
    {
        return -1;
    }
    return glyphIdToCharacterCode[gid];
}

The exception, as well as the code, suggests that even if 
glyphIdToCharacterCode is null, it's still possible that 
glyphIdToCharacterCode.length gets evaluated. What do you think?

Best Regards,
Michal Stefan







Re: Cannot get overlaypdf working on command line interface

2024-01-15 Thread Tilman Hausherr

Hi,

Sorry, it turns out there was a second bug, which has now been fixed. 
And this time I tested myself and it works. Please test again with the 
latest snapshot.


https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

https://issues.apache.org/jira/browse/PDFBOX-5748

Tilman


On 15.01.2024 08:28, Lukas Jans wrote:


Hello,

Sorry for not replying earlier. I tried it now with the latest 
snapshot build, but still got the same exception:


[Image: screenshot (not preserved in the archive)]


The merge command still works as intended, but as for overlay, I do 
not see any change.


Kind regards

Lukas



Re: merging a pre-existing file with a new page

2024-01-10 Thread Tilman Hausherr

Hi,

Please retry with 2.0.* (there use PDDocument.load()) and with a 
snapshot version of 3.0.2 because we fixed bugs related to what you mention:

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

If it doesn't work, please try with the command line merge application, 
and please upload the PDFs to a sharehoster, and post the smallest 
possible code that reproduces the problem.


Tilman

On 11.01.2024 03:46, Vaishant Bafna wrote:

Hey!

I am using pdfbox-app-3.0.1 API for a PDF merging facility on my
application made on Java NetBeans IDE 18. However, when I am compiling and
using the 'Loader' to load my PDF files from the desktop and merge a
pre-existing batch file with a new page I would like to add, it adds the
page as a blank one! I am unable to solve this problem! Can someone please
help me with this?






Re: FW: PDFBox 3.0.1 Font changes when rendering PDF to Image

2024-01-10 Thread Tilman Hausherr

Hi,

That's why I mentioned to look at the log messages, there would be one 
mentioning that a fallback font is used.


The alternative would be to implement your own FontMapper. Call 
FontMappers.set() with your own FontMapper. To see how to implement your 
own, look at the source code of FontMapperImpl class.


All this is not trivial, probably several days of work. The best would 
be to expand the "lastResortFont" part to support all standard 14 fonts 
instead of just having LiberationSans.


Tilman



On 10.01.2024 16:37, Lisa Moore wrote:


I think the issue is that the required font is not on the Azure 
Kubernetes image that we are now running on. We are not allowed to 
load any fonts on this image. Is there a way to embed the required 
font into the Java code that is creating the image from the PDF file? 
The Java code is included below:


public class PDFToImage {

    public static Object transformMessage(String baos) throws Exception
    {
        ByteArrayOutputStream[] imageBaos;
        byte[] decodedString = Base64.getDecoder().decode(baos.getBytes("UTF-8"));

        // Get the input stream
        try (PDDocument pddDoc = Loader.loadPDF(decodedString)) {
            PDFRenderer pr = new PDFRenderer(pddDoc);
            int pageCount = pddDoc.getNumberOfPages();
            BufferedImage bim = new BufferedImage(25, 25, BufferedImage.TYPE_INT_ARGB);

            ByteArrayOutputStream stream = new ByteArrayOutputStream();
            imageBaos = new ByteArrayOutputStream[pageCount];
            for (int page = 0; page [...]

    private static BufferedImage joinBufferedImage(BufferedImage img1, BufferedImage img2) {
        // TODO Auto-generated method stub
        int offset = 5;
        int wid = Math.max(img1.getWidth(), img2.getWidth() + offset);
        int height = img1.getHeight() + img2.getHeight() + offset;
        BufferedImage newImage = new BufferedImage(wid, height, BufferedImage.TYPE_INT_RGB);

        Graphics2D g2 = newImage.createGraphics();
        Color oldColor = g2.getColor();
        g2.setPaint(Color.WHITE);
        g2.fillRect(0, 0, wid, height);
        g2.setColor(oldColor);
        g2.drawImage(img1, null, 0, 0);
        g2.drawImage(img2, null, 0, img1.getHeight() + offset);
        g2.dispose();
        return newImage;
    }

}

*From:* Tilman Hausherr 
*Sent:* Wednesday, January 10, 2024 10:17 AM
*To:* users@pdfbox.apache.org
*Subject:* Re: FW: PDFBox 3.0.1 Font changes when rendering PDF to Image


Hi,

I tested with 3.0.1 and got one log message:

Unexpected XRefTable Entry: 0    24

that's because that line is " 0 24" instead of "0 24". However 
that doesn't seem to have a negative effect. Here's how the image looks:


Tilman

On 10.01.2024 15:52, Lisa Moore wrote:

A sample PDF file can be seen here:


https://www.dropbox.com/scl/fi/w5zgfrqbulungxd4dpq37/MuseTest.pdf?rlkey=jskisldanhoxf3pvcqqy6nk7b=0

    -Original Message-

From: Tilman Hausherr  <mailto:thaush...@t-online.de>

Sent: Wednesday, January 10, 2024 8:09 AM

To:users@pdfbox.apache.org

Subject: Re: FW: PDFBox 3.0.1 Font changes when rendering PDF to Image


Hi,

We'd need the PDF file, please upload to a sharehoster. Your attachments 
(all of them) didn't get through.

Also try to use the latest snapshot


https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

and look at the log messages.

Tilman

On 10.01.2024 13:39, Lisa Moore wrote:

*From:* Lisa Moore

*Sent:* Tuesday, January 9, 2024 10:54 AM

*To:*users-h...@pdfbox.apache.org

*Subject:* PDFBox 3.0.1 Font changes when rendering PDF to Image

Hi,

I am using PDFBox to render a PDF to a .png image.  In the past,  I

used version 2.0.23 which worked without issue.  When the image is

rendered in version 3.0.1, the text part of the PDF document does not

properly convert the Font (Times Roman).   How can I fix this issue?

I have attached the images to show the comparison of what is being

rendered in version 3.0.1 versus 2.0.23.

Thanks for any help you can provide.

Lisa Moore




Re: FW: PDFBox 3.0.1 Font changes when rendering PDF to Image

2024-01-10 Thread Tilman Hausherr

Hi,

I tested with 3.0.1 and got one log message:

Unexpected XRefTable Entry: 0    24

that's because that line is " 0 24" instead of "0 24". However that 
doesn't seem to have a negative effect. Here's how the image looks:



Tilman

On 10.01.2024 15:52, Lisa Moore wrote:

A sample PDF file can be seen here:
https://www.dropbox.com/scl/fi/w5zgfrqbulungxd4dpq37/MuseTest.pdf?rlkey=jskisldanhoxf3pvcqqy6nk7b=0

-Original Message-
From: Tilman Hausherr
Sent: Wednesday, January 10, 2024 8:09 AM
To:users@pdfbox.apache.org
Subject: Re: FW: PDFBox 3.0.1 Font changes when rendering PDF to Image





Hi,

We'd need the PDF file, please upload to a sharehoster. Your attachments (all 
of them) didn't get through.
Also try to use the latest snapshot
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/
and look at the log messages.
Tilman

On 10.01.2024 13:39, Lisa Moore wrote:

*From:* Lisa Moore
*Sent:* Tuesday, January 9, 2024 10:54 AM
*To:*users-h...@pdfbox.apache.org
*Subject:* PDFBox 3.0.1 Font changes when rendering PDF to Image

Hi,

I am using PDFBox to render a PDF to a .png image.  In the past,  I
used version 2.0.23 which worked without issue.  When the image is
rendered in version 3.0.1, the text part of the PDF document does not
properly convert the Font (Times Roman).   How can I fix this issue?
I have attached the images to show the comparison of what is being
rendered in version 3.0.1 versus 2.0.23.

Thanks for any help you can provide.

Lisa Moore





Re: java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62

2024-01-10 Thread Tilman Hausherr

Hi,

This is a syntax error in the PDF. There should be another token after "/N".
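
To illustrate the rule: inside "<< ... >>" every name key must be followed by a value, so a parser that has just read "/N" expects an object and finds the ">" of the closing delimiter instead. The toy check below is not the PDFBox parser; it only handles flat dictionaries whose keys and values are single whitespace-separated tokens, and the class and method names are made up for illustration.

```java
final class FlatDictCheck {

    /**
     * Returns the first key that has no value in a flat "<< ... >>" dictionary,
     * or null if all keys have one. Nested objects, indirect references and
     * strings containing spaces are not handled.
     */
    static String firstKeyWithoutValue(String dict) {
        String[] t = dict.trim().split("\\s+");
        if (t.length < 2 || !t[0].equals("<<") || !t[t.length - 1].equals(">>")) {
            throw new IllegalArgumentException("expected << ... >>");
        }
        // Tokens between the delimiters must come in key/value pairs.
        for (int i = 1; i < t.length - 1; i += 2) {
            if (i + 1 == t.length - 1) {
                return t[i]; // the token after the key is already the closing >>
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstKeyWithoutValue("<< /N >>"));   // /N
        System.out.println(firstKeyWithoutValue("<< /N 3 >>")); // null
    }
}
```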

Tilman

On 10.01.2024 13:19, John, Ines wrote:


Hello PdfBox-Team,

we have the following problem in our project:

When merging documents we get an exception for a certain document. 
That’s why we updated the version of pdfBox to 3.0.1. Now we can merge 
the documents but we still get the error in the logfile.


We merge documents by using *pdfMergerUtility.mergeDocuments();*

Extract from the logfile:

2024-01-10 09:36:34.396 ERROR 11764 --- [pool-1-thread-1] 
org.apache.pdfbox.cos.COSObject  : Can't dereference 
COSObject{14, 0}


java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' 
peekInt=62 at offset 179966 (start offset: 179966)


   at 
org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:921) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:187) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:347) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:263) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:882) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:734) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:668) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdfparser.COSParser.dereferenceCOSObject(COSParser.java:623) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.cos.COSObject.getObject(COSObject.java:121) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.cos.COSDictionary.getDictionaryObject(COSDictionary.java:186) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.cos.COSDictionary.getCOSDictionary(COSDictionary.java:551) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.pdmodel.PDDocument.getDocumentInformation(PDDocument.java:745) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.multipdf.PDFMergerUtility.appendDocument(PDFMergerUtility.java:527) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:468) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:363) 
~[pdfbox-3.0.1.jar:3.0.1]


   at 
org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:343) 
~[pdfbox-3.0.1.jar:3.0.1]


We can’t share the original document of our customer with you, but we 
could manipulate an empty pdf document by inserting the problematic 
object:


14 0 obj

<< /N >>

endobj

I attached the example pdf to my email.

Kind regards,

Ines

---
 >>> business. people. technology. <<<
---

adesso SE mit Sitz in Dortmund
Vorstand: Mark Lohweber (Vors.), Kristina Gerwert,
Andreas Prenneis, Jörg Schroeder, Torsten Wegener
Vorsitzender des Aufsichtsrates: Prof. Dr. Volker Gruhn
Amtsgericht Dortmund HRB 20663





Re: FW: PDFBox 3.0.1 Font changes when rendering PDF to Image

2024-01-10 Thread Tilman Hausherr

Hi,

We'd need the PDF file, please upload to a sharehoster. Your attachments 
(all of them) didn't get through.

Also try to use the latest snapshot
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/
and look at the log messages.
Tilman

On 10.01.2024 13:39, Lisa Moore wrote:


*From:* Lisa Moore
*Sent:* Tuesday, January 9, 2024 10:54 AM
*To:* users-h...@pdfbox.apache.org
*Subject:* PDFBox 3.0.1 Font changes when rendering PDF to Image

Hi,

I am using PDFBox to render a PDF to a .png image.  In the past,  I 
used version 2.0.23 which worked without issue.  When the image is 
rendered in version 3.0.1, the text part of the PDF document does not 
properly convert the Font (Times Roman).   How can I fix this issue?   
I have attached the images to show the comparison of what is being 
rendered in version 3.0.1 versus 2.0.23.


Thanks for any help you can provide.

Lisa Moore






Re: Inquiry on Filling Chinese Characters in AcroForm with PDFBox 3.0.1

2024-01-05 Thread Tilman Hausherr

Hi,

I only remember that we always advise to never embed font subsets in 
AcroForm fields. Your subsetted file doesn't have the actual subset fonts.


Does this effect also happen when you don't flatten? And if you save 
first, then reload and flatten?


Tilman

On 05.01.2024 08:41, Congwei Ni wrote:


Hi Apache PDFBox Team,

I am currently working with PDFBox 3.0.1 for filling AcroForm fields 
in my PDF files, with Chinese characters. In my attempts, I've loaded 
the SimSun.ttf Chinese font into the file and set embed subset to 
false. While this approach successfully fills the Chinese characters, 
the resultant PDF file size is significantly large, which does not 
meet my requirements.


When I set embed subset to true, the file size is reduced, and the PDF 
displays correctly on my Mac. However, on Windows, the embedded 
Chinese characters in the same file appear as garbled text. Notably, 
the same SimSun.ttf font is installed on both systems.


I am seeking advice on how to meet the following requirements:

1. Correctly embed Chinese characters into AcroForm fields, ensuring 
they display accurately across most systems without any encoding issues.


2. Keep the final PDF file size under 500KB.


To illustrate my issue, I have attached the following items to this 
google drive:


https://drive.google.com/drive/folders/1vUiKt_Z1z7CwgIaL73Jki_FmAZIWOw1c?usp=drive_link

1. A PDF file generated with embed subset set to true.

2. A PDF file generated with embed subset set to false.

3. The source code I am using for embedding the font and filling the form.

4. SimSun.ttf I am using

Could you please provide guidance or suggest alternative methods to 
achieve these objectives? Any sample code would be greatly appreciated.



Thank you for your assistance.

Best regards,

Congwei






Re: Cannot get overlaypdf working on command line interface

2024-01-05 Thread Tilman Hausherr

Hi,

The bug I found and fixed ( 
https://issues.apache.org/jira/browse/PDFBOX-5748 ) is only in the 
command line interface. Please try with a snapshot build:


https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

*(at the bottom)*

and tell whether it works now.

Tilman

On 05.01.2024 09:14, Lukas Jans wrote:

Hello again,

Thanks very much for your swift reply. I then suspect that this only applies to 
the command line interface? (Because the intention of using the command line 
was only to get a feeling of how overlay works, the main aim is to use the java 
library in a project.)

Kind regards
Luke



Re: Cannot get overlaypdf working on command line interface

2024-01-04 Thread Tilman Hausherr
Sorry, it seems I read parts 1, 2 and 4 but not part 3. I suspect a bug in 
OverlayPDF.java that has been there since the end of 2020 (!), but only 
in 3.0.* and the trunk: "infile" is never assigned.


Tilman

On 05.01.2024 08:00, Lukas Jans wrote:


*Tilman Hausherr* - Thursday, 4 January 2024 12:07:13 CET

Please use "overlay" instead of "OverlayPDF". This is a documentation bug.

(See also the "did you mean" line in the error message)

Tilman

Hello

Thanks for the reply, but that is exactly what I did and described in 
the second half of my first post. I repeat it here for completion’s sake:


I can of course adapt the command to overlay instead of overlayPDF. 
Leaving the other arguments unchanged I get the following result:


[Image: a picture containing text, screenshot, font. Automatically 
generated description.]


In order to check whether there is a mistake concerning the formatting 
of the arguments or so, I try the command merge with the same 
arguments like this:


This works as intended.

So again, what is the problem with the execution of the overlay command?

Kind regards

Luke



Re: Cannot get overlaypdf working on command line interface

2024-01-04 Thread Tilman Hausherr

Please use "overlay" instead of "OverlayPDF". This is a documentation bug.

(See also the "did you mean" line in the error message)

Tilman

On 04.01.2024 12:00, Lukas Jans wrote:


Hello

I am having trouble using the pdfbox command line interface. I have 
downloaded pdfbox-app-3.0.1.jar and saved it locally.


I am using Windows 11 version 22H2. On my PC Java is installed as follows:

In the same local folder I have two PDFs, namely document.pdf and 
background_confidential.pdf. Now I want to overlay them using the 
command overlayPDF as described here: Apache PDFBox | Command-Line 
Tools. I try that as follows:


This is slightly confusing in the first place since the documentation 
says the command to be called overlayPDF and not overlay as suggested 
by the error message.


However, I can of course adapt the command to overlay instead of 
overlayPDF. Leaving the other arguments unchanged I get the following 
result:


[Image: a picture containing text, screenshot, font. Automatically 
generated description.]


In order to check whether there is a mistake concerning the formatting 
of the arguments or so, I try the command merge with the same 
arguments like this:


This works as intended.

So, first of all, is there something wrong with the documentation? 
Because it says the command is called overlayPDF but I have to enter 
overlay to make it understand what I want. And secondly, what is the 
problem with the execution of the overlay command?


Thanks for any help in advance.

Kind regards

Luke



Re: Importing landscape format and portrait format oriented pages into the same PDF causes PDF corruption

2024-01-03 Thread Tilman Hausherr
Please retry with 3.0.1 and, if it still doesn't work, with the current 
snapshot version, because there have been several bugs related to 
including "foreign" pages in PDFs.

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/
Tilman

On 03.01.2024 09:43, Fabian Zünd SI-Solutions Gmbh wrote:


Good Day

The platform I'm developing for recently switched from PDFBox 2.x to 
3.0.0.

I created an add-on which generates PDF documentation of the PBX for 
customers.

This PDF contains multiple A4 pages, some in the normal portrait 
format, some rotated to landscape format for more space.

I use "template" pages, which are single-page PDFs (Cover Sheet.pdf, 
Normal_page.pdf, Normal_page_landscape.pdf), of which I create a copy 
for every page in the main PDF, based on the user's choices for 
the documentation.

In 2.x I used the integrated PDFCloneUtility to create a copy of the 
template page(s) and copy it to the main PDF using this:


    PDPage SelectedPage = PDFSource.getPage(PageNumber);
    PDFCloneUtility PDC = new PDFCloneUtility(PDFTarget);
    COSDictionary PD = (COSDictionary) PDC.cloneForNewDocument(SelectedPage);
    PDPage ClonedPage = new PDPage(PD);
    PDFTarget.addPage(ClonedPage);

But since the PDFCloneUtility is protected in 3.0.0, I switched over to 
using the PDDocument importPage function.


PDPage SelectedPage = PDFSource.getDocument().getPage(PageNumber);

PDPage PDCopiedPage = PDFTarget.importPage(SelectedPage);

Everything seemed fine when testing. But when I started to generate 
the full documentation, the finished PDF did contain all pages, but 
Adobe throws a lot of errors, and all the landscape pages are blank.

If I only generate portrait pages (Generated_PDF_Portraint_only.pdf) 
or landscape pages (Generated_PDF_Landscape_only.pdf), everything is 
fine, but when I mix them (Generated_PDF_Mixed.pdf), the result is broken.

I don't know exactly what could be causing this issue; I was hoping 
somebody might have some kind of clue where this could come from.

Maybe I'm misunderstanding the importPage function, and it is not 
actually the correct way to clone pages?


Sincerely

Fabian Zünd






Re: Splitter creates corrupted PDFs

2023-12-27 Thread Tilman Hausherr

Hi,
It has been fixed now, you can try a snapshot build
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/
Tilman

On 23.12.2023 13:14, Thaler Robert wrote:

thanks for your quick reply.

I just wanted to submit this bug and do not need a JIRA account.

regards
robert

On 2023/12/21 12:08:01 Tilman Hausherr wrote:

Hi,

I remember your name, you tried to create a JIRA account with the text
"submitting a bug", which was a meaningless text unlike your subject now
which is a meaningful text.
I was able to reproduce the problem and have created a ticket:
https://issues.apache.org/jira/browse/PDFBOX-5742
If you want to participate please register again (with a meaningful text
so we can assume it's you)
Tilman

On 21.12.2023 11:40, Thaler Robert wrote:

Hi,

After upgrading to PdfBox 3.0.x Splitter creates corrupted page pdfs.
It occurs occasionally with version 3.0.0 and frequently with 3.0.1.

No problems with version 2.0.30 - so we were forced to revert to
version 2.0.x.

Hopefully this can be fixed soon - sample pdfs attached.

kind regards
Robert









Re: Splitter creates corrupted PDFs

2023-12-21 Thread Tilman Hausherr

Hi,

I remember your name, you tried to create a JIRA account with the text 
"submitting a bug", which was a meaningless text unlike your subject now 
which is a meaningful text.

I was able to reproduce the problem and have created a ticket:
https://issues.apache.org/jira/browse/PDFBOX-5742
If you want to participate please register again (with a meaningful text 
so we can assume it's you)

Tilman

On 21.12.2023 11:40, Thaler Robert wrote:

Hi,

After upgrading to PdfBox 3.0.x Splitter creates corrupted page pdfs.
It occurs occasionally with version 3.0.0 and frequently with 3.0.1.

No problems with version 2.0.30 - so we were forced to revert to 
version 2.0.x.


Hopefully this can be fixed soon - sample pdfs attached.

kind regards
Robert





Re: Blank pages when splitting PDF with version 3.0.1

2023-12-19 Thread Tilman Hausherr

Hi,

Please retry with the current snapshot:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

if it doesn't work, please upload your file to a sharehoster that 
doesn't require login.


Tilman

On 19.12.2023 15:53, Marco Philipp GRAF wrote:

Hello PDFBox

When we use version 3.0.1 to split a PDF that was produced by scanning a 
document, the pages are blank:

java -jar .\pdfbox-app-3.0.1.jar split -i="scan-with-images-3-pages.pdf" -startPage="2" -endPage="3"

The output PDF shows only two blank pages.

Doing the same split with version 2.0.30 produces a PDF with the expected 
content on the two pages:

java -jar .\pdfbox-app-2.0.30.jar PDFSplit -startPage 2 -endPage 3 '.\scan-with-images-3-pages.pdf'

(the above command lines are for/from PowerShell on Windows)

Please note that we encountered this error only with certain scanned documents 
containing images. We also had issues with blank pages on split with version 
3.0.0 which we hoped were completely fixed with PDFBOX-5666.

As splitting the same PDF with version 2.0.30 works we assume this is a bug in 
version 3.0.1.

I would very much like to have added the PDF document to this mail, but 
unfortunately mails larger than 100 bytes are rejected. How can I send you 
the example PDF document for which splitting produces blank pages?

Cheers,
Marco








Re: PDF to PDF/A conversion on java

2023-12-19 Thread Tilman Hausherr

On 19.12.2023 00:24, CowwoC wrote:

I'm going to need to do something like this in the near future. Are there
any good samples or documentation I can look at for this use case?





import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.transform.TransformerException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.color.PDOutputIntent;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.DublinCoreSchema;
import org.apache.xmpbox.schema.PDFAIdentificationSchema;
import org.apache.xmpbox.schema.XMPBasicSchema;
import org.apache.xmpbox.type.BadFieldValueException;
import org.apache.xmpbox.xml.XmpSerializer;

public final class ConvertToPDFA
{
    private ConvertToPDFA()
    {
    }

    public static void main(String[] args) throws IOException, TransformerException
    {
        String file = "\\testme1.pdf";
        String file2 = "\\testme1-pdfa.pdf";

        try (PDDocument doc = Loader.loadPDF(new File(file)))
        {
            doc.setVersion(1.4f);

            // add XMP metadata
            XMPMetadata xmp = XMPMetadata.createXMPMetadata();
            try
            {
                DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
                dc.setTitle(file);

                // declare PDF/A-1b conformance
                PDFAIdentificationSchema id = xmp.createAndAddPDFAIdentificationSchema();
                id.setPart(1);
                id.setConformance("B");

                PDDocumentInformation info = new PDDocumentInformation();
                info.setCreator("PDFBox");
                XMPBasicSchema basicSchema = xmp.createAndAddXMPBasicSchema();
                basicSchema.setCreatorTool("PDFBox");

                XmpSerializer serializer = new XmpSerializer();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                serializer.serialize(xmp, baos, true);

                PDMetadata metadata = new PDMetadata(doc);
                metadata.importXMPMetadata(baos.toByteArray());
                doc.getDocumentCatalog().setMetadata(metadata);
                doc.setDocumentInformation(info);
            }
            catch (BadFieldValueException e)
            {
                // won't happen here, as the provided value is valid
                throw new IllegalArgumentException(e);
            }

            // sRGB output intent
            InputStream colorProfile = ConvertToPDFA.class.getResourceAsStream(
                    "/org/apache/pdfbox/resources/pdfa/sRGB.icc");
            PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
            intent.setInfo("sRGB IEC61966-2.1");
            intent.setOutputCondition("sRGB IEC61966-2.1");
            intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
            intent.setRegistryName("http://www.color.org");
            doc.getDocumentCatalog().addOutputIntent(intent);

            doc.save(file2);
        }
    }
}


Re: PDF to PDF/A conversion on java

2023-12-16 Thread Tilman Hausherr

On 21.11.2023 11:31, Kirandas vakkil wrote:

Hi All,

Can you please share if there is any resource on converting EXISTING PDF to
PDF/A in java.


There are commercial tools for this. PDFBox doesn't offer anything, 
however you can still do it if there are very few errors and you know 
how to fix them, and all files are from the same source. This is usually 
true for files from scanners. There you usually only have to add an 
output intent and the correct metadata.


Tilman





This will be of great help to me. Thanks in advance.

Regards,
Apache Patron







Re: PDFBox 3.0.1 renderer fails on certain files

2023-12-16 Thread Tilman Hausherr
The file you mention likely has an almost empty stream. The other 
viewers don't fail, that's the difference.


There might also be a different problem (object reference mismatch), so 
it would be nice to have the file. Despite the LZW compression, the part 
that fails isn't an image in this stack trace, it's the stream of a type 
4 function for a DeviceN colorspace.


Tilman

On 16.12.2023 00:11, John Lussmyer wrote:
I have a customer that uses a LOT of PDF files.  They currently have 2 
files that are failing when we try to render them.
The same files can be viewed with Acrobat Reader or Foxit PDF with no 
errors reported.


From Acrobat Reader file info:
PDF Producer: PDFOut V3.8 – build 201 – Oct 28 2022
PDF Version: 1.6 (Acrobat 7.x)

The stack trace makes me suspect that the file has an error in its 
image compression data, which other readers somehow ignore.


Any suggestions?

This is the exception trace from PDFBox 3.0.1

java.io.IOException: negative array index: -1 near offset 1
   at org.apache.pdfbox.filter.LZWFilter.checkIndexBounds(LZWFilter.java:136) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:110) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:70) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.filter.Filter.decode(Filter.java:96) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.filter.Filter.decode(Filter.java:238) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:73) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:172) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:166) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:188) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.common.PDStream.toByteArray(PDStream.java:407) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.common.function.PDFunctionType4.<init>(PDFunctionType4.java:51) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.common.function.PDFunction.create(PDFunction.java:143) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.graphics.color.PDDeviceN.<init>(PDDeviceN.java:93) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:184) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:223) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:193) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:56) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:892) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:530) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:505) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:282) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:330) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:247) ~[pdfbox-3.0.1.jar:3.0.1]
   at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:233) ~[pdfbox-3.0.1.jar:3.0.1]
   at com.metrixsoftware.preview.PDFBoxRenderer.render(PDFBoxRenderer.java:79) [bin/:?]











Re: Could not load font file

2023-12-14 Thread Tilman Hausherr

Hi,

The "SubstFormat" bug is not really important because it doesn't abort. 
The "Format 14 cmap table" message isn't really a bug; there are usually 
several tables.

Please try a snapshot version, the "SubstFormat" bug has been fixed:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

However I'm wondering why "LastResort.otf" fails. Could you upload that 
one somewhere? And also post the full stack trace?


Tilman

On 13.12.2023 23:30, Emily Vorderwülbeke wrote:

Hei,

I've been trying to write text into a pdf using PdfBox 3.0.1. on macOS 
14.1. When setting the value of a PDField I always get the following 
exception:


Could not load font file: /System/Library/Fonts/LastResort.otf 
java.io.IOException: Invalid character code 0xD800


After that some warnings appear, with more font files which can not be 
found


Format 14 cmap table is not supported and will be ignored
The expected SubstFormat for ExtensionSubstFormat1 subtable is 4 but 
should be 1
The expected SubstFormat for ExtensionSubstFormat1 subtable is 0 but 
should be 1

Format 14 cmap table is not supported and will be ignored
The expected SubstFormat for ExtensionSubstFormat1 subtable is 4 but 
should be 1
The expected SubstFormat for ExtensionSubstFormat1 subtable is 0 but 
should be 1

Unknown substFormat: 0
Format 14 cmap table is not supported and will be ignored

Even when I try to set the font of a PDPageContentStream in another 
example to Helvetica the same problems occur.


What exactly is going on here? As far as I know, all the missing fonts 
are installed on my laptop.


Thanks for any help.

Best,
Emily








Re: Regarding CMap invalid query

2023-12-13 Thread Tilman Hausherr

On 13.12.2023 17:26, Tmy Hub wrote:
I have a PDF that uses the Verdana Bold font with Identity-H encoding. 
We are not able to read the text in that font correctly.

It shows an invalid CMap. I will attach the PDF file.

What do we have to do about that? Please let us know; it would be very helpful for us.



Yes the /ToUnicode is empty. And the encoding is NOT identity:

It's not even VerdanaBold. Here's an attempt to fix a similar problem:

https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0


Tilman


Re: Fetch the background color for text in PDF

2023-12-05 Thread Tilman Hausherr
There is no such thing as "the background color". The background is 
whatever you have in the area when you're putting out the glyphs. It can 
be several colors if you're overwriting an image.


Tilman




On 06.12.2023 03:23, Jeffrey Matthew wrote:

Hey Team,

I'm new to pdfbox and working on it right now.
I wanted to know if there is sample code to fetch the background color for a
text with PDFBox.
The example provided in :
https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextColors.java?view=markup&pathrev=1904918

didn't work even though the mode is Fill(0).

PDFBox version: 3.0.0

Thanks,
Jeffrey







Re: Font operation takes a long time with 3.0.1

2023-12-05 Thread Tilman Hausherr
Thanks for the feedback. It turns out that there's another error 
(checksum was empty because MessageDigest doesn't support CRC32), which 
has been fixed now, please test again (delete the file first). The 
second-to-last field should now not be empty.


It also teaches an important lesson: a "// never happens" segment should 
have an output.


Tilman

On 05.12.2023 11:34, Kjetil Ødegaard wrote:

Nice! Tested it now and I can confirm that it fixes the issue. I see good
performance even from the first operation.

Checked the cache file and there is a line for this font there now:

➜  ~ grep -i NotoSansKannada .pdfbox.cache
*skipexception*|TTF||0|0|0|0|0||/System/Library/Fonts/NotoSansKannada.ttc||1700331239000

Thanks for the quick response, great work!

BR Kjetil
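
For anyone who wants to check their own cache file, a small sketch of how such a line can be picked apart. The only assumptions are that fields are separated by '|' and that the checksum is the second-to-last field, as mentioned in this thread; the class and method names are hypothetical and the layout is not taken from the PDFBox source.

```java
import java.util.Arrays;
import java.util.List;

public class CacheLineSketch
{
    // Split one cache line on '|'. The limit of -1 keeps empty fields,
    // including trailing ones, which a plain split() would drop.
    static List<String> fields(String line)
    {
        return Arrays.asList(line.split("\\|", -1));
    }

    // Assumption from this thread: the checksum is the second-to-last field.
    static String checksumField(String line)
    {
        List<String> f = fields(line);
        return f.get(f.size() - 2);
    }

    public static void main(String[] args)
    {
        String line = "*skipexception*|TTF||0|0|0|0|0||"
                + "/System/Library/Fonts/NotoSansKannada.ttc||1700331239000";
        System.out.println(fields(line).size());           // 12
        System.out.println(checksumField(line).isEmpty()); // true
    }
}
```

For the line quoted above, the checksum field is empty.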


Re: Font operation takes a long time with 3.0.1

2023-12-05 Thread Tilman Hausherr

Thanks, new snapshot build here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/ 



Ticket:
https://issues.apache.org/jira/browse/PDFBOX-5727

Tilman

On 05.12.2023 08:41, Kjetil Ødegaard wrote:

To clarify, this stack trace is not printed anywhere. I got it from
stepping into the code and invoking printStackTrace() on the exception to
get the whole stack. See complete stack trace below.

I agree with your theory, it matches what I'm seeing. These fonts are never
added to the cache file, so the cache file is always rebuilt.

I double checked the cache file again and there is no trace of these two
fonts, but lots of entries for other fonts (of different weights). I see
from the timestamp on the file that it is rebuilt on every run.

BR Kjetil

java.io.EOFException
at org.apache.fontbox.ttf.TTFDataStream.readUnsignedShort(TTFDataStream.java:154)
at org.apache.fontbox.ttf.TTFDataStream.readUnsignedShortArray(TTFDataStream.java:188)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readMultipleSubstitutionSubtable(GlyphSubstitutionTable.java:412)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupSubtable(GlyphSubstitutionTable.java:263)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupTable(GlyphSubstitutionTable.java:313)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupList(GlyphSubstitutionTable.java:247)
at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:102)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:365)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:165)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:144)
at org.apache.fontbox.ttf.TrueTypeCollection.getFontAtIndex(TrueTypeCollection.java:127)
at org.apache.fontbox.ttf.TrueTypeCollection.processAllFonts(TrueTypeCollection.java:109)
at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.addTrueTypeCollection(FileSystemFontProvider.java:665)
at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.scanFonts(FileSystemFontProvider.java:396)
at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.<init>(FileSystemFontProvider.java:367)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.<init>(FontMapperImpl.java:139)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider(FontMapperImpl.java:158)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:416)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:379)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:353)
at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:127)


Re: Font operation takes a long time with 3.0.1

2023-12-04 Thread Tilman Hausherr
Please do also post the full (for pdfbox / fontbox) stack trace. I have 
a theory why it happens, which is that addTrueTypeCollection() does not 
add the font as "*skipexception*" to the cache file because it's not 
done in the exception handler.


Tilman
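
The idea that a failing font must be recorded in the exception handler, so that it is not parsed again on the next scan, can be sketched roughly as follows. This is a simplified illustration with made-up names (SkipCacheSketch, Parser, scan); it keeps the skip list in memory and is not the actual FileSystemFontProvider logic, which persists entries in the cache file.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SkipCacheSketch
{
    // Hypothetical stand-in for a font parser.
    interface Parser
    {
        void parse(String path) throws IOException;
    }

    private final Set<String> skipped = new HashSet<>();
    private int parseAttempts = 0;

    void scan(List<String> paths, Parser parser)
    {
        for (String path : paths)
        {
            if (skipped.contains(path))
            {
                continue; // known bad file, do not retry
            }
            parseAttempts++;
            try
            {
                parser.parse(path);
            }
            catch (IOException ex)
            {
                // Recording the failure here is the step that matters:
                // without it the file is parsed again on every scan.
                skipped.add(path);
            }
        }
    }

    int getParseAttempts()
    {
        return parseAttempts;
    }

    boolean isSkipped(String path)
    {
        return skipped.contains(path);
    }

    public static void main(String[] args)
    {
        SkipCacheSketch sketch = new SkipCacheSketch();
        Parser parser = path ->
        {
            if (path.endsWith("bad.ttc"))
            {
                throw new EOFException(path);
            }
        };
        List<String> paths = Arrays.asList("good.ttf", "bad.ttc");
        sketch.scan(paths, parser);
        sketch.scan(paths, parser);
        System.out.println(sketch.getParseAttempts()); // 3
    }
}
```

Only failures are remembered in this sketch, so the good file is parsed on both scans and the bad one only once.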

On 04.12.2023 21:17, Tilman Hausherr wrote:
Does the stack trace appear at every start? If yes then it's a bug. 
The intent of the current code is that bad fonts aren't retried. The 
font cache file should contain a line with "*skipexception*" for that 
font. Can you look at it for the two font files?


I could change SHA512 to CRC32. It has the advantage that it won't 
trigger people who heard about MD5 


I made a test and CRC32 is 20% faster.

Tilman

On 04.12.2023 18:48, Gili Tzabari wrote:

I think the commit contains a typo:


    private static String computeHash(byte[] ba)
    {
        MessageDigest md;
        try
        {
            md = MessageDigest.getInstance("SHA512");
            byte[] md5 = md.digest(ba);
            return Hex.getString(md5);
        }
        catch (NoSuchAlgorithmException ex)
        {
            // never happens
            return "";
885 
<https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java?revision=1912514=markup=1912514#l885> 
}
886 
<https://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java?revision=1912514=markup=1912514#l886> 
}


You shouldn't need to use SHA512 to detect changes by a non-malicious 
actor. MD5 should be plenty, and even CRC32 would be enough. I 
suggest downgrading the hash complexity.


Gili

On 2023-12-04 10:21, Kjetil Ødegaard wrote:

Hi,

I tried to upgrade an app to PDFBox 3.0.1 and I see a performance 
issue.

It only affects the first PDF operation (after that it's quite 
fast), but it's a bit annoying since it takes about 20 seconds (on my 
M1 MacBook).

Profiling reveals that this Kotlin code triggers the delay:

 val font = PDType1Font(Standard14Fonts.FontName.COURIER)

The thread dump shows that almost all time is spent in this method:

org.apache.pdfbox.pdmodel.font.FileSystemFontProvider#computeHash

I assume that this is related to PDFBOX-5684.

Is this possible to work around? Or is it possible to fix?

BR Kjetil




-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org







Re: Font operation takes a long time with 3.0.1

2023-12-04 Thread Tilman Hausherr
Does the stack trace appear at every start? If yes then it's a bug. The 
intent of the current code is that bad fonts aren't retried. The font 
cache file should contain a line with "*skipexception*" for that font. 
Can you look at it for the two font files?
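
To check this, the cache file can simply be scanned for marker lines. The following is only a minimal sketch: it assumes that the cache is a plain-text file and that the marker and the font file name appear on the same line. The class and method names are hypothetical, and the location of the cache file has to be supplied by the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class CacheFileScanSketch
{
    // Marker text quoted in this thread; the surrounding line layout is an assumption.
    static final String SKIP_MARKER = "*skipexception*";

    // Returns all cache lines that carry the marker and mention the given font file name.
    static List<String> skippedEntries(Path cacheFile, String fontFileName)
    {
        try
        {
            // ISO-8859-1 maps every byte to a character, so unusual bytes cannot break the scan
            return Files.readAllLines(cacheFile, StandardCharsets.ISO_8859_1).stream()
                    .filter(line -> line.contains(SKIP_MARKER) && line.contains(fontFileName))
                    .collect(Collectors.toList());
        }
        catch (IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args)
    {
        // usage: java CacheFileScanSketch <cache file> <font file name>
        List<String> hits = skippedEntries(Paths.get(args[0]), args[1]);
        System.out.println(hits.isEmpty() ? "no marker line found" : String.join("\n", hits));
    }
}
```

If no line is printed for a font that throws at every start, that would be consistent with the font never being marked as bad.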


I could change SHA512 to CRC32. It has the advantage that it won't 
trigger people who heard about MD5 


I made a test and CRC32 is 20% faster.
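
For illustration, a CRC32-based hash could look roughly like the sketch below. It uses only java.util.zip.CRC32 and MessageDigest from the JDK. The class and method names are hypothetical and this is not the code that was committed; the SHA-512 variant is included only for comparison.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class HashSketch
{
    // Hypothetical CRC32 counterpart of computeHash(); returns 8 lowercase hex digits.
    static String computeCrc32(byte[] ba)
    {
        CRC32 crc = new CRC32();
        crc.update(ba, 0, ba.length);
        return String.format("%08x", crc.getValue());
    }

    // SHA-512 variant for comparison; returns 128 lowercase hex digits.
    static String computeSha512(byte[] ba)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(ba))
            {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
        catch (NoSuchAlgorithmException ex)
        {
            // SHA-512 is required on every Java platform
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args)
    {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(computeCrc32(data));  // prints cbf43926, the standard CRC-32 check value
        System.out.println(computeSha512(data));
    }
}
```

A 32-bit checksum is enough to notice accidental changes to a font file; it gives no protection against deliberate tampering, which is not the goal here.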

Tilman

On 04.12.2023 18:48, Gili Tzabari wrote:

I think the commit contains a typo:


    private static String computeHash(byte[] ba)
    {
        MessageDigest md;
        try
        {
            md = MessageDigest.getInstance("SHA512");
            byte[] md5 = md.digest(ba);
            return Hex.getString(md5);
        }
        catch (NoSuchAlgorithmException ex)
        {
            // never happens
            return "";
        }
    }


You shouldn't need to use SHA512 to detect changes by a non-malicious 
actor. MD5 should be plenty, and even CRC32 would be enough. I suggest 
downgrading the hash complexity.


Gili

On 2023-12-04 10:21, Kjetil Ødegaard wrote:

Hi,

I tried to upgrade an app to PDFBox 3.0.1 and I see a performance issue.

It only affects the first PDF operation (after that it's quite fast), 
but it's a bit annoying since it takes about 20 seconds (on my M1 MacBook).

Profiling reveals that this Kotlin code triggers the delay:

 val font = PDType1Font(Standard14Fonts.FontName.COURIER)

The thread dump shows that almost all time is spent in this method:

org.apache.pdfbox.pdmodel.font.FileSystemFontProvider#computeHash

I assume that this is related to PDFBOX-5684.

Is this possible to work around? Or is it possible to fix?

BR Kjetil







Re: Font operation takes a long time with 3.0.1

2023-12-04 Thread Tilman Hausherr
This should happen only once in 3.0.1, unless you're working with a container 
without a font cache file in the image.

The SHA512 checksum is computed only if the file modification date of a font 
file has changed; in that case we check whether the content has changed as well.
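
The idea can be sketched as follows. This is only an illustration of the two-step check (cheap timestamp comparison first, checksum only when the timestamp differs); the class, the CachedEntry record and the method names are hypothetical and do not mirror the actual FileSystemFontProvider code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FontCacheCheckSketch
{
    // Hypothetical cache record: what was stored about a font file at the last scan.
    static final class CachedEntry
    {
        final long lastModified;
        final String hash;

        CachedEntry(long lastModified, String hash)
        {
            this.lastModified = lastModified;
            this.hash = hash;
        }
    }

    // Hex-encoded SHA-512 of the file content.
    static String sha512Hex(Path file)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(Files.readAllBytes(file)))
            {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
        catch (IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
        catch (NoSuchAlgorithmException ex)
        {
            throw new IllegalStateException(ex);
        }
    }

    // Returns true if the font file has to be parsed again.
    static boolean needsReparse(Path file, CachedEntry cached)
    {
        if (cached == null)
        {
            return true; // never seen before
        }
        long modified;
        try
        {
            modified = Files.getLastModifiedTime(file).toMillis();
        }
        catch (IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
        if (modified == cached.lastModified)
        {
            return false; // cheap path: timestamp unchanged, no checksum needed
        }
        // timestamp changed: only now pay for the checksum and compare the content
        return !cached.hash.equals(sha512Hex(file));
    }

    public static void main(String[] args)
    {
        // usage: java FontCacheCheckSketch <font file>; with no cache entry the answer is always true
        System.out.println(needsReparse(Paths.get(args[0]), null));
    }
}
```

With an intact cache file, the expensive branch is reached only for fonts whose timestamp changed, which is why the delay should normally be seen once.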

Tilman

-- Original Message --
From: Kjetil Ødegaard 
Subject: Font operation takes a long time with 3.0.1
Date: 04.12.2023, 16:21
To: users@pdfbox.apache.org

Hi,

I tried to upgrade an app to PDFBox 3.0.1 and I see a performance issue.

It only affects the first PDF operation (after that it's quite fast), but
it's a bit annoying since it takes about 20 seconds (on my M1 MacBook).

Profiling reveals that this Kotlin code triggers the delay:

val font = PDType1Font(Standard14Fonts.FontName.COURIER)

The thread dump shows that almost all time is spent in this method:

org.apache.pdfbox.pdmodel.font.FileSystemFontProvider#computeHash

I assume that this is related to PDFBOX-5684.

Is this possible to work around? Or is it possible to fix?

BR Kjetil





Re: Odd OCG error

2023-11-22 Thread Tilman Hausherr

Great. The problem mentioned by Andreas will be fixed in the next version.
Tilman

On 22.11.2023 17:59, John Lussmyer wrote:
Thanks, that really helps.  Since we are too close to release to try a 
newer PDFBox jar, I just added this little bit of code to our system so 
these PDFs will work (the if statement before creating the 
"PDOptionalContentGroup").



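    // The Type entry may be an indirect object; replace it with a direct
    // COSName so that the PDOptionalContentGroup constructor accepts it.
    if (!COSName.OCG.equals(dict.getItem(COSName.TYPE))) {
        dict.setItem(COSName.TYPE, COSName.OCG);
    }
    PDOptionalContentGroup grp = new PDOptionalContentGroup(dict);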


On 11/21/2023 10:52 PM, Andreas Lehmkühler wrote:


On 21.11.23 at 21:26, John Lussmyer wrote:

Ugh, formatting mess.
For more info, this is the "addOCGs: OCG" log line just before the 
error message:

10:53:09.765 [etrix SwingWorker[0]] DEBUG ImposedPDFEngine - addOCGs: OCG COSDictionary{COSName{Name}:COSObject{COSNull{}};COSName{Type}:COSObject{COSName{OCG}};}

The value for the type is an indirect object. Usually such values are 
direct objects. The type check fails because it expects a direct object 
as the type value.







Re: Odd OCG error

2023-11-21 Thread Tilman Hausherr
Please retry with the 3.0.1 snapshot; bugs related to combining files 
were fixed there. If the bug is still there, please create a ticket in JIRA.


Tilman

On 21.11.2023 19:56, John Lussmyer wrote:
I'm using PDFBox 3.0.0 to combine some PDF files.  One of the files 
uses an Optional Content Group.
Note that this code has been working just fine for many other files, 
both with and without OCGs.


For this file, I get this exception:

java.lang.IllegalArgumentException: Provided dictionary is not of type 'COSName{OCG}'
    at org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup.<init>(PDOptionalContentGroup.java:48) ~[pdfbox-3.0.0.jar:3.0.0]


Code:

    if (obj instanceof COSDictionary) {
        COSDictionary dict = (COSDictionary) obj;
        COSName dType = dict.getCOSName(COSName.TYPE);
        if (dType == null) {
            continue;
        }
        if (dType.equals(COSName.OCG)) {
            log.debug("addOCGs: OCG {}", dict);
            PDOptionalContentGroup grp = new PDOptionalContentGroup(dict);
            ocProps.addGroup(grp);
            ocProps.setGroupEnabled(grp, layersON.contains(grp.getName()));
            changed = true;
        }
    }

 It's failing on the "new PDOptionalContentGroup(dict)" call.
Any ideas on why?






