date:20201129

[jira] [Updated] (PDFBOX-4999) Dangerous COSDictionary.addAll(COSDictionary) method

2020-11-29 Thread Jira



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-4999:
---
Fix Version/s: 3.0.0 PDFBox
   2.0.22

> Dangerous COSDictionary.addAll(COSDictionary) method
> 
>
> Key: PDFBOX-4999
> URL: https://issues.apache.org/jira/browse/PDFBOX-4999
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.21, 3.0.0 PDFBox
>Reporter: Michael Klink
>Priority: Critical
> Fix For: 2.0.22, 3.0.0 PDFBox
>
>
> The method {{COSDictionary.addAll(COSDictionary)}} creates the impression, by 
> name and by JavaDoc comment,
> {code:java}
> /**
>  * This will add all of the dictionaries keys/values to this dictionary.
> ...
> {code}
> that it can be used for exactly that, adding all key/value pairs from the 
> argument dictionary to the current one, replacing old entries for the same 
> keys.
>  If one looks at the implementation, though, one is in for a surprise:
> {code:java}
> /**
>  * This will add all of the dictionaries keys/values to this dictionary.
>  * Only called when adding keys to a trailer that already exists.
>  *
>  * @param dic The dictionaries to get the keys from.
>  */
> public void addAll(COSDictionary dic)
> {
> dic.forEach((key, value) ->
> {
> /*
>  * If we're at a second trailer, we have a linearized pdf file, meaning that the first Size entry represents
>  * all of the objects so we don't need to grab the second.
>  */
> if (!COSName.SIZE.equals(key) || !items.containsKey(COSName.SIZE))
> {
> setItem(key, value);
> }
> });
> }
> {code}
> Here an existing *Size* entry is explicitly not replaced!
> This appears to be a relic from times when PDFBox parsed PDF documents front 
> to back, ignoring cross reference streams, for improved results with 
> linearized files when merging trailer dictionaries.
> Nowadays this exceptional treatment of *Size* no longer makes any sense, 
> see [this Stack Overflow 
> answer|https://stackoverflow.com/a/64502740/1729265].
> Furthermore, this method is used in contexts other than creating trailer 
> unions. Even some PDFBox methods use it to create arbitrary dictionary unions:
> * {{org.apache.pdfbox.pdmodel.PDDocument.assignAcroFormDefaultResource(PDAcroForm, COSDictionary)}}
> * {{org.apache.pdfbox.filter.JPXFilter.decode(InputStream, OutputStream, COSDictionary, int, DecodeOptions)}}
> * {{org.apache.pdfbox.examples.interactive.form.FieldTriggers.main(String[])}}
> * {{org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.PDImageXObject(PDStream, PDResources)}}
> * {{org.apache.pdfbox.pdmodel.graphics.image.PDInlineImage.PDInlineImage(COSDictionary, byte[], PDResources)}}
> * {{org.apache.pdfbox.pdmodel.graphics.image.PDInlineImageTest.testInlineImage()}}
> * {{org.apache.pdfbox.pdfparser.XrefTrailerResolver.setStartxref(long)}}
> (This list is offered by eclipse as callers of that method. There may be 
> other, hidden calls.)
> Thus, this exception should be removed after all usages of that method in 
> PDFBox have been analyzed.
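
The surprise described in the report can be reproduced without PDFBox. The following is only a minimal sketch: a plain java.util.Map stands in for COSDictionary, string keys stand in for COSName, and the class and method names (AddAllSizeDemo, addAllKeepingSize) are hypothetical. Only the condition inside the loop mirrors the quoted implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not PDFBox code: a Map plays the role of COSDictionary.
final class AddAllSizeDemo
{
    // Mirrors the quoted condition: an existing "Size" entry is never replaced.
    static void addAllKeepingSize(Map<String, Object> target, Map<String, Object> source)
    {
        source.forEach((key, value) ->
        {
            if (!"Size".equals(key) || !target.containsKey("Size"))
            {
                target.put(key, value);
            }
        });
    }

    public static void main(String[] args)
    {
        Map<String, Object> target = new LinkedHashMap<>();
        target.put("Size", 10);
        target.put("Root", "old");

        Map<String, Object> source = new LinkedHashMap<>();
        source.put("Size", 42);
        source.put("Root", "new");

        addAllKeepingSize(target, source);

        // "Root" was replaced as the name addAll suggests, "Size" silently was not.
        System.out.println(target); // prints {Size=10, Root=new}
    }
}
```

A caller that expects a plain union would assume Size=42 here, which is the mismatch between name, JavaDoc and behaviour that the issue points out.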



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Re: [DISCUSS] Move static functions from COSArrayList

2020-11-29 Thread Andreas Lehmkuehler


On 25.12.19 at 12:57, Andreas Lehmkuehler wrote:

On 23.12.19 at 15:54, Maruan Sahyoun wrote:
I'd like to remove some of the static functions which work on/return a 
COSArray from COSArrayList and move these to COSArray.


For 2.x I'd deprecate them. In 3.0 I'd remove them.

Sample of currently used code

 public static COSArray convertStringListToCOSNameCOSArray( List<String> strings )
 {
     COSArray retval = new COSArray();
     for (String string : strings)
     {
         retval.add(COSName.getPDFName(string));
     }
     return retval;
 }

As one can see, there is no relation to COSArrayList.

Method naming could probably also be simplified a bit, such as 
COSArray.withCOSNames


WDYT?
We have to simplify as much as possible/reasonable. That said, I'm in favour of 
moving and renaming those methods.
In the context of PDFBOX-4954 I moved those methods as proposed but didn't 
remember that we had also discussed renaming them. I've just caught up on the 
renaming.


Andreas



Andreas



BR
Maruan



[jira] [Commented] (PDFBOX-4954) Reduce the usage of COSArrayList

2020-11-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240518#comment-17240518
 ] 

ASF subversion and git services commented on PDFBOX-4954:
-

Commit 1883944 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1883944 ]

PDFBOX-4954: rename static convert methods as proposed by Maruan some time ago 
on dev@

> Reduce the usage of COSArrayList
> 
>
> Key: PDFBOX-4954
> URL: https://issues.apache.org/jira/browse/PDFBOX-4954
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing
>Affects Versions: 2.0.21, 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: PDFBOX-3448-null-widths.pdf
>
>
> PDFBOX-4723 is about some issues with the usage of the COSArrayList and it 
> looks like there are several occasions where we might use a simple ArrayList 
> instead to minimize the impact of the described issue.




[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240465#comment-17240465
 ] 

Tilman Hausherr commented on PDFBOX-5029:
-

No, I did not use the script. I don't have Python installed. I have Tika, but I 
wanted to test with the latest PDFBox version because IMHO it can only be a 
PDFBox problem, if it is one.

> Tika - Issues extracting Arabic script from pdf
> ---
>
> Key: PDFBOX-5029
> URL: https://issues.apache.org/jira/browse/PDFBOX-5029
> Project: PDFBox
>  Issue Type: Bug
> Environment: Windows - Anaconda / Spyder
>Reporter: Christian 
>Priority: Major
> Attachments: PDFBOX-5029-not-sorted-2.0.21.txt, 
> PDFBOX-5029-not-sorted-trunk.txt, PDFBOX-5029-sorted-2.0.21.txt, 
> PDFBOX-5029-sorted-trunk.txt, extracting_text_asian_pdf.py, test.pdf, 
> test_scraped.utf8
>
>
> I'm working on building a corpus of Uygur texts and some of the content is 
> coming from pdf files. I wrote a short python script to scrape text from pdf 
> using tika-python. The script is Arabic, and the output looks good but there 
> is one major problem: there are many missing spaces between words and I 
> really do not know how to address this issue. I am attaching a pdf file, the 
> script to scrape its text and the output (test_scraped.utf8). Thanks in 
> advance for your help.




[jira] [Comment Edited] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240362#comment-17240362
 ] 

Christian  edited comment on PDFBOX-5029 at 11/29/20, 8:26 PM:
---

Thanks Tilman, will do - tomorrow I will be in touch with a colleague of mine 
who is a native speaker and I will provide you the exact lines and missing 
spaces (if any). I guess the files to look at are "the sorted" ones. Did you 
use my script to extract the text? 


was (Author: faggionato):
Thanks Tilman, will do - tomorrow I will be in touch with a native speaker and 
I will provide you the exact lines and missing spaces.


[jira] [Issue Comment Deleted] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian  updated PDFBOX-5029:
---
Comment: was deleted

(was: Also, what is the difference between the sorted and not-sorted files you 
attached? Did you use my script to extract the text? Thanks again.)


[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240363#comment-17240363
 ] 

Christian  commented on PDFBOX-5029:


Also, what is the difference between the sorted and not-sorted files you 
attached? Did you use my script to extract the text? Thanks again.


[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-11-29 Thread Christian (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240362#comment-17240362
 ] 

Christian  commented on PDFBOX-5029:


Thanks Tilman, will do - tomorrow I will be in touch with a native speaker and 
I will provide you the exact lines and missing spaces.


[jira] [Commented] (PDFBOX-5027) Protect/Encrypt PDF with multiple certificates on command line

2020-11-29 Thread jakatal (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240360#comment-17240360
 ] 

jakatal commented on PDFBOX-5027:
-

Well, I just worked a lot with Docker today and realized that repeating 
parameters is not that uncommon anymore these days - maybe I am too old-school.

Also, seeing that there are some issues with forbidden characters and separator 
characters, I tend to vote more for Tilman's idea.

Also, the code modification does not seem very dramatic either, pdfbox/Encrypt.java 
(starting from line 63):

 
{code:java}
private void encrypt( String[] args ) throws IOException, CertificateException
{
if( args.length < 1 )
{
usage();
}
else
{
AccessPermission ap = new AccessPermission();
String infile = null;
String outfile = null;
-   String certFile = null;
+   List listCertFile = new ArrayList();
@SuppressWarnings({"squid:S2068"})
String userPassword = "";
@SuppressWarnings({"squid:S2068"})
String ownerPassword = "";
int keyLength = 256;
PDDocument document = null;
try
{
for( int i=0; i {
try (InputStream inStream = new FileInputStream(certFile))
{
X509Certificate certificate = (X509Certificate) cf.generateCertificate(inStream);
recip.setX509(certificate);
}
ppp.addRecipient(recip);

+   });
ppp.setEncryptionKeyLength(keyLength);
document.protect(ppp);
}
else
{
StandardProtectionPolicy spp =
new StandardProtectionPolicy(ownerPassword, userPassword, ap);
spp.setEncryptionKeyLength(keyLength);
document.protect(spp);
}
document.save( outfile );
}
else
{
System.err.println( "Error: Document is already encrypted." );
}
}
{code}
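
The "repeated parameter" idea from the comment above could be handled roughly as follows. This is a standalone sketch and not the actual Encrypt.java change: the class name RepeatedCertArgs, the helper collectCertFiles and the sample file names are hypothetical, and only the "-cert" option name is taken from the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: gathers every value given via "-cert".
final class RepeatedCertArgs
{
    // Returns the argument following each "-cert" occurrence, in command line order.
    static List<String> collectCertFiles(String[] args)
    {
        List<String> certFiles = new ArrayList<>();
        for (int i = 0; i < args.length; i++)
        {
            if ("-cert".equals(args[i]))
            {
                if (i + 1 >= args.length)
                {
                    throw new IllegalArgumentException("-cert requires a file name");
                }
                // consume the value so it is not inspected as an option itself
                certFiles.add(args[++i]);
            }
        }
        return certFiles;
    }

    public static void main(String[] args)
    {
        // Sample arguments are made up for illustration.
        String[] sample = { "-cert", "alice.cer", "-cert", "bob.cer", "in.pdf", "out.pdf" };
        System.out.println(collectCertFiles(sample)); // prints [alice.cer, bob.cer]
    }
}
```

Each collected file name would then be turned into one recipient, as in the addRecipient loop of the snippet above. No separator character is needed, so file names are not restricted.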
 

 

> Protect/Encrypt PDF with multiple certificates on command line
> --
>
> Key: PDFBOX-5027
> URL: https://issues.apache.org/jira/browse/PDFBOX-5027
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Crypto
>Affects Versions: 2.0.21
>Reporter: jakatal
>Priority: Trivial
> Fix For: 2.0.22, 3.0.0 PDFBox
>
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Hi,
> PDFBox has (obviously) the ability to protect a file with several 
> certificates by adding the recipients' certificates one after another:
>  
>  
> {code:java}
> //Class PublicKeyProtectionPolicy has 
> public void addRecipient(PublicKeyRecipient recipient)
> {recipients.add(recipient);}
> {code}
> For the command line tool functionality, it just offers "-cert" with the 
> option to add a SINGLE certificate. I expect that in most serious use cases 
> actually two certificates are used to protect the document (the actual 
> recipient and the creator who still wants to be able to open the document as 
> well).
>  
> I propose to extend the command line functionality (Encrypt.java) by having 
> an iteration through several cert files, e.g. separated by a special character.
>  
> Thanks.
>  




[jira] [Commented] (PDFBOX-5027) Protect/Encrypt PDF with multiple certificates on command line

2020-11-29 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240317#comment-17240317
 ] 

Tilman Hausherr commented on PDFBOX-5027:
-

sure!


[jira] [Resolved] (PDFBOX-4836) Reduce the usage of ScatchFileBuffer when parsing a pdf

2020-11-29 Thread Jira



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-4836.

Resolution: Fixed

I guess there is still room for improvement, but the main goal to remove the 
usage of a ScratchFile when reading a pdf was achieved.

In favour of releasing 3.0.0 I'm closing this ticket, maybe new ones will follow

> Reduce the usage of ScatchFileBuffer when parsing a pdf
> ---
>
> Key: PDFBOX-4836
> URL: https://issues.apache.org/jira/browse/PDFBOX-4836
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: EDGE11896203.pdf, image-2020-05-17-16-40-28-712.png, 
> raw_image_demo.pdf
>
>
> Instead of using a scratch file buffer to read a COSStream, the parser should 
> use the source directly to reduce the memory footprint




[jira] [Commented] (PDFBOX-4952) PDF compression - object stream creation

2020-11-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240308#comment-17240308
 ] 

ASF subversion and git services commented on PDFBOX-4952:
-

Commit 1883936 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1883936 ]

PDFBOX-4952: sonar fixes

> PDF compression - object stream creation
> 
>
> Key: PDFBOX-4952
> URL: https://issues.apache.org/jira/browse/PDFBOX-4952
> Project: PDFBox
>  Issue Type: New Feature
>  Components: PDModel
>Affects Versions: 2.0.21
>Reporter: Christian Appl
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: image-2020-09-07-09-47-30-172.png, 
> image-2020-09-07-10-05-15-631.png
>
>
> I implemented a basic starting point to realize a PDF compression based on 
> PDFBox 2.0.22-SNAPSHOT
> I want to use this ticket, to ask if you would be interested in such a 
> feature and whether you would be interested to merge it into PDFBox.
> This is sort of a POC, only implementing some very basic functionality, that 
> surely must and could be extended further and it does only implement some 
> very basic and simplistic Unit Tests.
>  However it is able to reduce the size of resulting documents, and creates 
> objectstreams as defined in the PDF reference manual.
> *What it currently does:*
>  It provides the bundling and compression of objects to objectstreams -and 
> further applies simple content compression to a small selection of contents-.
> -To realize content compression, it provides a simple interface and abstract 
> class for "ContentCompressor"s which search a document for specific content, 
> that could be compressed and do compress that contents.-
> -Currently two content compressors exist:-
>  -_ImageCompressor_-
>  -Searches for simple images, that could be compressed using DCT.-
> -_UnencodedStreamCompressor_-
>  -Searches the document for yet unencoded streams and applies a Flate 
> compression where necessary.-
> -Both compressors can be parameterized using a centralized 
> "CompressParameters" instance which is passed to a new "saveCompressed" 
> method of PDDocument.-
> The compression is based on, modifies and is realized by a set of extensions 
> for the "COSWriter" class. Basically it organizes objects, that are passed to 
> the COSWriter in objectStreams -and applies content optimization where 
> necessary and possible-.
> Currently this does support encryption, but does not support linearization of 
> the compressed documents.
> *Caveat:*
>  If this feature is interesting to you, then I would not expect you to simply 
> merge this fork into 2.0.22. I am expecting that you would like to have some 
> details and concepts changed and am ready to implement changes that would be 
> required for this to work to your liking.
> *POC:*
>  4 resulting documents can be found in "target/test-output/compression" when 
> "COSDocumentCompressionTest" is run.
> *The Pull request can be found on Github at:*
>  [https://github.com/apache/pdfbox/pull/86]




[jira] [Commented] (PDFBOX-5027) Protect/Encrypt PDF with multiple certificates on command line

2020-11-29 Thread Maruan Sahyoun (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240213#comment-17240213
 ] 

Maruan Sahyoun commented on PDFBOX-5027:


Can't we go for a predefined separator, e.g. a comma? Documentation-wise it's 
easier to state that one should separate by a certain delimiter regardless 
of the platform in use. 
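
A fixed comma separator, as suggested here, could be parsed along these lines. This is a sketch only: CommaSeparatedCerts and splitCertFiles are hypothetical names and the file names are made up. It also assumes that surrounding whitespace and empty entries should be ignored, which the discussion does not specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits one "-cert" value on commas.
final class CommaSeparatedCerts
{
    // Splits the value on ',' and drops blank entries and surrounding whitespace.
    static List<String> splitCertFiles(String value)
    {
        List<String> certFiles = new ArrayList<>();
        for (String part : value.split(","))
        {
            String trimmed = part.trim();
            if (!trimmed.isEmpty())
            {
                certFiles.add(trimmed);
            }
        }
        return certFiles;
    }

    public static void main(String[] args)
    {
        System.out.println(splitCertFiles("alice.cer, bob.cer,")); // prints [alice.cer, bob.cer]
    }
}
```

The trade-off is that a file name containing a comma cannot be expressed with this scheme, which is the kind of forbidden-character issue raised elsewhere in this ticket.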


Re: [Heads-Up] Documentation

2020-11-29 Thread Andreas Lehmkuehler


On 24.11.20 at 09:05, sahy...@fileaffairs.de wrote:

On Tuesday, 24.11.2020, 08:25 +0100, Andreas Lehmkuehler wrote:

On 22.11.20 at 21:19, sahy...@fileaffairs.de wrote:

Dear Dev team,

In order to provide a base to slowly enhance our documentation, I'm currently
working on an addition to our site generator which already works in my local
repo. This will allow us to add code snippets from our examples to the
generated docs. To use it, the following code needs to be put into a document
where the code shall appear (as an example I'm using a reference to the
CreateCheckBox.java example for current trunk):

``` java
{% codesnippet 'interactive/form/CreateCheckBox.java' 'trunk' %}
```

In addition, in order to be able to put only parts of the code into the
documentation, the following comments can be added to the Java code:
//DOC-START
...
//DOC-END

The DOC-START/DOC-END pair can be placed multiple times in the Java code.
Everything between these special comment lines will be added; the other
content will be omitted. This will allow us to skip the license header, import
statements etc. and concentrate on the important bits.
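
To make the marker behaviour concrete, here is a small sketch of how such an extraction could work. It is not the site generator's actual implementation, which is not shown in this thread. DocSnippetExtractor and extract are hypothetical names, and returning nothing when a file contains no markers is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the site generator's code.
final class DocSnippetExtractor
{
    // Keeps only the lines enclosed by //DOC-START ... //DOC-END pairs.
    static List<String> extract(List<String> sourceLines)
    {
        List<String> snippet = new ArrayList<>();
        boolean inside = false;
        for (String line : sourceLines)
        {
            String trimmed = line.trim();
            if (trimmed.equals("//DOC-START"))
            {
                inside = true;
            }
            else if (trimmed.equals("//DOC-END"))
            {
                inside = false;
            }
            else if (inside)
            {
                // marker lines themselves are never copied, only the enclosed code
                snippet.add(line);
            }
        }
        return snippet;
    }

    public static void main(String[] args)
    {
        List<String> source = Arrays.asList(
                "import java.io.IOException;",
                "//DOC-START",
                "int a = 1;",
                "//DOC-END",
                "int skipped = 0;",
                "//DOC-START",
                "int b = 2;",
                "//DOC-END");
        System.out.println(extract(source)); // prints [int a = 1;, int b = 2;]
    }
}
```

Several marker pairs in one file simply contribute their contents in order, which matches the description that the pair can be placed multiple times.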

This way we have the benefit of testable code but also the ability to reuse
it in our docs.

WDYT?

I like the idea, thanks for the effort.

Just out of curiosity, how does the process work? Do those pages include the
code snippets dynamically, or are the pages still static, so that we have to
regenerate the website after each change within the relevant code pieces?


The code snippets are embedded when the site is generated, i.e. not
fetched at runtime. Fetching at runtime would be doable of course.
Given that when we do a release the examples don't change anymore for
that release, I think the static approach is suitable.

Everything is fine as proposed, I just wanted to know how it works :-)

Andreas


BR
Maruan



Andreas


BR
Maruan





