[jira] [Assigned] (PDFBOX-4999) Dangerous COSDictionary.addAll(COSDictionary) method

2020-12-01 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-4999:
--

Assignee: Andreas Lehmkühler

> Dangerous COSDictionary.addAll(COSDictionary) method
> 
>
>              Key: PDFBOX-4999
>              URL: https://issues.apache.org/jira/browse/PDFBOX-4999
>          Project: PDFBox
>       Issue Type: Bug
> Affects Versions: 2.0.21, 3.0.0 PDFBox
>         Reporter: Michael Klink
>         Assignee: Andreas Lehmkühler
>         Priority: Critical
>          Fix For: 2.0.22, 3.0.0 PDFBox
>
>
> The method {{COSDictionary.addAll(COSDictionary)}} creates the impression, by 
> name and by JavaDoc comment,
> {code:java}
> /**
>  * This will add all of the dictionaries keys/values to this dictionary.
> ...
> {code}
> that it can be used for exactly that, adding all key/value pairs from the 
> argument dictionary to the current one, replacing old entries for the same 
> keys.
> If one looks at the implementation, though, one is in for a surprise:
> {code:java}
> /**
>  * This will add all of the dictionaries keys/values to this dictionary.
>  * Only called when adding keys to a trailer that already exists.
>  *
>  * @param dic The dictionaries to get the keys from.
>  */
> public void addAll(COSDictionary dic)
> {
>     dic.forEach((key, value) ->
>     {
>         /*
>          * If we're at a second trailer, we have a linearized pdf file, meaning that the first Size entry represents
>          * all of the objects so we don't need to grab the second.
>          */
>         if (!COSName.SIZE.equals(key) || !items.containsKey(COSName.SIZE))
>         {
>             setItem(key, value);
>         }
>     });
> }
> {code}
> Here an existing *Size* entry is explicitly not replaced!
> This appears to be a relic from the time when PDFBox parsed PDF documents front 
> to back, ignoring cross-reference streams, for improved results with 
> linearized files when merging trailer dictionaries.
> This special treatment of *Size* no longer makes sense; see [this Stack Overflow 
> answer|https://stackoverflow.com/a/64502740/1729265].
> Furthermore, this method is used in contexts other than creating trailer 
> unions; even some PDFBox methods use it to create arbitrary dictionary unions:
> * {{org.apache.pdfbox.pdmodel.PDDocument.assignAcroFormDefaultResource(PDAcroForm, COSDictionary)}}
> * {{org.apache.pdfbox.filter.JPXFilter.decode(InputStream, OutputStream, COSDictionary, int, DecodeOptions)}}
> * {{org.apache.pdfbox.examples.interactive.form.FieldTriggers.main(String[])}}
> * {{org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.PDImageXObject(PDStream, PDResources)}}
> * {{org.apache.pdfbox.pdmodel.graphics.image.PDInlineImage.PDInlineImage(COSDictionary, byte[], PDResources)}}
> * {{org.apache.pdfbox.pdmodel.graphics.image.PDInlineImageTest.testInlineImage()}}
> * {{org.apache.pdfbox.pdfparser.XrefTrailerResolver.setStartxref(long)}}
> (This list is what Eclipse reports as callers of that method. There may be 
> other, hidden calls.)
> Thus, this exception for *Size* should be removed once all usages of that 
> method in PDFBox have been analyzed.
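
The behaviour described in the issue can be reproduced in isolation. The following is a minimal sketch, not PDFBox code: plain `java.util` maps stand in for `COSDictionary`, and the class name `AddAllSketch` and both method names are hypothetical. The first method copies the condition from the quoted `addAll`; the second does what the method name and JavaDoc suggest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: plain maps instead of COSDictionary, strings instead of COSName.
class AddAllSketch {
    static final String SIZE = "Size";

    // Same condition as the quoted addAll: an existing Size entry is never replaced.
    static void addAllKeepingSize(Map<String, Object> target, Map<String, Object> source) {
        source.forEach((key, value) -> {
            if (!SIZE.equals(key) || !target.containsKey(SIZE)) {
                target.put(key, value);
            }
        });
    }

    // What the name and JavaDoc suggest: every entry replaces the old one.
    static void addAllOverwriting(Map<String, Object> target, Map<String, Object> source) {
        target.putAll(source);
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("Size", 42);
        source.put("Root", "new root");

        Map<String, Object> kept = new LinkedHashMap<>();
        kept.put("Size", 10);
        kept.put("Root", "old root");
        Map<String, Object> overwritten = new LinkedHashMap<>(kept);

        addAllKeepingSize(kept, source);
        addAllOverwriting(overwritten, source);

        System.out.println(kept);        // {Size=10, Root=new root}
        System.out.println(overwritten); // {Size=42, Root=new root}
    }
}
```

In the sketch only `Size` behaves differently between the two variants, which is why a caller building an arbitrary dictionary union would not notice the exception unless both dictionaries happen to contain a `Size` entry.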



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4709) PDFBox prints text poorly in comparison to Adobe, Chrome, other apps

2020-12-01 Thread Tres Finocchiaro (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241887#comment-17241887
 ] 

Tres Finocchiaro commented on PDFBOX-4709:
--

{quote}Surprisingly, this also happens in PDFBox 1.8, which uses font 
operations (or so I thought) instead of vector graphics. (*update: no, it uses 
graphics.drawGlyphVector()*)
{quote}
 

Whoops, I missed that updated note and sent our OpenJDK provider down a path of 
studying 1.8. I'm not sure if this information is helpful, but they've 
gotten back to us with some interesting findings, quoting:
{quote}[...] the reason why drawGlyphVector uses the fill(g.getOutline(x, y)) 
method in 
[PathGraphics.java#L650|https://github.com/openjdk/jdk/blob/461c5fc63708638c8f50aa89a298c2f45efd2a97/src/java.desktop/share/classes/sun/print/PathGraphics.java#L650]
is that PDFBox sets a transform on the GlyphVector in 
[PDSimpleFont.java#L349|https://github.com/apache/pdfbox/blob/41ae21bd4c3f304373d3b05f63af5325df248019/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/PDSimpleFont.java#L349],
which prevents the drawString() method in 
[PathGraphics.java#L957|https://github.com/openjdk/jdk/blob/461c5fc63708638c8f50aa89a298c2f45efd2a97/src/java.desktop/share/classes/sun/print/PathGraphics.java#L957]
from being called.

... 

The idea to make it work is to apply the glyph transform to the graphics 
instead of to the GlyphVector itself. Here is an example which I added to my 
PDFBox 1.8.16 fork: 
[pdfbox/commit|https://github.com/AlexanderScherbatiy/pdfbox/commit/4877e6606d28a439bec52fdf028f439d340e2700]

I investigated it to understand what happens with text printed by PDFBox 1.8. 
As a result, PDFBox 2.0 needs to be fixed not only by using 
2d.drawGlyphVector() but also by applying glyph vector transforms to the graphics.
{quote}
 

[~tilman] is this information useful or helpful?  If not, we can take this 
advice downstream and try to modify the behavior there.

 

!4d6a9ed5-45ef-4742-a21f-0903a883e2e8.jpeg!

 

 

> PDFBox prints text poorly in comparison to Adobe, Chrome, other apps
> 
>
>              Key: PDFBOX-4709
>              URL: https://issues.apache.org/jira/browse/PDFBOX-4709
>          Project: PDFBox
>       Issue Type: Bug
>       Components: Rendering
> Affects Versions: 2.0.14
>      Environment: Windows 10, AdoptOpenJDK 11, PDFBox 2.0.14
>         Reporter: Lite M Finocchiaro
>         Priority: Major
>           Labels: Zebra
>      Attachments: 4d6a9ed5-45ef-4742-a21f-0903a883e2e8.jpeg, 
> Adoberesult.jpg, Drug-print.pdf, Drug-print.pdf, IMG_20191219_130048_2.jpg, 
> PDFBoxVSgraphicsobj.jpg, PDFBoxresult.jpg, PrintedWithPDFBox.pdf, Screen Shot 
> 2019-12-22 at 2.20.54 PM.png, Screen Shot 2019-12-22 at 2.21.00 PM.png, 
> linux-thermal-test-graphics-frc-4pt.pdf, linux-thermal-test-graphics-frc.pdf, 
> linux-thermal-test-pdfbox-4pt.pdf
>
>
> When printing a PDF from PDFBox to a Zebra GK420d thermal label printer, the 
> text from the PDF is blurry and appears to have the top and bottom cut off 
> compared to printing the same file through Adobe Acrobat.






[jira] [Updated] (PDFBOX-4709) PDFBox prints text poorly in comparison to Adobe, Chrome, other apps

2020-12-01 Thread Tres Finocchiaro (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tres Finocchiaro updated PDFBOX-4709:
-
Attachment: 4d6a9ed5-45ef-4742-a21f-0903a883e2e8.jpeg




[jira] [Updated] (PDFBOX-4709) PDFBox prints text poorly in comparison to Adobe, Chrome, other apps

2020-12-01 Thread Tres Finocchiaro (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tres Finocchiaro updated PDFBOX-4709:
-
Attachment: (was: 4d6a9ed5-45ef-4742-a21f-0903a883e2e8.jpeg)




[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

2020-12-01 Thread Christian (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241476#comment-17241476
 ] 

Christian commented on PDFBOX-5029:


Hi Tilman, in your "sorted" files there are spaces between words, but the word 
order within each sentence is reversed. The text also does not follow the column 
order of the pdf file: it jumps from first line/first column to first 
line/second column to first line/third column, and so on. 
In addition, there is a problem with the positioning of some vowel signs on 
top of consonants: it is sometimes correct and sometimes wrong, even for the same 
vowel+consonant combination. The same goes for the order of some "consonant 
clusters". I'm not sure that is the right way to describe it, but it would 
be like the word "the" rendered as "hte", if that makes sense.

The "not sorted" files are even worse: spaces are missing, the word order is 
reversed, and the letters within each word are reversed as well. There is no 
"column issue" in this case. I summarize it with two examples:

Ex - sorted files: "the cat is red" --> "red is cat the" + the column issue.
Ex - not sorted files: "the cat is red" --> "dersitaceht" (no column issue)

In terms of "accuracy", my original utf-8 file attached above has no column 
issue and the words are in the right order within each sentence. We also noticed 
that the first word of each line in the first pdf column is missing. This does 
not make things easier, I guess. 

Ex - test_scraped.utf8 file: "the cat is red" -> "catisred" (no column issue + 
missing first word)
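
The three examples above can be written as small string transformations, which may help when checking extracted output against a known sentence. This is only an illustrative sketch on the ASCII sentence from the examples: the class and method names are hypothetical, it does not use Tika or PDFBox, and it does not model the column issue or the vowel-sign positioning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that mimics the three observed outputs on a plain sentence.
class ExtractionFailureSketch {
    // "sorted" files: words are intact, but their order is reversed.
    static String reverseWordOrder(String sentence) {
        List<String> words = new ArrayList<>(Arrays.asList(sentence.trim().split("\\s+")));
        Collections.reverse(words);
        return String.join(" ", words);
    }

    // "not sorted" files: spaces are dropped and the whole line is reversed.
    static String reverseWithoutSpaces(String sentence) {
        return new StringBuilder(sentence.replaceAll("\\s+", "")).reverse().toString();
    }

    // test_scraped.utf8: the first word is missing and spaces are dropped.
    static String dropFirstWordAndSpaces(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        return String.join("", Arrays.asList(words).subList(1, words.length));
    }

    public static void main(String[] args) {
        String sentence = "the cat is red";
        System.out.println(reverseWordOrder(sentence));       // red is cat the
        System.out.println(reverseWithoutSpaces(sentence));   // dersitaceht
        System.out.println(dropFirstWordAndSpaces(sentence)); // catisred
    }
}
```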

Thanks again for your help.

> Tika - Issues extracting Arabic script from pdf
> ---
>
>         Key: PDFBOX-5029
>         URL: https://issues.apache.org/jira/browse/PDFBOX-5029
>     Project: PDFBox
>  Issue Type: Bug
> Environment: Windows - Anaconda / Spyder
>    Reporter: Christian
>    Priority: Major
> Attachments: PDFBOX-5029-not-sorted-2.0.21.txt, 
> PDFBOX-5029-not-sorted-trunk.txt, PDFBOX-5029-sorted-2.0.21.txt, 
> PDFBOX-5029-sorted-trunk.txt, extracting_text_asian_pdf.py, test.pdf, 
> test_scraped.utf8
>
>
> I'm working on building a corpus of Uygur texts, and some of the content 
> comes from pdf files. I wrote a short Python script to scrape text from pdf 
> using tika-python. The text is written in Arabic script, and the output looks 
> good, but there is one major problem: many spaces between words are missing, 
> and I really do not know how to address this issue. I am attaching a pdf file, 
> the script to scrape its text and the output (test_scraped.utf8). Thanks in 
> advance for your help.


