[ 
https://issues.apache.org/jira/browse/PDFBOX-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030659#comment-15030659
 ] 

Lars Torunski commented on PDFBOX-2996:
---------------------------------------

I can reproduce my test results as documented in diff-delta.png. Comparing the 
results with WinDiff, I also see the odd differences and the glyphs that 
neither of us understands.

In my opinion, the number of deltas can be used as a measure for comparing the 
sorting algorithms. If legacy merge sort, which was used until PDFBOX-1512, is 
taken as the baseline, then bubble sort should be used as its substitute.

Otherwise, the iterative quick sort that uses the right index for the pivot 
is the best choice as a replacement for the current recursive quick sort. This 
would solve PDFBOX-2996, but you should remember that Java 5 and 6 
produce different text extraction results than Java 7+ on certain PDF files.
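
To illustrate the idea, here is a minimal sketch of an iterative quick sort 
that keeps its pending ranges on an explicit stack instead of the call stack, 
so deep recursion cannot cause a StackOverflowError. It is not the code from 
the attached patches. The class name IterativeQuickSort and its method names 
are hypothetical, and it assumes that "right index" means the rightmost 
element of each range is taken as the pivot, as the attachment names ending 
in "withRightPivot" suggest.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only (hypothetical class, not the attached patch).
final class IterativeQuickSort
{
    private IterativeQuickSort()
    {
    }

    // Sorts the list in place; intended for random-access lists such as ArrayList.
    static <T> void sort(List<T> list, Comparator<? super T> cmp)
    {
        if (list.size() < 2)
        {
            return;
        }
        // Explicit stack of [low, high] ranges replaces the recursive calls.
        Deque<int[]> pending = new ArrayDeque<int[]>();
        pending.push(new int[] { 0, list.size() - 1 });
        while (!pending.isEmpty())
        {
            int[] range = pending.pop();
            int low = range[0];
            int high = range[1];
            if (low >= high)
            {
                continue; // empty or single-element range is already sorted
            }
            int p = partition(list, cmp, low, high);
            // Push the larger side first so the smaller side is handled next;
            // this keeps the number of pending ranges small.
            if (p - low > high - p)
            {
                pending.push(new int[] { low, p - 1 });
                pending.push(new int[] { p + 1, high });
            }
            else
            {
                pending.push(new int[] { p + 1, high });
                pending.push(new int[] { low, p - 1 });
            }
        }
    }

    // Lomuto partition with the rightmost element as pivot (assumption, see above).
    private static <T> int partition(List<T> list, Comparator<? super T> cmp,
            int low, int high)
    {
        T pivot = list.get(high);
        int store = low;
        for (int i = low; i < high; i++)
        {
            if (cmp.compare(list.get(i), pivot) < 0)
            {
                Collections.swap(list, i, store);
                store++;
            }
        }
        Collections.swap(list, store, high); // move pivot to its final position
        return store;
    }
}
```

Because the pivot is excluded from both sub-ranges, every step shrinks the 
remaining work, so the loop terminates even if the comparator is not fully 
consistent. With a rightmost pivot, already sorted input still costs 
quadratic time; it just no longer exhausts the call stack.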

> StackOverflow in Quicksort
> --------------------------
>
>                 Key: PDFBOX-2996
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2996
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 2.0.0
>         Environment: Java 7
>            Reporter: Manuel Aristaran
>         Attachments: 001991.pdf, Lars-v0-PDFBOX-2996.patch, 
> Lars-v1-PDFBOX-2996.patch, Lars-v2-PDFBOX-2996.patch, QuickSort.java, 
> TestSortingAlgorithms.java, artikel1_20_arab.pdf-sorted-bubble.txt, 
> artikel1_20_arab.pdf-sorted-diff.txt, 
> artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt, 
> artikel1_20_arab.pdf-sorted-iter.txt, 
> artikel1_20_arab.pdf-sorted-java8-legacyMergeSort.txt, 
> artikel1_20_arab.pdf-sorted-java8-timsort.txt, 
> artikel1_20_arab.pdf-sorted-qs-iterative-withMiddlePivot.txt, 
> artikel1_20_arab.pdf-sorted-qs-iterative-withRightPivot.txt, 
> artikel1_20_arab.pdf-sorted-qs-recursive.txt, 
> artikel1_20_arab.pdf-sorted-rekur.txt, diff-delta.png, failing_sort.pdf, 
> quicksort.patch
>
>
> Running PDFTextStripper through ExtractText triggers a StackOverflow 
> exception in the QuickSort implementation for [this particular 
> document|https://www.dropbox.com/s/6crie7y5gqadwa5/1.pdf?dl=0].
> To reproduce: {{java -jar pdfbox-app-1.8.11-SNAPSHOT.jar ExtractText -sort 
> failing_sort.pdf}}
> (Related to PDFBOX-1512)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org