[ 
https://issues.apache.org/jira/browse/PDFBOX-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Torunski updated PDFBOX-2996:
----------------------------------
    Attachment: artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt

Just added the output file of the iterative quicksort algorithm with the pivot 
chosen from the right index. The output changes again.

artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt

Even if the spd file isn't the right test file for us, it shows how the 
different sorting algorithms affect the text extraction. Maybe the 
stability of the algorithms has an impact on that. Hence the output file of 
Java 6 would be a good reference. Unfortunately, I don't have Java 6 on my Mac 
either. I'm going to use the useLegacyMergeSort option to extract the file.
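
For reference, the legacy merge sort is selected with a JVM system property, 
typically passed on the command line as 
-Djava.util.Arrays.useLegacyMergeSort=true. The sketch below only reports 
whether that property is set and shows what "stable" means for the JDK sort: 
elements that compare as equal keep their input order. The class name, the 
helper names and the sample strings are made up for illustration and are not 
part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LegacySortCheck {

    /** True if the JVM was started with -Djava.util.Arrays.useLegacyMergeSort=true. */
    static boolean legacyMergeSortRequested() {
        return Boolean.getBoolean("java.util.Arrays.useLegacyMergeSort");
    }

    /** Sorts a copy of the input by first character only, so "a1" and "a2" compare as equal. */
    static List<String> sortByFirstChar(List<String> input) {
        List<String> copy = new ArrayList<String>(input);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Character.compare(a.charAt(0), b.charAt(0));
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        System.out.println("legacy merge sort requested: " + legacyMergeSortRequested());
        // Collections.sort is documented as stable: a1 stays before a2, b1 before b2.
        System.out.println(sortByFirstChar(Arrays.asList("b1", "a1", "b2", "a2")));
        // prints [a1, a2, b1, b2]
    }
}
```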

If a stable sorting algorithm, as used in Java 6, is really needed, then 
the fix in PDFBOX-1512 with quicksort was the wrong choice. My hypothesis is 
that the quicksort algorithm with the right index as the pivot works for 
most PDFs, but not for all.
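
To illustrate the stability point, here is a minimal sketch of an iterative 
quicksort that takes the rightmost element as the pivot, compared with the 
stable JDK sort on the same input. It is not the QuickSort.java attached to 
this issue; the class name, the Item type and the sample data are assumptions 
made for the example. The tags a, b, c, d mark the input order of the items.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

public class RightPivotQuickSortDemo {

    /** A sort key plus a tag that records the original position. */
    static final class Item {
        final int key;
        final String tag;
        Item(int key, String tag) { this.key = key; this.tag = tag; }
        @Override public String toString() { return key + tag; }
    }

    /** Compares by key only, so items with the same key are "equal". */
    static final Comparator<Item> BY_KEY = new Comparator<Item>() {
        @Override public int compare(Item a, Item b) {
            return Integer.compare(a.key, b.key);
        }
    };

    /** Iterative quicksort (explicit stack, Lomuto partition, rightmost pivot). Not stable. */
    static <T> void quickSort(List<T> list, Comparator<? super T> cmp) {
        Deque<int[]> stack = new ArrayDeque<int[]>();
        stack.push(new int[] {0, list.size() - 1});
        while (!stack.isEmpty()) {
            int[] range = stack.pop();
            int lo = range[0];
            int hi = range[1];
            if (lo >= hi) {
                continue; // zero or one element, nothing to do
            }
            T pivot = list.get(hi);
            int store = lo;
            for (int i = lo; i < hi; i++) {
                if (cmp.compare(list.get(i), pivot) < 0) {
                    Collections.swap(list, i, store);
                    store++;
                }
            }
            Collections.swap(list, store, hi); // move pivot to its final position
            stack.push(new int[] {lo, store - 1});
            stack.push(new int[] {store + 1, hi});
        }
    }

    static List<Item> sample() {
        List<Item> items = new ArrayList<Item>();
        items.add(new Item(2, "a"));
        items.add(new Item(2, "b"));
        items.add(new Item(1, "c"));
        items.add(new Item(2, "d"));
        return items;
    }

    public static void main(String[] args) {
        List<Item> quick = sample();
        quickSort(quick, BY_KEY);
        System.out.println(quick);  // prints [1c, 2d, 2b, 2a]

        List<Item> stable = sample();
        Collections.sort(stable, BY_KEY);
        System.out.println(stable); // prints [1c, 2a, 2b, 2d]
    }
}
```

Both results are correctly ordered by key, but the quicksort reorders the 
items with key 2. If the extraction output depends on the relative order of 
such equal elements, the two sorts would produce different text.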


> StackOverflow in Quicksort
> --------------------------
>
>                 Key: PDFBOX-2996
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2996
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 2.0.0
>         Environment: Java 7
>            Reporter: Manuel Aristaran
>         Attachments: 001991.pdf, Lars-v0-PDFBOX-2996.patch, 
> Lars-v1-PDFBOX-2996.patch, Lars-v2-PDFBOX-2996.patch, QuickSort.java, 
> artikel1_20_arab.pdf-sorted-bubble.txt, artikel1_20_arab.pdf-sorted-diff.txt, 
> artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt, 
> artikel1_20_arab.pdf-sorted-iter.txt, artikel1_20_arab.pdf-sorted-rekur.txt, 
> failing_sort.pdf, quicksort.patch
>
>
> Running PDFTextStripper through ExtractText triggers a StackOverflow 
> exception in the QuickSort implementation for [this particular 
> document|https://www.dropbox.com/s/6crie7y5gqadwa5/1.pdf?dl=0].
> To reproduce: {{java -jar pdfbox-app-1.8.11-SNAPSHOT.jar ExtractText -sort 
> failing_sort.pdf}}
> (Related to PDFBOX-1512)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
