[ https://issues.apache.org/jira/browse/PDFBOX-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Torunski updated PDFBOX-2996:
----------------------------------
    Attachment: artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt

I just added the output file of the iterative quicksort algorithm with the pivot chosen from the right index. The output changes again: artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt

Even if the spd file isn't the right test file for us, it shows how the different sorting algorithms affect the text extraction. Maybe the stability of the algorithms has an impact on that. Hence the output file of Java 6 would be a good reference. Unfortunately I don't have Java 6 on my Mac either. I'm going to use the useLegacyMergeSort option to extract the file.

If a stable sorting algorithm, like the one used in Java 6, is really needed, then the fix in PDFBOX-1512 with quicksort was the wrong choice. My hypothesis is that the quicksort algorithm with the right index as the pivot works for most PDFs, but not for all.

> StackOverflow in Quicksort
> --------------------------
>
>                 Key: PDFBOX-2996
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2996
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 2.0.0
>         Environment: Java 7
>            Reporter: Manuel Aristaran
>         Attachments: 001991.pdf, Lars-v0-PDFBOX-2996.patch,
> Lars-v1-PDFBOX-2996.patch, Lars-v2-PDFBOX-2996.patch, QuickSort.java,
> artikel1_20_arab.pdf-sorted-bubble.txt, artikel1_20_arab.pdf-sorted-diff.txt,
> artikel1_20_arab.pdf-sorted-iter-withRightPivot.txt,
> artikel1_20_arab.pdf-sorted-iter.txt, artikel1_20_arab.pdf-sorted-rekur.txt,
> failing_sort.pdf, quicksort.patch
>
>
> Running PDFTextStripper through ExtractText triggers a StackOverflow
> exception in the QuickSort implementation for [this particular
> document|https://www.dropbox.com/s/6crie7y5gqadwa5/1.pdf?dl=0].
> To reproduce: {{java -jar pdfbox-app-1.8.11-SNAPSHOT.jar ExtractText -sort failing_sort.pdf}}
> (Related to PDFBOX-1512)
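
The stability hypothesis in the comment can be illustrated with a minimal Java sketch. It is not the code from the attached patches or from PDFBox: the class SortStabilityDemo, the Fragment type, and the BY_Y comparator are hypothetical names made up for this example, and the quicksort is a generic iterative variant (Lomuto partition, rightmost element as pivot). The comparator looks only at y, so fragments with the same y compare as equal, and only a stable sort is guaranteed to keep them in their original order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

public class SortStabilityDemo {

    /** Hypothetical stand-in for a piece of extracted text; only y is compared. */
    static final class Fragment {
        final int y;
        final String text;

        Fragment(int y, String text) {
            this.y = y;
            this.text = text;
        }
    }

    /** Compares by y only, so fragments with the same y count as equal. */
    static final Comparator<Fragment> BY_Y = new Comparator<Fragment>() {
        @Override
        public int compare(Fragment a, Fragment b) {
            return Integer.compare(a.y, b.y);
        }
    };

    /** Iterative quicksort: explicit stack of ranges, rightmost element as pivot. */
    static <T> void iterativeQuickSort(List<T> list, Comparator<? super T> cmp) {
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {0, list.size() - 1});
        while (!stack.isEmpty()) {
            int[] range = stack.pop();
            int lo = range[0];
            int hi = range[1];
            if (lo >= hi) {
                continue; // zero or one element: nothing to sort
            }
            T pivot = list.get(hi);
            int i = lo;
            for (int j = lo; j < hi; j++) {
                if (cmp.compare(list.get(j), pivot) < 0) {
                    Collections.swap(list, i, j);
                    i++;
                }
            }
            Collections.swap(list, i, hi); // move the pivot to its final position
            stack.push(new int[] {lo, i - 1});
            stack.push(new int[] {i + 1, hi});
        }
    }

    /** Three fragments share y = 2; their input order is a, b, d. */
    static List<Fragment> sample() {
        List<Fragment> list = new ArrayList<Fragment>();
        list.add(new Fragment(2, "a"));
        list.add(new Fragment(2, "b"));
        list.add(new Fragment(1, "c"));
        list.add(new Fragment(2, "d"));
        return list;
    }

    static String texts(List<Fragment> list) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : list) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    static String stableResult() {
        List<Fragment> list = sample();
        Collections.sort(list, BY_Y); // documented to be stable
        return texts(list);
    }

    static String quickSortResult() {
        List<Fragment> list = sample();
        iterativeQuickSort(list, BY_Y);
        return texts(list);
    }

    public static void main(String[] args) {
        System.out.println("stable sort: " + stableResult());    // c a b d
        System.out.println("quicksort:   " + quickSortResult()); // c d b a
    }
}
```

Both results are correctly sorted by y, but they differ in the order of the equal elements, which is the kind of difference that could show up as changed text extraction output. The useLegacyMergeSort option mentioned in the comment presumably refers to the JVM system property java.util.Arrays.useLegacyMergeSort, which is set on the command line with -Djava.util.Arrays.useLegacyMergeSort=true.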