[ https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-951: -------------------------------------- Fix Version/s: 1.8.0 > Text extraction has issues on some pdfs > --------------------------------------- > > Key: PDFBOX-951 > URL: https://issues.apache.org/jira/browse/PDFBOX-951 > Project: PDFBox > Issue Type: Bug > Components: Text extraction > Affects Versions: 1.4.0 > Reporter: Dusan Radojevic > Priority: Minor > Fix For: 1.8.0 > > Attachments: file1.pdf, file2.pdf, file3.pdf, PDFBOX951-file3.txt > > > Hi, > i have noticed a big improvement in latest releases. Extraction is better but > still has some problems. > I have attached some files where i had problems. > This is the code i use when extracting text: > PDFTextStripper stripper = new PDFTextStripper(); > stripper.setStartPage( 1 ); > stripper.setEndPage( 1 ); > stripper.setSortByPosition(true); > stripper.setWordSeparator("~"); > stripper.writeText(doc, sw); > And here are some extracted lines from file1.pdf (I have skipped few lines > and made them shorter because the problem is on the beggining of the line): > 01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL.... > 02. Dan~^as~R.B.~1/2 > finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-... > 03. Uto~20:45~2101~ARSENAL 0~1 > IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1.... > 04. Sre~20:45~2102~BIRMINGHAM 1~2 WEST > HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3..... > 05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~... > 06. Dan~^as~R.B.~igre bez > uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-.... > 07. Uto~20:30~2001~BLACKPOOL~MANCHESTER > UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0..... > 08. Uto~20:45~2002~WIGAN ASTON > VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9..... > 09. > Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55... > 10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / > 90'~POL.... > 11. > Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1..... > 12. Uto~20:45~2171~DONCASTER > BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~.... > 13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL > CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~.... > 14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA > ISHOD~POL.~.... > 15. > Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2.... > 16. Uto~20:45~2201~BRIGHTON COLCHESTER > 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~..... > 17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27..... > 18. Uto~20:45~2203~LEYTON MK > DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3..... > 19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL > 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10..... > Lines 07, 09, 17 are extracted well and well formated. > Lines 03 and 04 share the same problem, there are unnecessary spaces which > should be line separators (in my case "~" separates words). I have seen this > in other documents. > Lines 08 and 18 for example doesn't have word separator ("~") between two > team names. The space in the document between "Wigan" and "Aston Villa" > words is realy big. > Lines 16 and 19 doesn't have word separator between second team name and > first quota (COLCHESTER 1.70 and YEOVIL 1.67) > FILE2.pdf has another problem. Words are shuffled. Here is top line from this > file: > G~KONA^ANERMANY~DUPLA~PRVO~GOL GOL45' / 90'~UKUPNO GOLOVA~KOMBI > 1~ISHOD~ŠANSA~POLUVREME~NE GOL > games are extracted excelent from this file but lines that should be headers > are messed up: > GERMANY~HENDIKEP~DRUGO~UKUPNO~UKUPNO~PRVI~ZADNJI > DOMAC > GOLOVA~ GOLOVA~GOST > UKUPNO~DUPLA GOLOVA~VIŠE DAJE~WINNER 1~0:1~POLUVREME~PRVO~DRUGO~LOVA > POLUVREME~ POLUVREME~GO~JE~GIONL~GOL~POBEDA DA > And FILE3.pdf has some real issues. Nothing is extracted as it should be. > Here are some lines: > g~I~ o~ d~s~r~v~p~o~h > .~S~a~.~K~ v~"~o~L~v~i~"~r~e~s~.~a~n~o~K~č~a~ n~i~d~s~o~h~l~p~u~a~D~ > š~a~s~n~a~r~e~m~l~P~v~e~ u~o~-~ r~a~K~j~r~e~m~l~ v~e~u~o~p~a~o~ > v~l~g~r~v~p~o~n~p~u~o~o~k~U~ o~n~p~u~k~U~a~r~e~m~l~ v~l~g~ > g~r~u~d~v~e~u~o~p~o~o~o > g~A~ p~F~a~d~n~C~E~l~u~n > r~e~m~l~a~v~e~n~u~o~p > 9~0 > I hope this will help you people improve pdfbox. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira