[ https://issues.apache.org/jira/browse/PDFBOX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-838.
----------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 1.3.0)
         Assignee: Jukka Zitting

This seems to have been fixed in some other issue, as I can't reproduce the problem anymore with the latest trunk.

> Problem with text extraction
> ----------------------------
>
>                 Key: PDFBOX-838
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-838
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>            Reporter: Dusan Radojevic
>            Assignee: Jukka Zitting
>         Attachments: listaMeridian.pdf, listaMillenium.pdf
>
>
> I want to write a parser for bookmaker PDF lists with odds. I have two files.
> One works flawlessly and the other has problems, although the two files are
> almost identical in form. listaMillenium.pdf has problems with text
> extraction, while listaMeridian.pdf works fine.
> This is the code I used:
>     PDDocument doc = null;
>     StringWriter sw = new StringWriter();
>     try {
>         doc = PDDocument.load("listaMillenium.pdf");
>
>         PDFTextStripper stripper = new PDFTextStripper();
>
>         // extract page 6 only
>         stripper.setStartPage( 6 );
>         stripper.setEndPage( 6 );
>
>         stripper.setSortByPosition(true);
>         stripper.setShouldSeparateByBeads(true);
>         stripper.setSuppressDuplicateOverlappingText(true);
>         stripper.setWordSeparator("~");
>         stripper.writeText(doc, sw);
>     } finally {
>         if (doc != null) {
>             doc.close();
>         }
>     }
> For page 6 of the uploaded document (listaMillenium.pdf), the output looks
> like this:
> nedelja 37 - 14.09. Utorak, 15.09. Sreda i 16.09. Četvrtak~strana 6
> ~Football~UEFA Europa League~Rezultat~KONAČAN ISHOD~DUPLA ŠANSA~POLUVREME-KRAJ~Hen~HENDIKEP
> ~dan~čas~šifra~45~90~1~X~2~1X~12~X2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~H~H1~HX~H2
> ~Cet~19:00~4041*~Salzburg~Man. City~5.60~3.25~1.60~2.06~1.24~1.07~10.5~13.5~32.0~10.5~5.65~4.25~35.0~13.0~2.50~1~2.06~3.50~2.07
> ~Cet~19:00~4042*~Juventus~Lech P.~1.20~5.25~10.5~1.08~3.50~1.50~21.0~70.0~4.75~9.00~20.0~40.0~19.0~27.0~-1~1.40~3.85~3.50
> ~Cet~19:00~4043*~Aris~Atl. Madrid~3.50~3.20~1.95~1.67~1.25~1.21~7.00~13.0~30.0~7.25~5.05~4.80~30.0~13.0~3.25~1~1.67~3.30~2.80
> ~Cet~19:00~4044*~Leverkusen~Rosenborg~1.35~4.00~8.30~1.01~1.16~2.70~1.95~17.0~50.0~4.05~7.00~17.0~35.0~15.0~15.0~-1~1.63~3.70~2.70
> ~Cet~19:00~4045*~Lille~Sporting L.~1.80~3.20~4.10~1.15~1.25~1.80~2.95~13.0~30.0~4.65~5.25~7.95~30.0~13.0~7.80~-1~2.45~3.45~1.80
> ~Cet~19:00~4046*~Levski Sofia~Gent~2.00~3.20~3.35~1.23~1.25~1.64~3.35~13.0~30.0~4.85~5.00~7.00~30.0~13.0~6.75~-1~2.95~3.25~1.63
> ~Cet~19:00~4047*~Dinamo Z.~Villarreal~3.35~3.20~2.00~1.64~1.25~1.23~6.75~13.0~30.0~7.00~5.00~4.85~30.0~13.0~3.35~1~1.63~3.25~2.95
> ~Cet~19:00~4048*~Club Brugge~PAOK~2.10~3.15~3.15~1.26~1.26~1.58~3.50~13.0~30.0~4.95~5.00~6.65~30.0~13.0~6.40~-1~3.20~3.25~1.57
> ~Cet~19:00~4049*~AZ Alkmaar~Sheriff Tiraspol~1.50~3.40~6.70~1.04~1.23~2.26~2.25~15.0~40.0~4.15~6.05~12.5~32.0~14.0~11.5~-1~1.87~3.60~2.24
> ~Cet~19:00~4050*~Dinamo K.~BATE~1.40~3.75~7.65~1.02~1.18~2.52~2.05~17.0~40.0~4.10~6.65~15.0~32.0~14.0~14.0~-1~1.70~3.70~2.52
> ~Cet~19:00~4051*~Sparta P.~Palermo~2.50~3.05~2.60~1.37~1.27~1.40~4.45~12.5~30.0~5.65~5.00~5.80~28.0~12.5~4.65~-1~4.40~3.20~1.40
> ~Cet~19:00~4052*~Lausanne~CSKA Moscow~6.70~3.40~1.50~2.26~1.23~1.04~11.5~14.0~32.0~12.5~6.05~4.15~40.0~15.0~2.25~1~2.24~3.60~1.87
> ~Cet~21:05~4053*~Anderlecht~Zenit~2.60~3.05~2.50~1.40~1.27~1.37~4.65~12.5~28.0~5.80~5.00~5.65~30.0~12.5~4.45~1~1.40~3.20~4.40
> ~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
> ~CeCet~21:021:05~4055*~Stuttgart~Y. Boys~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
> The last line in this listing has problems: it somehow contains duplicated
> values (CeCet, 21:021:05).
> This issue appears on almost every page of this list. Other lists (that I
> have not uploaded) have the same problem.
> As I said, the other file (listaMeridian.pdf) does not have this issue.
> Maybe this will help you fix this, and it will surely help me. :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
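
For anyone re-checking extracted output against a newer build, here is a minimal sketch of one way to spot the kind of doubling seen in the last row of the report (`CeCet`, `21:021:05`). It is only a heuristic and not part of PDFBox: the class `DuplicateTokenCheck` and its methods are hypothetical names, and the rule (a token whose leading fragment of at least two characters is immediately repeated) is an assumption derived from that single example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: flags tokens that look like a
// partially duplicated value in "~"-separated PDFTextStripper output.
public class DuplicateTokenCheck {

    // True when the token has the shape p + q where q starts with p again,
    // e.g. "CeCet" (p = "Ce") or "21:021:05" (p = "21:0").
    // Prefixes shorter than 2 characters are ignored (assumption: this keeps
    // short tokens such as "X-X" or "11.5" from being flagged).
    static boolean looksDoubled(String token) {
        for (int i = 2; i < token.length(); i++) {
            if (token.startsWith(token.substring(0, i), i)) {
                return true;
            }
        }
        return false;
    }

    // Splits one output line on the word separator and collects suspicious tokens.
    static List<String> suspiciousTokens(String line, String separator) {
        List<String> hits = new ArrayList<String>();
        for (String token : line.split(Pattern.quote(separator))) {
            if (looksDoubled(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String bad = "~CeCet~21:021:05~4055*~Stuttgart~Y.";
        String good = "~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60";
        System.out.println(suspiciousTokens(bad, "~"));   // [CeCet, 21:021:05]
        System.out.println(suspiciousTokens(good, "~"));  // []
    }
}
```

Running this over each line written by `stripper.writeText(doc, sw)` would give a quick indication of whether the duplication still occurs; legitimate text that happens to repeat a fragment could be flagged as well, so any hit should be checked by eye.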