[ 
https://issues.apache.org/jira/browse/PDFBOX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved PDFBOX-838.
----------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 1.3.0)
         Assignee: Jukka Zitting

This seems to have been fixed in some other issue, as I can't reproduce the 
problem anymore with the latest trunk.

> Problem with text extraction
> ----------------------------
>
>                 Key: PDFBOX-838
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-838
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>            Reporter: Dusan Radojevic
>            Assignee: Jukka Zitting
>         Attachments: listaMeridian.pdf, listaMillenium.pdf
>
>
> I want to write a parser for bookmaker odds lists published as PDF. I have 
> two files. One is extracted flawlessly and the other one has problems, 
> although the two files have an almost identical layout. The file 
> listaMillenium.pdf has problems with text extraction, while the other file 
> (listaMeridian.pdf) works fine.
> This is the code I used:
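>     import java.io.StringWriter;
>     import org.apache.pdfbox.pdmodel.PDDocument;
>     import org.apache.pdfbox.util.PDFTextStripper;
>
>     PDDocument doc = null;
>     StringWriter sw = new StringWriter();   // receives the extracted text
>     try {
>         doc = PDDocument.load("listaMillenium.pdf");
>
>         PDFTextStripper stripper = new PDFTextStripper();
>
>         // extract page 6 only
>         stripper.setStartPage(6);
>         stripper.setEndPage(6);
>
>         stripper.setSortByPosition(true);
>         stripper.setShouldSeparateByBeads(true);
>         stripper.setSuppressDuplicateOverlappingText(true);
>         stripper.setWordSeparator("~");
>         stripper.writeText(doc, sw);
>     } finally {
>         if (doc != null) {
>             doc.close();
>         }
>     }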
> For page 6 of the uploaded document (listaMillenium.pdf) the output lines 
> look like this:
> nedelja 37 - 14.09. Utorak, 15.09. Sreda i 16.09. Četvrtak~strana 6
> ~Football~UEFA Europa League~Rezultat~KONAČAN ISHOD~DUPLA 
> ŠANSA~POLUVREME-KRAJ~Hen~HENDIKEP
> ~dan~čas~šifra~45~90~1~X~2~1X~12~X2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~H~H1~HX~H2
> ~Cet~19:00~4041*~Salzburg~Man. 
> City~5.60~3.25~1.60~2.06~1.24~1.07~10.5~13.5~32.0~10.5~5.65~4.25~35.0~13.0~2.50~1~2.06~3.50~2.07
> ~Cet~19:00~4042*~Juventus~Lech 
> P.~1.20~5.25~10.5~1.08~3.50~1.50~21.0~70.0~4.75~9.00~20.0~40.0~19.0~27.0~-1~1.40~3.85~3.50
> ~Cet~19:00~4043*~Aris~Atl. 
> Madrid~3.50~3.20~1.95~1.67~1.25~1.21~7.00~13.0~30.0~7.25~5.05~4.80~30.0~13.0~3.25~1~1.67~3.30~2.80
> ~Cet~19:00~4044*~Leverkusen~Rosenborg~1.35~4.00~8.30~1.01~1.16~2.70~1.95~17.0~50.0~4.05~7.00~17.0~35.0~15.0~15.0~-1~1.63~3.70~2.70
> ~Cet~19:00~4045*~Lille~Sporting 
> L.~1.80~3.20~4.10~1.15~1.25~1.80~2.95~13.0~30.0~4.65~5.25~7.95~30.0~13.0~7.80~-1~2.45~3.45~1.80
> ~Cet~19:00~4046*~Levski 
> Sofia~Gent~2.00~3.20~3.35~1.23~1.25~1.64~3.35~13.0~30.0~4.85~5.00~7.00~30.0~13.0~6.75~-1~2.95~3.25~1.63
> ~Cet~19:00~4047*~Dinamo 
> Z.~Villarreal~3.35~3.20~2.00~1.64~1.25~1.23~6.75~13.0~30.0~7.00~5.00~4.85~30.0~13.0~3.35~1~1.63~3.25~2.95
> ~Cet~19:00~4048*~Club 
> Brugge~PAOK~2.10~3.15~3.15~1.26~1.26~1.58~3.50~13.0~30.0~4.95~5.00~6.65~30.0~13.0~6.40~-1~3.20~3.25~1.57
> ~Cet~19:00~4049*~AZ Alkmaar~Sheriff 
> Tiraspol~1.50~3.40~6.70~1.04~1.23~2.26~2.25~15.0~40.0~4.15~6.05~12.5~32.0~14.0~11.5~-1~1.87~3.60~2.24
> ~Cet~19:00~4050*~Dinamo 
> K.~BATE~1.40~3.75~7.65~1.02~1.18~2.52~2.05~17.0~40.0~4.10~6.65~15.0~32.0~14.0~14.0~-1~1.70~3.70~2.52
> ~Cet~19:00~4051*~Sparta 
> P.~Palermo~2.50~3.05~2.60~1.37~1.27~1.40~4.45~12.5~30.0~5.65~5.00~5.80~28.0~12.5~4.65~-1~4.40~3.20~1.40
> ~Cet~19:00~4052*~Lausanne~CSKA 
> Moscow~6.70~3.40~1.50~2.26~1.23~1.04~11.5~14.0~32.0~12.5~6.05~4.15~40.0~15.0~2.25~1~2.24~3.60~1.87
> ~Cet~21:05~4053*~Anderlecht~Zenit~2.60~3.05~2.50~1.40~1.27~1.37~4.65~12.5~28.0~5.80~5.00~5.65~30.0~12.5~4.45~1~1.40~3.20~4.40
> ~Cet~21:05~4054*~AEK~Hajduk~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
> ~CeCet~21:021:05~4055*~Stuttgart~Y. 
> Boys~1.60~3.25~5.60~1.07~1.24~2.06~2.50~13.0~35.0~4.25~5.65~10.5~32.0~13.5~10.5~-1~2.07~3.50~2.06
> The last line of this listing is wrong: some values are duplicated ("CeCet" 
> instead of "Cet", "21:021:05" instead of "21:05").
> The same problem occurs on almost every page of this list. Other lists (which 
> I have not uploaded) have the same problem.
> As I said, the other file (listaMeridian.pdf) does not have this issue.
> Maybe this will help you fix this and it will surely help me. :)
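
For anyone checking extracted text for the kind of doubling reported above, 
a minimal sketch of a heuristic check follows. It uses no PDFBox API: it only 
splits an output line on the word separator and flags fields in which a prefix 
of at least two characters is immediately followed by text that starts with 
the same prefix ("Ce" + "Cet", "21:0" + "21:05"). The class and method names 
are hypothetical, and the rule is an assumption about what the artifact looks 
like, so it can give false positives (a code such as "4040*" would be flagged).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
public class DoubledFieldCheck {

    // Heuristic: true when some prefix of length >= 2 is directly followed
    // by text that begins with that same prefix.
    static boolean looksDoubled(String field) {
        for (int k = 2; k <= field.length() / 2; k++) {
            if (field.startsWith(field.substring(0, k), k)) {
                return true;
            }
        }
        return false;
    }

    // Splits a line on the literal separator and collects suspicious fields.
    static List<String> suspiciousFields(String line, String separator) {
        List<String> hits = new ArrayList<>();
        for (String field : line.split(Pattern.quote(separator))) {
            if (looksDoubled(field)) {
                hits.add(field);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Start of the faulty line from the report, with the mail wrap undone.
        String line = "~CeCet~21:021:05~4055*~Stuttgart~Y. Boys~1.60~3.25";
        System.out.println(suspiciousFields(line, "~")); // [CeCet, 21:021:05]
    }
}
```

Running the same check over the output for listaMeridian.pdf would be one way 
to confirm that only the one file is affected.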

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
