[ 
https://issues.apache.org/jira/browse/PDFBOX-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dusan Radojevic updated PDFBOX-951:
-----------------------------------

    Description: 
Hi,

i have noticed a big improvement in latest releases. Extraction is better but 
still has some problems.
I have attached some files where i had problems.

This is the code i use when extracting text:

PDFTextStripper stripper = new PDFTextStripper(); 
stripper.setStartPage( 1 );
stripper.setEndPage( i );
stripper.setSortByPosition(true);
stripper.setWordSeparator("~");
stripper.writeText(doc, sw);

And here are some extracted lines from file1.pdf (I have skipped few lines and 
made them shorter because the problem is on the beggining of the line):

01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
02. Dan~^as~R.B.~1/2 
finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
03. Uto~20:45~2101~ARSENAL      0~1      
IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST 
HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
06. Dan~^as~R.B.~igre bez 
uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
07. Uto~20:30~2001~BLACKPOOL~MANCHESTER 
UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
08. Uto~20:45~2002~WIGAN ASTON 
VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
09. 
Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 
90'~POL....
11. 
Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
12. Uto~20:45~2171~DONCASTER 
BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL 
CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA ISHOD~POL.~....
15. 
Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
16. Uto~20:45~2201~BRIGHTON COLCHESTER 
1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
18. Uto~20:45~2203~LEYTON MK DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 
1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....

Lines 07, 09, 17 are extracted well and well formated.
Lines 03 and 04 share the same problem, there are unnecessary spaces which 
should be line separators (in my case "~" separates words). I have seen this in 
other documents.
Lines 08 and 18 for example doesn't have word separator ("~") between two team 
names. The space in the document between "Wigan" and "Aston Villa"  words is 
realy big.
Lines 16 and 19 doesn't have word separator between second team name and first 
quota (COLCHESTER 1.70 and YEOVIL 1.67)


> Text extraction has issues on some pdfs
> ---------------------------------------
>
>                 Key: PDFBOX-951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-951
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Dusan Radojevic
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> Hi,
> i have noticed a big improvement in latest releases. Extraction is better but 
> still has some problems.
> I have attached some files where i had problems.
> This is the code i use when extracting text:
> PDFTextStripper stripper = new PDFTextStripper(); 
> stripper.setStartPage( 1 );
> stripper.setEndPage( i );
> stripper.setSortByPosition(true);
> stripper.setWordSeparator("~");
> stripper.writeText(doc, sw);
> And here are some extracted lines from file1.pdf (I have skipped few lines 
> and made them shorter because the problem is on the beggining of the line):
> 01. ENGLESKA CARLING KUP~KONA^AN ISHOD~DUPLA [ANSA~PRVO POL....
> 02. Dan~^as~R.B.~1/2 
> finala~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-...
> 03. Uto~20:45~2101~ARSENAL      0~1      
> IPSWICH~1.15~6.25~12.00~1.00~4.03~1.30~3.10~11.00~1.35~29.0~50~4.50~1....
> 04. Sre~20:45~2102~BIRMINGHAM     1~2    WEST 
> HAM~2.10~3.10~3.15~1.25~1.26~1.56~2.60~1.95~3.85~3.....
> 05. ENGLESKA 1~KONA^AN GOLOVI ISHOD~DUPLA [ANSA~PRVO POL.~45' / 90'~POL.~...
> 06. Dan~^as~R.B.~igre bez 
> uslova~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-....
> 07. Uto~20:30~2001~BLACKPOOL~MANCHESTER 
> UTD~7.50~4.20~1.36~2.69~1.12~1.00~7.00~2.35~1.75~13.0.....
> 08. Uto~20:45~2002~WIGAN ASTON 
> VILLA~2.80~3.00~2.35~1.45~1.28~1.32~3.65~1.93~2.9.....
> 09. 
> Sre~21:00~2003~LIVERPOOL~FULHAM~1.50~3.50~6.05~1.05~1.20~2.22~1.95~2.13~6.55...
> 10. ENGLESKA 2~KONA^AN GOLOVI A TIMA DUPLA ISHOD~DUPLA [ANSA~PRVO POL.~45' / 
> 90'~POL....
> 11. 
> Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2~1p2+2p2+~0-1.....
> 12. Uto~20:45~2171~DONCASTER 
> BARNSLEY~1.95~3.10~3.55~1.20~1.26~1.65~2.50~2.00~4.25~3.15~13.0~30~4.70~....
> 13. Uto~20:45~2172NOTTINGHAM FOREST~BRISTOL 
> CITY~1.55~3.45~5.50~1.07~1.21~2.12~2.00~2.12~....
> 14. ENGLESKA 3~KONA^AN DUPLA [ANSA~PRVO POL.~45' / 90'~GOLOVI LA 
> ISHOD~POL.~....
> 15. 
> Dan~^as~R.B.~90'~1~X~2~1X~12~X2~1~X~2~1-1~1-X~1-2~X-1~X-X~X-2~2-1~2-X~2-2....
> 16. Uto~20:45~2201~BRIGHTON COLCHESTER 
> 1.70~3.30~4.45~1.12~1.23~1.89~2.15~2.10~5.20~.....
> 17. Uto~20:45~2202~HARTLEPOOL~NOTTS COUNTY~2.50~3.05~2.60~1.37~1.27.....
> 18. Uto~20:45~2203~LEYTON MK 
> DONS~2.20~3.10~2.95~1.29~1.26~1.51~2.75~1.98~3.....
> 19. Uto~20:45~2204~SHEFFIELD WED~YEOVIL 
> 1.67~3.35~4.70~1.11~1.22~1.96~2.10~2.10.....
> Lines 07, 09, 17 are extracted well and well formated.
> Lines 03 and 04 share the same problem, there are unnecessary spaces which 
> should be line separators (in my case "~" separates words). I have seen this 
> in other documents.
> Lines 08 and 18 for example doesn't have word separator ("~") between two 
> team names. The space in the document between "Wigan" and "Aston Villa"  
> words is realy big.
> Lines 16 and 19 doesn't have word separator between second team name and 
> first quota (COLCHESTER 1.70 and YEOVIL 1.67)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to