[ 
https://issues.apache.org/jira/browse/PDFBOX-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388809#comment-17388809
 ] 

flywire commented on PDFBOX-5247:
---------------------------------

Extracts from the file showing the font and the ToUnicode CMap used to decode character codes to text:

{noformat}

...
BT
/PFLLBD+TimesNewRomanPSMT 11.00000 Tf
8.00 11.00 Td
0.50196 0.50196 0.50196 rg
0.26403 Tc
 ( * H Q H U D W H G  $ W      $ S U                $ 0) Tj
 ET
...
25 0 obj
<<
/Type /Font
/Subtype /Type0
/BaseFont /PFLLBD+TimesNewRomanPSMT
/Name /PFLLBD+TimesNewRomanPSMT
/DescendantFonts [29 0 R]
/ToUnicode 30 0 R
/Encoding /Identity-H
>>
endobj
...
30 0 obj
<<
/Length 1317
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe)/Ordering (UCS)/Supplement 0>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0003><00b4>
endcodespacerange
52 beginbfrange
<0003><0003><00A0>
<000B><000B><0028>
<000C><000C><0029>
<000F><000F><002C>
<0010><0010><00AD>
<0011><0011><002E>
<0013><0013><0030>
<0014><0014><0031>
<0015><0015><0032>
<0017><0017><0034>
<001D><001D><003A>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0029><0029><0046>
<002A><002A><0047>
<002B><002B><0048>
<002C><002C><0049>
<002D><002D><004A>
<0030><0030><004D>
<0032><0032><004F>
<0033><0033><0050>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<003A><003A><0057>
<003B><003B><0058>
<0044><0044><0061>
<0045><0045><0062>
<0046><0046><0063>
<0047><0047><0064>
<0048><0048><0065>
<0049><0049><0066>
<004A><004A><0067>
<004B><004B><0068>
<004C><004C><0069>
<004D><004D><006A>
<004E><004E><006B>
<004F><004F><006C>
<0050><0050><006D>
<0051><0051><006E>
<0052><0052><006F>
<0053><0053><0070>
<0054><0054><0071>
<0055><0055><0072>
<0056><0056><0073>
<0057><0057><0074>
<0058><0058><0075>
<0059><0059><0076>
<005A><005A><0077>
<005C><005C><0079>
<00B4><00B4><201D>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj

Tj line as Hex
28 00 2a 00 48 00 51 00 48 00 55 00 44 00 57 00 
48 00 47 00 03 00 24 00 57 00 1d 00 03 00 13 00 
14 00 03 00 24 00 53 00 55 00 03 00 15 00 13 00 
15 00 14 00 03 00 13 00 14 00 1d 00 13 00 13 00 
1d 00 14 00 17 00 03 00 24 00 30 29 20 20 54 6a
{noformat}
 

> Space in pdf returns c2 a0 characters instead of 20
> ---------------------------------------------------
>
>                 Key: PDFBOX-5247
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5247
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.16, 2.0.24
>         Environment: Portfolio Performance
> Version: 0.54.0 (Jul. 2021)
> Platform: win32, x86_64
> Java: 11.0.4+11-LTS, Azul Systems, Inc.
> Locale: AU
>            Reporter: flywire
>            Priority: Minor
>         Attachments: PDFBoxSpaceSample.pdf, PDFBoxSpaceSample.pdf.txt
>
>
> *PDF containing:*
> SelfWealth Limited ABN: 52 154 324 428 AFSL 421789 W: www.selfwealth.com.au 
> E: [email protected]
> This trade was executed and cleared by OpenMarkets Australia Ltd ABN 38 090 
> 472 012,
> AFSL 246 705, Market Particpant of ASX, CHI­X and NSX.
> Buy Confirmation
>  
> *Gives (see hex on right side):*
> !https://user-images.githubusercontent.com/11288701/126945391-18c0ccb4-289d-49cd-85a8-8714e145df3f.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
