Hi 

I am having an issue with the ordering of text extracted from certain PDFs. I 
have read around the subject and am aware that this is a common issue caused by 
the complex internals of the PDF format and the many changes to it over time.
I am hoping, however, that there may be a simple fix to my problem …

Unfortunately I cannot share the offending document, but hopefully I can 
describe the issue well enough that someone can guide me further.

So here goes …

On a page there are several columns / sections:

|     ID     |       31894442308      |     … an address ending with a zip code “44303” …     |

The extraction gets the order wrong and *runs the text elements together* (with 
no space or delimiter) like so:


… address … 4430331894442308 …

Joining the two numbers together causes a real problem for me because it 
creates a new number (not in the original), which some post-processing chokes on.
I have tried following the extraction in the debugger, but the code is quite 
involved and I cannot see the conditions that cause it.

Question: could a space or ‘\n’ be inserted after each of these virtual 
sections?
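
To make the request concrete, here is a minimal sketch in Python of the 
behaviour I have in mind. It is not tied to any particular library: Fragment, 
join_fragments and gap_threshold are names made up for illustration, the 
coordinates are invented, and it assumes all fragments sit on the same line.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text run; not a type from any real PDF library."""
    text: str
    x: float      # left edge of the run (invented units)
    width: float  # rendered width of the run


def join_fragments(fragments, gap_threshold=1.0, separator=" "):
    """Join fragments in extraction order.

    A separator is inserted whenever the next fragment does not start
    where the previous one ended: either it jumps backwards (a different
    column was emitted first) or it starts more than gap_threshold away.
    """
    parts = []
    prev = None
    for frag in fragments:
        if prev is not None:
            gap = frag.x - (prev.x + prev.width)
            if gap < 0 or gap > gap_threshold:
                parts.append(separator)
        parts.append(frag.text)
        prev = frag
    return "".join(parts)


# The zip code is emitted before the ID, as in the problem described above.
fragments = [
    Fragment("44303", x=400.0, width=30.0),
    Fragment("31894442308", x=120.0, width=66.0),
]
print(join_fragments(fragments))  # 44303 31894442308
```

Even if the order stays wrong, a separator like this would stop the two 
numbers from merging into one.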

Thanks

- Chris

Chris Bamford
Senior Developer
m: +44 7860 405292
p: +44 207 847 8700
w: www.mimecast.com
Address click here: www.mimecast.com/About-us/Contact-us/




