Hi, I am having an issue with the ordering of text extracted from certain PDFs. I have read around the subject and am aware that this is a common issue, caused by the complex internals of the PDF format and the many changes to it over time. I am hoping, however, that there may be a simple fix to my problem …
Unfortunately I cannot share the offending document, but hopefully I can describe the issue well enough that someone can guide me further. So here goes …
On a page there are several columns / sections:
| ID | 31894442308 | … an address ending with a zip code “44303” … |
The extraction gets the order wrong and *runs the text elements together* (with no space or delimiter), like so:
… address … 4430331894442308 …
Joining the two numbers together causes a real problem for me, because it creates a new number (not in the original) which some post-processing chokes on. I have tried following it in the debugger, but it is quite involved and I cannot see the conditions which cause it.
Question: could a space or ‘\n’ be inserted after each of these virtual sections?
Thanks - Chris
Chris Bamford, Senior Developer
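P.S. To make the request concrete, here is a minimal sketch of the behaviour I am after, written in plain Python rather than against any real extraction library. The `Fragment` type, the `join_fragments` function, the two tolerance parameters and all coordinates are invented for illustration; the only values taken from the document are the zip code and the ID. The idea is simply that a delimiter is emitted whenever two consecutively extracted fragments are not physically adjacent on the page.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical extracted text fragment (not a type from any real library)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline of the fragment
    width: float  # horizontal extent of the fragment


def join_fragments(fragments, gap_tolerance=1.0, line_tolerance=2.0):
    """Join fragments in extraction order, inserting a delimiter between any
    two consecutive fragments that do not touch each other on the page.

    Both tolerances are assumed values in page units and would need tuning.
    """
    parts = []
    previous = None
    for frag in fragments:
        if previous is not None:
            if abs(frag.y - previous.y) > line_tolerance:
                # Different baseline: treat as a new section or line.
                parts.append("\n")
            elif abs(frag.x - (previous.x + previous.width)) > gap_tolerance:
                # Same baseline, but a horizontal gap (or a jump backwards).
                parts.append(" ")
        parts.append(frag.text)
        previous = frag
    return "".join(parts)


# The zip code is extracted first, then the ID from another section.
# Coordinates are made up; only the two numbers come from my document.
out_of_order = [
    Fragment("44303", x=300.0, y=80.0, width=30.0),
    Fragment("31894442308", x=120.0, y=100.0, width=66.0),
]
print(repr(join_fragments(out_of_order)))  # '44303\n31894442308'
```

With something like this, the extraction order could still be wrong, but the two numbers would at least stay separate, which is all my post-processing needs.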

