I know PDFBox can extract text from a region of a page, and I also remember seeing many e-mails on this list about improvements to preserving spacing in text extraction. If you haven't already, search the mailing list archives and see if any of those e-mails help you. I haven't done any text extraction myself, but hopefully someone else on the list will be able to point you in the right direction.
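To make the idea concrete, here is a rough sketch of how a "zone" extraction that keeps spacing and line breaks could work in principle. It does not call PDFBox at all: the `Fragment` class, the `extract` method, and the fixed `charWidth` and `lineHeight` parameters are hypothetical names made up for illustration, and it assumes some parser has already reported each text run with its page coordinates (y growing downward). Real documents with proportional fonts would need something smarter than a fixed character grid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

public class ZoneText {
    /** Hypothetical positioned text run: origin (x, y) in page units, y growing downward. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the text inside the zone, laid out on a fixed character grid. */
    static String extract(List<Fragment> fragments,
                          double left, double top, double right, double bottom,
                          double charWidth, double lineHeight) {
        // Keep only fragments whose origin lies inside the selected zone.
        List<Fragment> inZone = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f.x >= left && f.x < right && f.y >= top && f.y < bottom) {
                inZone.add(f);
            }
        }
        // Left-to-right order, so each row can be built by appending.
        inZone.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        // Row index -> text of that row; TreeMap keeps rows sorted top to bottom.
        TreeMap<Integer, StringBuilder> rows = new TreeMap<>();
        for (Fragment f : inZone) {
            int row = (int) Math.floor((f.y - top) / lineHeight);
            int col = (int) Math.round((f.x - left) / charWidth);
            StringBuilder line = rows.computeIfAbsent(row, k -> new StringBuilder());
            if (line.length() > 0 && line.length() >= col) {
                line.append(' '); // overlapping runs: keep at least one space
            }
            while (line.length() < col) {
                line.append(' '); // pad so the run starts at its grid column
            }
            line.append(f.text);
        }

        if (rows.isEmpty()) {
            return "";
        }
        // Emit every row between the first and last, so blank lines survive too.
        StringBuilder out = new StringBuilder();
        for (int r = rows.firstKey(); r <= rows.lastKey(); r++) {
            if (r > rows.firstKey()) {
                out.append('\n');
            }
            StringBuilder line = rows.get(r);
            if (line != null) {
                out.append(line);
            }
        }
        return out.toString();
    }
}
```

The zone still has to come from somewhere, which matches your point that someone needs to select it manually unless the layout is known in advance. Once the rectangle is known, the column alignment of a table falls out of the x coordinates.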
----
Thanks,
Adam

From: Kevin Brown <[email protected]>
To: [email protected]
Date: 03/16/2011 08:23
Subject: OFF TOPIC -- Extracting PDF tables by selecting them?

Sorry, I understand pdfbox probably won't be able to do this.... but perhaps it can? :)

We use this software from BCL called Jade that allowed you to select a 'zone' on a PDF page and extract it to text in such a way that the spacing and line breaking was preserved. It did (and does!) a better job of this than any other tool we have ever tried. But they no longer make or support it!

Just wondering if any of you PDF mavens have found a tool or method for doing this which works really well? It seems impossible to do programmatically unless you know the parameters of the text -- one needs to select it manually. For example, we use this a lot for odd tables.

