RE: [PDFBox-user] Re: [iText-questions] Good reading/resarch on PDF text extraction

2006-02-21 Thread Leonard Rosenthol
At 11:53 AM 2/21/2006, Richard Braman wrote: How much of the PDF content do you reckon is tagged? Very little - < 10%. I haven't seen anything from IRS come tagged. They SHOULD be, since they are required by law (Section 508) - and the tagging is what improves the accessib

RE: [PDFBox-user] Re: [iText-questions] Good reading/resarch on PDF text extraction

2006-02-21 Thread Richard Braman
At 10:36 AM 2/21/2006, Richard Braman wrote: >As more and more content gets "pushed" into PDF it loses its >meaning to anyone else other than a human reader or a printer. ONLY IF t

Re: [iText-questions] Good reading/resarch on PDF text extraction

2006-02-21 Thread Leonard Rosenthol
At 10:36 AM 2/21/2006, Richard Braman wrote: As more and more content gets "pushed" into PDF it loses its meaning to anyone else other than a human reader or a printer. ONLY IF the document content is untagged. Tagged PDF (part of the PDF spec since 1.5) provides for the incl

[iText-questions] Good reading/resarch on PDF text extraction

2006-02-21 Thread Richard Braman
In 2003, Tamir Hassan wrote an OS program http://www.tamirhassan.com/ to extract text out of PDF tables and columns and put it into HTML as a part of a University research product. His algorithms were actually quite sophisticated and well documented in http://www.tamirhassan.d