Don't we have a FAQ for this somewhere? Extracting text from PDF is a Nontrivial Exercise. It's flippin' hard. Text in PDF is just characters and coordinates. No paragraphs or rows or lines. Just "draw glyph foo of font bar with this transformation matrix (....)". Often, you can skip the "glyph from a font" step and get the characters in some kind of human-readable encoding ("WinAnsi", "MacRoman", several others), but not always.
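To make "characters and coordinates" concrete, here is a toy sketch in plain Java. It is not iText API: the GlyphLines and Glyph names, the single (x, y) position per glyph, and the yTolerance parameter are all made up for illustration. It assumes the glyph positions have already been computed from the transformation matrices, and it ignores fonts, rotation, and word spacing, all of which a real extractor has to deal with.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, not iText API. A real extractor would also track font,
// size, and the full transformation matrix for each glyph.
final class GlyphLines {

    /** One drawn character and the page position it ends up at (hypothetical type). */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Rebuilds lines from glyphs given in arbitrary drawing order.
     * Glyphs whose y lies within yTolerance of the top-most glyph of a line
     * are put on that line. Lines come out top to bottom (PDF y grows upward),
     * and glyphs within a line come out left to right.
     */
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        // Copy so the caller's list is left alone, then sort top of page first.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort((a, b) -> Float.compare(b.y, a.y));

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : byY) {
            // A big enough drop in y starts a new line.
            if (!current.isEmpty() && current.get(0).y - g.y > yTolerance) {
                lines.add(render(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(render(current));
        }
        return lines;
    }

    /** Sorts one line's glyphs left to right and joins their characters. */
    private static String render(List<Glyph> line) {
        line.sort((a, b) -> Float.compare(a.x, b.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.ch);
        }
        return sb.toString();
    }
}
```

Feed it the glyphs D, F, H, P, i in that (alphabetical, non-reading) drawing order, with "Hi" sitting around y=700 and "PDF" at y=680, and toLines(glyphs, 2f) hands back "Hi" and then "PDF". Paragraphs, columns, and tables need further guesswork layered on top of this.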
To figure out lines and paragraphs, you need to figure out the locations of all those characters, which could be drawn in any order, not just reading order. Text is often drawn by font, or by size. It's perfectly legal to draw all the 'a's, then all the 'b's, and so forth. Hard and inefficient, but legal.

If these are PDFs you yourself are building, you have a couple of options available to you:

1) Marked Content, AKA PDF Structure
With PDF Structure you can mark parts of a content stream as part of some logical object... a table, a paragraph, whatever. You can then read that structure to extract the information you want.

2) Custom tags in the PDF, AKA "cheating". ;)
Write all the text you want to store somewhere in the PDF as text, broken up so it's in the format you want. You then suck it back out at your leisure.

Structure is pretty much the Adobe-supported way of implementing #2. It's a bit more complex than writing out a long string of text with some odd delimiters, but it is more portable. Other people might actually make use of it that way.

--Mark Storer
  Senior Software Engineer
  Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;

> -----Original Message-----
> From: crimeunit [mailto:nielspauwel...@gmail.com]
> Sent: Monday, May 17, 2010 5:57 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] iText Read Chuncks of PDF into java
>
> Hello,
>
> I've spent a lot of time searching for a solution to the
> following problem:
>
> With iText in Java, I want to read out the Chunks for each paragraph
> (because I want to have a list of all 'links-to-other-pdf-file').
>
> I have a first test application that works for getting the example
> values test1 and test2:
>
> Chunk chunk = new Chunk();
> chunk.setRemoteGoto("test1", "test2");
> ArrayList<Chunk> listChunks = chunk.getChunks();
> for (Chunk chnk : listChunks) {
>     chnk.getContent();
> }
>
> But now my problem is that these chunks don't come from an example
> I set up myself, but from a PDF that is read in.
> (My idea is to first load every paragraph of the document, and
> then take every chunk of those paragraphs.)
>
> Can somebody help me out, please?
>
> Appreciated in advance!
> --
> View this message in context:
> http://itext-general.2136553.n4.nabble.com/iText-Read-Chuncks-of-PDF-into-java-tp2219554p2219554.html
> Sent from the iText - General mailing list archive at Nabble.com.
>
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.itextpdf.com/book/
> Check the site with examples before you ask questions: http://www.1t3xt.info/examples/
> You can also search the keywords list: http://1t3xt.info/tutorials/keywords/