Re: [iText-questions] iText Read Chuncks of PDF into java

Mark Storer Mon, 17 May 2010 09:11:18 -0700

Don't we have a FAQ for this somewhere?

Extracting text from PDF is a Nontrivial Exercise.  It's flippin' hard.
Text in PDF is just characters and coordinates.  No paragraphs or rows
or lines.  Just "draw glyph foo of font bar with this transformation
matrix (....)".  Often, you can avoid the "glyph from a font" step and
get some kind of human encoding ("WinAnsi", "MacRoman", several others),
but not always.


To figure out lines and paragraphs you need to figure out the locations
of all those characters, which could be drawn in any order, not just
reading order.  Text is often drawn grouped by font or size, but it's
perfectly legal to draw all the 'a's, then all the 'b's, and so forth.
Hard and inefficient, but legal.
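
To make the "figure out the locations" step concrete, here is a minimal
sketch of grouping positioned characters back into lines.  It assumes you
have already obtained each character and its baseline coordinates by some
other means.  Glyph, LineGrouper and the yTolerance parameter are
hypothetical names for illustration, not iText classes, and the sketch
ignores word spacing, rotation and multi-column layouts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Glyph and LineGrouper are illustrative, not part of iText.
final class LineGrouper {

    /** One drawn character and the position of its baseline origin. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into lines regardless of the order they were drawn in.
     * A glyph joins the current line if its y is within yTolerance of the
     * first glyph of that line. Lines come back top to bottom (PDF y grows
     * upward); characters within a line come back left to right.
     */
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort((a, b) -> Float.compare(b.y, a.y)); // highest baseline first

        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = null;
        float lineY = 0f;
        for (Glyph g : sorted) {
            if (current == null || Math.abs(lineY - g.y) > yTolerance) {
                current = new ArrayList<>(); // baseline moved: start a new line
                lines.add(current);
                lineY = g.y;
            }
            current.add(g);
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort((a, b) -> Float.compare(a.x, b.x)); // left to right
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.ch);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Calling LineGrouper.toLines(glyphs, 2f) treats baselines within 2 units of
each other as the same line.  Real documents need more than this (spaces
have to be inferred from gaps between characters, for one), which is a big
part of why this is hard.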

If these are PDFs you yourself are building, you have a couple options
available to you:
1) Marked Content AKA PDF Structure
With PDF Structure you can mark parts of a content stream as part of
some logical object... A table, paragraph, whatever.  You can then read
that structure to extract the information you want.

2) Custom tags in the PDF.  AKA "cheating".  ;)  Write all the text you
want to store somewhere in the PDF as text, broken up so it's in the
format you want.  You then suck it back out at your leisure.

Structure is pretty much the Adobe-supported way of implementing #2.
It's a bit more complex than writing out a long string of text with some
odd delimiters, but is more portable.  Other people might actually make
use of it that way.
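
As a rough illustration of option 2, the sketch below packs a list of
records (here, pairs of link targets) into one delimited string and parses
it back.  LinkRecordCodec and the '|' and '~' delimiters are assumptions
made up for this example; where and how the packed string gets stored in
the PDF is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: the class name and both delimiters are illustrative choices.
final class LinkRecordCodec {
    private static final String FIELD_SEP = "|";  // separates fields within a record
    private static final String RECORD_SEP = "~"; // separates records

    /** Packs records into one string; rejects fields that contain a delimiter. */
    static String encode(List<String[]> records) {
        List<String> packed = new ArrayList<>();
        for (String[] record : records) {
            for (String field : record) {
                if (field.contains(FIELD_SEP) || field.contains(RECORD_SEP)) {
                    throw new IllegalArgumentException("field contains a delimiter: " + field);
                }
            }
            packed.add(String.join(FIELD_SEP, record));
        }
        return String.join(RECORD_SEP, packed);
    }

    /** Reverses encode(); an empty string yields an empty list. */
    static List<String[]> decode(String packed) {
        List<String[]> records = new ArrayList<>();
        if (packed.isEmpty()) {
            return records;
        }
        // Limit -1 keeps trailing empty fields instead of silently dropping them.
        for (String record : packed.split(Pattern.quote(RECORD_SEP), -1)) {
            records.add(record.split(Pattern.quote(FIELD_SEP), -1));
        }
        return records;
    }
}
```

The delimiter check is the price of keeping this simple: without escaping,
a field that contains '|' or '~' cannot round-trip, so the sketch refuses
it up front.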

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 
 

> -----Original Message-----
> From: crimeunit [mailto:nielspauwel...@gmail.com] 
> Sent: Monday, May 17, 2010 5:57 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] iText Read Chuncks of PDF into java
> 
> 
> Hello,
> 
> I've spent a lot of time searching for a solution to the 
> following problem:
> 
> With iText in Java I want to read out the Chunks for 
> each paragraph
> (because I want to have a list of all 'links-to-other-pdf-file').
> 
> 
> 
> 
> I have a first test application that works for getting 
> the example values
> test1 and test2:
> 
> Chunk chunk = new Chunk();
> chunk.setRemoteGoto("test1", "test2");
> ArrayList<Chunk> listChunks = chunk.getChunks();
> for (Chunk chnk : listChunks) {
>     chnk.getContent();
> }
> 
> 
> 
> But now my problem is that these chunks should come not from my own 
> example, but from a PDF that's read in.
> (My idea is to first load every paragraph of the 
> document and then take every chunk of these paragraphs.)
> 
> Can somebody help me out, please?
> 
> Appreciated in advance!
> --
> View this message in context: 
> http://itext-general.2136553.n4.nabble.com/iText-Read-Chuncks-
> of-PDF-into-java-tp2219554p2219554.html
> Sent from the iText - General mailing list archive at Nabble.com.
> 
> --------------------------------------------------------------
> ----------------
> 
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> Buy the iText book: http://www.itextpdf.com/book/ Check the 
> site with examples before you ask questions: 
> http://www.1t3xt.info/examples/ You can also search the 
> keywords list: http://1t3xt.info/tutorials/keywords/
> 
> 

