Re: text extraction from pdf

2008-05-15 Thread Bill Janssen
> The problem I am having is that some of them have multiple columns and multiple
> word boxes. Does the xpdf patch extract different columns and word boxes?

It tells you where each word is. Columns you have to do for yourself.

Bill

> > In UpLib, I use xpdf-3.02pl2 with a patch which gives me positi
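
Since the reply leaves column detection to the reader, here is a minimal sketch of one way to do it, assuming you already have one bounding box per word. The `Word` tuple, `find_columns`, `reading_order`, and the `min_gap` threshold are hypothetical names chosen for illustration; they are not part of xpdf, the UpLib patch, or PDFBox. The idea is to merge the horizontal extents of all words and treat any wide empty gap as a column boundary, then read each column top to bottom.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word box; y is assumed to grow downward on the page."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def find_columns(words, min_gap=20.0):
    """Merge the x-extents of all words; a gap wider than min_gap starts a new column."""
    columns = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if columns and x0 - columns[-1][1] <= min_gap:
            # Close enough to the current column: extend its right edge.
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [tuple(c) for c in columns]


def reading_order(words, min_gap=20.0):
    """Return words column by column (left to right), each column top to bottom."""
    ordered = []
    for left, right in find_columns(words, min_gap):
        in_column = [w for w in words if left <= w.x0 and w.x1 <= right]
        # Assumes words on the same line share the same top coordinate.
        in_column.sort(key=lambda w: (w.y0, w.x0))
        ordered.extend(in_column)
    return ordered
```

This only works when the columns are separated by a clear vertical gutter down the whole page. A full-width heading or a table spanning both columns would bridge the gap and collapse everything into one column, so real documents probably need the page split into horizontal bands first.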

Re: text extraction from pdf

2008-05-15 Thread Cam Bazz
Hello Bill,

The problem I am having is that some of them have multiple columns and multiple word boxes. Does the xpdf patch extract different columns and word boxes?

Best,
-C.B.

On Wed, May 14, 2008 at 6:35 PM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> > > the unix program pdf2text can convert keep

Re: text extraction from pdf

2008-05-14 Thread Bill Janssen
> > the unix program pdf2text can convert keeping the text places, but I wanted
> > to ask you guys if you know something better,
>
> AFAIK, PDFBox has a lower-level API that allows you to get hold of text
> positions.

In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and font in

Re: text extraction from pdf

2008-05-14 Thread Andrzej Bialecki
Cam Bazz wrote:

Hello All,

Any suggestions for extracting text from PDF? I have tried PDFBox, and it works nicely; however, if the PDF is structured, it won't provide good results. For example, consider the PDF:

P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2 P1 bla bla

text extraction from pdf

2008-05-14 Thread Cam Bazz
Hello All,

Any suggestions for extracting text from PDF? I have tried PDFBox, and it works nicely; however, if the PDF is structured, it won't provide good results. For example, consider the PDF:

P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2 P1 bla bla P2 bla bla bla P