Dear Eliot,

> So in short, it's not unreasonable but it's also not something that can be
> easily generalized. For a general solution you have to have some way to
> configure the details about the pages you're extracting text from: the
> header and footer boundaries, the number of columns, the writing system used
> (is it Hebrew or Arabic? Is it a top-to-bottom, right-to-left language?),
> and so on. 

Yes, I am aware of that, but that seems to be at least partially solved (see 
below).

> I implemented a pretty good paragraph recognizer some years ago using an
> earlier (but functionally equivalent for the purpose) version of PDFBox.
> Unfortunately, that code was proprietary and I no longer have access to it.
> But we were able to recognize paragraphs on pages in typical mass-market
> fiction books (we were doing conversion of PDFs to a proprietary e-reader
> format). We also had to recognize page breaks within paragraphs and do
> de-hyphenation.


This sounds like the kind of program I'm looking for; a pity it's not
available.

And Bob,

> I have struggled with the same issues, not just with free
> ebooks, but web page content, etc. The "free" books from Project
> Gutenberg are often available in plain text, and
> you can work from there. Sometimes, books offered
> by Google Books are available in plain text.

Yes, I'm aware of that, and they usually give a pretty rich choice of formats.
What bothers me is that I can get e.g. the OpenAccess books from my own 
university (see http://www.univerlag.uni-goettingen.de/) only in PDF (they say 
it's hard to produce the various formats). And these PDFs don't display easily 
on ebook readers.

> My current tool of choice is Calibre. It can read PDF and convert
> to many formats. How well? It has problems.

I tried Calibre but wasn't really satisfied, which is why I came back to pdfbox.

My point is that pdfbox's html output is already fairly close to what I need.
In html output, breaks between paragraphs are recognised and marked by </p><p>,
while line breaks are preserved as such, but not tagged. The same distinction
(e.g. by a blank line) in the text output would already go a long way in the
direction I'm looking for.
Thus the paragraph recognition problem seems to be essentially solved.
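
As an illustration of the transformation meant here, the following is a minimal sketch using only the Java standard library. It assumes the html output wraps each paragraph in <p>...</p> and leaves line breaks inside a paragraph as plain newlines, as described above. The class and method names are hypothetical and not part of pdfbox, and HTML entities such as &amp; are not decoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of pdfbox: html paragraphs to plain text. */
final class HtmlParagraphsToText {

    // Captures the body of each <p>...</p> element, including embedded newlines.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns one line per paragraph, with a blank line between paragraphs. */
    static String convert(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String body = m.group(1)
                    .replaceAll("<[^>]+>", "")      // drop any remaining inline tags
                    .replaceAll("\\s*\\R\\s*", " ") // join the untagged line breaks
                    .trim();
            if (!body.isEmpty()) {
                paragraphs.add(body);
            }
        }
        return String.join("\n\n", paragraphs);
    }
}
```

Calling HtmlParagraphsToText.convert(html) on the html string yields text in which only paragraph breaks survive, each as a blank line.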

What I'm missing there is the distinction between page breaks that are also
paragraph breaks and those that are only line breaks. Otherwise I could fairly
easily transform the html output into the kind of text I am looking for, using
some search and replace with regular expressions.
However, I'm not sufficiently versed in either Java or the PDF format to know
where I could modify the program to handle that distinction. But probably
someone else is…
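
Failing a fix inside pdfbox itself, the distinction could be approximated after extraction. The sketch below is only a guess at a workable heuristic, not pdfbox behaviour: it assumes the text of each page is already available as a separate string, treats a page that ends in sentence-final punctuation as ending a paragraph, and otherwise joins it to the next page, removing a trailing hyphen. The class name is hypothetical. It will be wrong when a page happens to end on a full sentence in the middle of a paragraph, and it also removes the hyphen of a genuine compound split across pages.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: joins per-page text, guessing where paragraphs end. */
final class PageBreakJoiner {

    // Sentence-final punctuation, optionally followed by a closing quote or
    // bracket, at the very end of the text collected so far.
    private static final Pattern SENTENCE_END = Pattern.compile("[.!?][\"')\\]]*$");

    static String join(List<String> pages) {
        StringBuilder out = new StringBuilder();
        for (String page : pages) {
            String text = page.trim();
            if (text.isEmpty()) {
                continue; // skip blank pages
            }
            if (out.length() == 0) {
                out.append(text);
            } else if (out.charAt(out.length() - 1) == '-') {
                // Word hyphenated across the page break: drop the hyphen and join.
                out.setLength(out.length() - 1);
                out.append(text);
            } else if (SENTENCE_END.matcher(out).find()) {
                // Previous page ended a sentence: assume a paragraph break.
                out.append("\n\n").append(text);
            } else {
                // Mid-sentence page break: it is only a line break.
                out.append(' ').append(text);
            }
        }
        return out.toString();
    }
}
```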

Best
Thomas
