Dear Eliot,

> So in short, it's not unreasonable but it's also not something that can be
> easily generalized. For a general solution you have to have some way to
> configure the details about the pages you're extracting text from: the
> header and footer boundaries, the number of columns, the writing system used
> (is it Hebrew or Arabic? Is it a top-to-bottom, right-to-left language?),
> and so on.

Yes, I am aware of that, but that seems to be at least partially solved (see below).

> I implemented a pretty good paragraph recognizer some years ago using an
> earlier (but functionally equivalent for the purpose) version of PDFBox.
> Unfortunately, that code was proprietary and I no longer have access to it.
> But we were able to recognize paragraphs on pages in typical mass-market
> fiction books (we were doing conversion of PDFs to a proprietary e-reader
> format). We also had to recognize page breaks within paragraphs and do
> de-hyphenation.

This sounds like the kind of program I'd be looking for; pity it's not available.

And Bob,

> I have struggled with the same issues, not just with free
> ebooks, but web page content, etc. The "free" books from Project
> Gutenberg are often available in plain text, and
> you can work from there. Sometimes, books offered
> by Google Books are available in plain text.

Yes, I'm aware of that, and they usually give a pretty rich choice of formats. What bothers me is that I can get e.g. the OpenAccess books from my own university (see http://www.univerlag.uni-goettingen.de/) only in PDF (they say it's hard to produce the various formats). And these PDFs don't display easily on ebook readers.

> My current tool of choice is Calibre. It can read PDF and convert
> to many formats. How well? It has problems.

I tried Calibre, but wasn't really satisfied; that's why I came back to pdfbox.

My point is that pdfbox with html format is already fairly close to what I need. In html format, breaks between paragraphs are recognised and marked by </p><p>, while line breaks are preserved as such, but not tagged. The same distinction (e.g. by a blank line) in the text format would already go a long way in the direction I'm looking for. Thus the paragraph recognition problem seems to be essentially solved. What I'm missing there is the distinction between page breaks that are also paragraph breaks and those that are only line breaks.
Otherwise I could fairly easily transform the html format into the kind of text I am looking for, using some search and replace with regular expressions. But I'm not sufficiently versed in either Java or the PDF format to know where I could modify the program to handle that distinction. But probably someone else is…

Best
Thomas
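
P.S. To make the regular-expression idea concrete, here is a minimal sketch in Python. It assumes the html output marks paragraph boundaries with </p><p> and leaves line breaks inside a paragraph untagged, as described above. The function name html_to_text and the sample input are made up for illustration, and the de-hyphenation rule is deliberately naive: it also joins genuine hyphenated compounds that happen to be split at a line end.

```python
import html
import re


def html_to_text(page_html: str) -> str:
    """Convert paragraph-tagged HTML to plain text, one blank line between paragraphs.

    Hypothetical helper for illustration; not part of pdfbox.
    """
    # Split at paragraph boundaries, tolerating whitespace between </p> and <p>.
    chunks = re.split(r"</p>\s*<p[^>]*>", page_html, flags=re.IGNORECASE)
    paragraphs = []
    for chunk in chunks:
        # Drop whatever tags remain (<html>, <body>, the outermost <p>, ...).
        text = re.sub(r"<[^>]+>", "", chunk)
        # Decode entities such as &amp;.
        text = html.unescape(text)
        # Naive de-hyphenation: rejoin a word split by a hyphen at a line end.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
        # Remaining line breaks inside a paragraph are layout only: make them spaces.
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


sample = "<p>This is a para-\ngraph split over\nthree lines.</p>\n<p>Second paragraph.</p>"
print(html_to_text(sample))
```

This only covers the part that looks easy. It does nothing about the page-break distinction, so a paragraph that runs over a page end would still need to be handled inside the program itself.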

