Hi guys! I use an ebook reader, and I often need to convert PDF files I
find online to ePub (which, as you might know, is essentially a collection
of HTML files). While Calibre has this function (it uses pdftohtml AFAICT),
the resulting documents are awful. There are some other tools available,
but I decided to write my own.

I'm now trying to use podofo, but I'm stuck because I can't figure out how
to extract the actual text. As far as I have learned, PDF pages consist of
small spans of positioned text, plus images. At this point I would be fine
with a way to get a list of drawing commands such as draw_text and
draw_image, but I haven't managed to find that kind of parser in the
podofo API. Correct me if I'm wrong.

I've tried the podofotextextract tool, but while it extracts the spans of
text, on many documents it reports them all positioned at (0, 0), because
it seems to be skipping the TD commands. Furthermore, it prints lots of
errors and even invalid characters. I've tried it on a PDF generated by
LibreOffice. Also see http://abrp.bizland.com/sample.pdf for a sample that
outputs all the text at (0, 0) and generates some font warnings as well.
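
To make the (0, 0) problem concrete: as far as I understand it, each span's
position comes from the text-positioning operators in the page content
stream. Below is a minimal, self-contained C++ sketch (it does not use podofo)
of how an extractor could track BT, Td, TD, TL, T* and Tm so that every Tj
string gets a position. TextSpan, tokenize, is_operand and extract_spans are
hypothetical names of mine. The sketch assumes well-formed input and ignores
the CTM (cm), font size, glyph widths, TJ arrays and string escapes, so it
only illustrates the bookkeeping.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One positioned run of text (hypothetical type, not part of podofo).
struct TextSpan {
    double x;
    double y;
    std::string text;
};

// Split a content stream into tokens. Literal strings "(...)" stay in one
// piece; nested parentheses and backslash escapes are NOT handled.
std::vector<std::string> tokenize(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        if (std::isspace(static_cast<unsigned char>(s[i]))) { ++i; continue; }
        const std::size_t start = i;
        if (s[i] == '(') {
            while (i < s.size() && s[i] != ')') ++i;
            if (i < s.size()) ++i;  // include the closing ')'
        } else {
            while (i < s.size() && s[i] != '(' &&
                   !std::isspace(static_cast<unsigned char>(s[i]))) ++i;
        }
        out.push_back(s.substr(start, i - start));
    }
    return out;
}

// Numbers, strings, names and array brackets are operands; the rest are
// treated as operators.
bool is_operand(const std::string& t) {
    assert(!t.empty());
    const char c = t[0];
    return std::isdigit(static_cast<unsigned char>(c)) ||
           std::string("+-.(/[]<").find(c) != std::string::npos;
}

std::vector<TextSpan> extract_spans(const std::string& content) {
    std::vector<TextSpan> spans;
    std::vector<std::string> args;           // operands seen since the last operator
    double m[6] = {1, 0, 0, 1, 0, 0};        // text line matrix [a b c d e f]
    double leading = 0;
    // Td semantics: translate by (tx, ty) expressed in the current text space.
    auto move = [&m](double tx, double ty) {
        m[4] += tx * m[0] + ty * m[2];
        m[5] += tx * m[1] + ty * m[3];
    };
    for (const std::string& tok : tokenize(content)) {
        if (is_operand(tok)) { args.push_back(tok); continue; }
        const std::size_t n = args.size();
        if (tok == "BT") {                   // a new text object resets the matrix
            m[0] = 1; m[1] = 0; m[2] = 0; m[3] = 1; m[4] = 0; m[5] = 0;
        } else if ((tok == "Td" || tok == "TD") && n >= 2) {
            const double tx = std::stod(args[n - 2]);
            const double ty = std::stod(args[n - 1]);
            if (tok == "TD") leading = -ty;  // TD also sets the leading
            move(tx, ty);
        } else if (tok == "TL" && n >= 1) {
            leading = std::stod(args[n - 1]);
        } else if (tok == "T*") {            // next line: same as "0 -leading Td"
            move(0.0, -leading);
        } else if (tok == "Tm" && n >= 6) {  // replace the whole matrix
            for (std::size_t k = 0; k < 6; ++k) m[k] = std::stod(args[n - 6 + k]);
        } else if (tok == "Tj" && n >= 1) {
            const std::string& s = args[n - 1];
            if (s.size() >= 2 && s.front() == '(' && s.back() == ')')
                spans.push_back({m[4], m[5], s.substr(1, s.size() - 2)});
        }
        args.clear();                        // unknown operators just drop operands
    }
    return spans;
}
```

Feeding this a stream such as "BT 72 700 Td (Hello) Tj ET" gives one span at
(72, 700). If the Td/TD branch is dropped, every span stays at the origin,
which looks like what I'm seeing.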

There is even printf output occurring in the library itself such as
"Reading object 777 0 R with type: Number", which looks like debug text
left over.

So my question is: is there a way to get more usable data for my scenario?
For example, Xpdf has the OutputDev class that you can extend; pdftohtml
extends it with HtmlOutputDev and does the processing in that class
(including joining the text spans back into rows, and then outputting HTML
tags). Or is this functionality missing from podofo at this time?
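
To illustrate the kind of interface I have in mind, here is a rough,
self-contained C++ sketch. DrawSink and HtmlSink are hypothetical names of
mine; this is neither Xpdf's OutputDev nor anything that exists in podofo.
A parser would call draw_text and draw_image for each drawing command, and the
sink would group spans with nearly equal y into rows and emit simple HTML. The
row tolerance of 2 units is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical callback interface that a page parser would drive.
class DrawSink {
public:
    virtual ~DrawSink() = default;
    virtual void draw_text(double x, double y, const std::string& text) = 0;
    virtual void draw_image(double x, double y, const std::string& name) = 0;
};

// Collects drawing commands and joins them into rows of HTML.
class HtmlSink : public DrawSink {
public:
    explicit HtmlSink(double row_tolerance = 2.0) : tol_(row_tolerance) {
        assert(row_tolerance >= 0.0);
    }

    void draw_text(double x, double y, const std::string& text) override {
        items_.push_back({x, y, escape(text)});
    }

    void draw_image(double x, double y, const std::string& name) override {
        items_.push_back({x, y, "<img src=\"" + escape(name) + "\">"});
    }

    std::string html() const {
        std::vector<Item> sorted = items_;
        // PDF's default origin is bottom-left, so larger y is higher on the page.
        std::stable_sort(sorted.begin(), sorted.end(),
                         [](const Item& a, const Item& b) { return a.y > b.y; });
        std::string out;
        std::size_t i = 0;
        while (i < sorted.size()) {
            // A row is every item within tol_ of the row's topmost item.
            const double row_y = sorted[i].y;
            std::size_t j = i;
            while (j < sorted.size() && std::fabs(sorted[j].y - row_y) <= tol_) ++j;
            // Inside a row, read left to right.
            std::sort(sorted.begin() + i, sorted.begin() + j,
                      [](const Item& a, const Item& b) { return a.x < b.x; });
            out += "<p>";
            for (std::size_t k = i; k < j; ++k) {
                if (k > i) out += ' ';
                out += sorted[k].markup;
            }
            out += "</p>\n";
            i = j;
        }
        return out;
    }

private:
    struct Item {
        double x;
        double y;
        std::string markup;
    };

    // Minimal HTML escaping for text and attribute values.
    static std::string escape(const std::string& s) {
        std::string out;
        for (char c : s) {
            if (c == '&') out += "&amp;";
            else if (c == '<') out += "&lt;";
            else if (c == '>') out += "&gt;";
            else if (c == '"') out += "&quot;";
            else out += c;
        }
        return out;
    }

    double tol_;
    std::vector<Item> items_;
};
```

With something like this, the converter would only need to implement one
subclass per output format, and the parsing side would stay the same.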
_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
