Re: [Zorba-coders] [Merge] lp:~zorba-coders/zorba/fread-pdf-trunk into lp:zorba

Matthias Brantner Fri, 21 Sep 2012 10:36:36 -0700

Review: Needs Fixing

The module works pretty well. I was able to extract text or generate images 
for several PDFs without any problems.
There are some minor things that should be discussed and/or fixed:


- the error seems to be too general; essentially it always raises 
JAVA-EXCEPTION no matter what goes wrong (e.g. if the given input is not a 
valid PDF)

- the Java stack trace seems to be sent to standard error

- Renders the each page of the PDF document as an image. => Renders each page 
of the PDF document as an image.

- the names of the private functions should also adhere to the code 
conventions, e.g. renderToImages => render-to-images

- make xqdoc fails because the comments seem to contain invalid XML
</home/mbrantner/zorba/build/URI_PATH/com/zorba-xquery/www/modules/project_xqdoc.xq>:142,9:
 user-defined error [err:UE004]: Error processing module zerr:ZXQD0002 - " This 
module provides funtionality to read the text from PDF documents and
 to render PDF documents to images.
 <a href="http://pdfbox.apache.org">Apache PDFBox</a> library is used to
 implement these functions.
 <br />
 <br />
 <b>Note:</b> Since this module has a Java library dependency a JVM required
 to be installed on the system. For Windows: jvm.dll is required on the system
 path ( usually located in "C:\Program Files\Java\jre6\bin\client".
 <b>Note:<b> For Debian based Linux distributions install PdfBox and FontBox
 packages: sudo apt-get install libpdfbox-java libfontbox-java
": can not parse as XML for xqdoc: loader parsing error: Opening and ending tag 
mismatch: b line 0 and root
; raised at 
/home/mbrantner/zorba/sandbox/src/runtime/errors_and_diagnostics/errors_and_diagnostics_impl.cpp:81

- adapt the year in "Copyright 2006-2009 The FLWOR Foundation." in the .xq file 
(and some other files also)

- would it make sense to return one string per page in the PDF instead of one 
big string?

- remove commented out code in read-pdf.cpp

- valgrind shows tons of invalid writes. Why? Are they critical? Is there 
anything we can do?

- would it make sense to return the images in a streaming fashion (i.e. don't 
create all base64's in a vector)?

- encoding each image shouldn't be necessary and will probably be wasted effort 
because the images might be written to a file in their binary form
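
To make the last two points concrete, here is a minimal sketch of what I mean 
by streaming. It is not based on the actual code in read-pdf.cpp: the names 
PageImageIterator, RenderPage and ImageBytes are hypothetical, and the 
renderer is just a callback standing in for whatever call produces one page 
image. The idea is only that each page is rendered when the consumer asks for 
it, and that the raw bytes are handed over without base64 encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Raw (not base64-encoded) bytes of one rendered page.
using ImageBytes = std::vector<std::uint8_t>;

// Hypothetical stand-in for the call that renders a single page.
using RenderPage = std::function<ImageBytes(std::size_t pageIndex)>;

// Hypothetical pull-style iterator: renders one page per next() call
// instead of collecting all images in a vector up front.
class PageImageIterator {
public:
  PageImageIterator(std::size_t pageCount, RenderPage render)
      : pageCount_(pageCount), render_(std::move(render)) {}

  // Returns the next page image, or std::nullopt once all pages are consumed.
  std::optional<ImageBytes> next() {
    if (current_ >= pageCount_) return std::nullopt;
    return render_(current_++);
  }

private:
  std::size_t pageCount_;   // total number of pages in the document
  RenderPage render_;       // renders exactly one page on demand
  std::size_t current_ = 0; // index of the next page to render
};
```

A consumer would call next() in a loop until it returns std::nullopt, so 
nothing is rendered for pages that are never requested, and a caller that 
writes the images to files never pays for the encoding.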
-- 
https://code.launchpad.net/~zorba-coders/zorba/fread-pdf-trunk/+merge/125338
Your team Zorba Coders is subscribed to branch lp:zorba.
