Re: The Multivio project

2010-05-21 Thread Johnny Mariéthoz
Dear Ferran,

I work on the Multivio project, mainly on the server side, which does the
PDF rendering.

On 3 May 2010 at 11:19, Ferran Jorba wrote:

 Hello Miguel,
 
 It certainly looks interesting!  I've tested a couple of PDF files from
 our site.  The first one happened to weigh 11 MB (an old scanned
 journal, from http://ddd.uab.cat/record/53804), and it took so long that
 I had to abort it:
 
 http://demo.multivio.org/client/#geturl=http://ddd.uab.cat/pub/garbanzo/garbanzo_a1873n47.pdf
 
 The second one, a modern native PDF (from http://ddd.uab.cat/record/5),
 was slightly better:
 
 http://demo.multivio.org/client/#geturl=http://ddd.uab.cat/pub/autonoma/autonoma_a2010m3n233.pdf
 
 I'd certainly choose Multivio instead of our Flash-based equivalent (no,
 it's not my fault; I'm giving it to you so you can see a proprietary
 alternative):
 
 http://www.uab.es/revista-autonoma/

Some PDFs can take a while to be displayed in Multivio mainly because of
network problems on the server side. An external file first has to be
downloaded to the Multivio server, and it seems that we have a problem
with our firewall.

To avoid this, you can try Multivio with local files from RERO DOC. For example:

http://doc.rero.ch/record/18242?ln=fr (try the multivio button)

or directly:

http://demo.multivio.org/client/#geturl=http://doc.rero.ch/record/18242/export/xd

We are nevertheless working on a new prototype to make Multivio more
responsive. But as you can see, accessing very big files takes only a few
seconds. The main idea is that only the information seen by the user is
downloaded. This is particularly useful for smartphones or GPRS/UMTS
internet connections.

Fast web view PDF is part of the solution for big PDF files. At RERO, we have
specific collections such as L'Impartial, a Swiss newspaper, which represent
gigabytes of data. For such cases, fast web view is not really useful.
Moreover, to search in a PDF or display its table of contents, the file has
to be downloaded completely.

 
 However, I'd say that the quality of the thumbnails can be improved.

Do you mean the quality of the rendering?

 Which tool are you using?  In my case, I've found that, by far, the
 fastest and best results are using a combination of Xpdf's pdftoppm and
 Imagemagick's convert.  We create the thumbnails of the first page of
 our PDFs this way (simplified):
 
 $ pdftoppm -f $page -l $page $file.pdf $file
 $ convert -thumbnail 85 $file-0$page.ppm $file.png
 $ rm $file-0$page.ppm
 
 pdftoppm converts all pages to ppm if no -f or -l parameters are given.
 ImageMagick's own PDF to PNG (or any other graphic format) conversion
 goes through Ghostscript, which brings any system to its knees, and the
 quality is worse.

I use the same kind of tools: poppler, with a Python wrapper I wrote around
the poppler classes to perform the rendering, and PIL for the image
manipulation. I want to avoid system calls.

I know that the rendering is affected by the font configuration/installation
on the Linux distribution. I have to do some tests to obtain the best
results. If you have any advice on that, please do not hesitate to share it.

 Hope it helps,

Of course, all comments are welcome. Thanks again.
 
 Ferran

-- Johnny



Re: The Multivio project

2010-05-03 Thread Piotr Praczyk
Hi !

Looks very interesting :) Thanks for the information about this.
Just as a curiosity: it does not support Polish fiscal
declarations ;) (Very few PDF readers do ;) )

http://demo.multivio.org/client/#geturl=http://e-deklaracje.mf.gov.pl/files/pdf/PIT-36%2814%29_v2-0.pdf


cheers
Piotr

2010/5/3 Miguel Moreira miguel.more...@rero.ch:
 i...@multivio.org


Re: The Multivio project

2010-05-03 Thread Ferran Jorba
Hello Miguel,

 I hope you'll excuse me for using the list, but I have an announcement
 that might be of interest to you.

 There's a project going on here at RERO called Multivio whose goal
 is to provide a presentation layer for archives of digital documents:

 https://www.multivio.org/

[...]
 Please don't hesitate to take a look at the project site
 https://www.multivio.org/, try some examples, try it with your own
 documents and send us some feedback at i...@multivio.org.

It certainly looks interesting!  I've tested a couple of PDF files from
our site.  The first one happened to weigh 11 MB (an old scanned
journal, from http://ddd.uab.cat/record/53804), and it took so long that
I had to abort it:

 
http://demo.multivio.org/client/#geturl=http://ddd.uab.cat/pub/garbanzo/garbanzo_a1873n47.pdf

The second one, a modern native PDF (from http://ddd.uab.cat/record/5),
was slightly better:

 
http://demo.multivio.org/client/#geturl=http://ddd.uab.cat/pub/autonoma/autonoma_a2010m3n233.pdf

I'd certainly choose Multivio instead of our Flash-based equivalent (no,
it's not my fault; I'm giving it to you so you can see a proprietary
alternative):

 http://www.uab.es/revista-autonoma/

However, I'd say that the quality of the thumbnails can be improved.
Which tool are you using?  In my case, I've found that, by far, the
fastest and best results are using a combination of Xpdf's pdftoppm and
Imagemagick's convert.  We create the thumbnails of the first page of
our PDFs this way (simplified):

 $ pdftoppm -f $page -l $page $file.pdf $file
 $ convert -thumbnail 85 $file-0$page.ppm $file.png
 $ rm $file-0$page.ppm

pdftoppm converts all pages to ppm if no -f or -l parameters are given.
ImageMagick's own PDF to PNG (or any other graphic format) conversion
goes through Ghostscript, which brings any system to its knees, and the
quality is worse.
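For anyone who would rather drive the same two commands from a script, here is a minimal Python sketch of the pipeline above. It assumes pdftoppm and ImageMagick's convert are on the PATH; the function names (pdftoppm_args, convert_args, make_thumbnail) are hypothetical and only for illustration. Because the zero-padding in pdftoppm's output file name depends on the document, the sketch looks the file up with a glob instead of hard-coding the "-0$page" suffix.

```python
import glob
import os
import subprocess
import tempfile


def pdftoppm_args(pdf_path, page, out_root):
    """Command line that renders a single page of pdf_path to <out_root>-<n>.ppm."""
    return ["pdftoppm", "-f", str(page), "-l", str(page), pdf_path, out_root]


def convert_args(ppm_path, png_path, width=85):
    """Command line that scales the PPM to a thumbnail of the given width."""
    return ["convert", "-thumbnail", str(width), ppm_path, png_path]


def make_thumbnail(pdf_path, png_path, page=1, width=85):
    """Render one page with pdftoppm, then shrink it with convert.

    Assumption: both tools are installed. The intermediate PPM lives in a
    temporary directory, so no explicit 'rm' step is needed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "page")
        subprocess.run(pdftoppm_args(pdf_path, page, root), check=True)
        # The page number in the file name is zero-padded; find it by pattern.
        ppm_files = glob.glob(root + "-*.ppm")
        if len(ppm_files) != 1:
            raise RuntimeError("expected exactly one PPM, found %d" % len(ppm_files))
        subprocess.run(convert_args(ppm_files[0], png_path, width), check=True)
```

A call such as make_thumbnail("some.pdf", "some.png") would then stand in for the three shell lines, with the default width of 85 pixels taken from the example above.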

Hope it helps,

Ferran


Re: The Multivio project

2010-05-03 Thread Ferran Jorba
Hello Samuele,

 On Monday 3 May 2010 11:19:23, Ferran Jorba wrote:
 It certainly looks interesting!  I've tested a couple of PDF files from
 our site.  The first one happened to weigh 11 MB (an old scanned
 journal, from http://ddd.uab.cat/record/53804), and it took so long that
 I had to abort it:

 Just for reference, in case other users need it too: the pdfopt
 utility (from Ghostscript) can transform any PDF into a linearized
 PDF (also called fast web view mode), which adds hints to the PDF so
 that single pages can be referenced without downloading the full
 file. I guess this would let Multivio open your 11 MB scanned
 document without any problem.

Thanks for the suggestion.  I've tried it on one of our 100 MB+
monsters and what I've seen is that the size barely changes.  But
Xpdf's pdfinfo certainly notes the change in the «Optimized» field:

               before            after pdfopt

 Pages:        294               294
 Encrypted:    no                no
 File size:    112858067 bytes   112838493 bytes
 Optimized:    no                yes
 PDF version:  1.5               1.5

Another task in our TODO list...
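To check the «Optimized» state from a script, without calling pdfinfo, a rough heuristic is to look for the /Linearized key near the start of the file: in a linearized PDF the linearization parameter dictionary sits within the first 1024 bytes. The sketch below assumes exactly that and nothing more. It does not validate the hint tables, and the function names are hypothetical.

```python
def head_is_linearized(head):
    """Heuristic on the leading bytes of a PDF.

    Assumption: a linearized file carries its linearization parameter
    dictionary (with a /Linearized key) in the first 1024 bytes.
    """
    return b"/Linearized" in head[:1024]


def looks_linearized(pdf_path):
    """Read only the first 1024 bytes of the file and apply the heuristic."""
    with open(pdf_path, "rb") as handle:
        return head_is_linearized(handle.read(1024))
```

Since only the first kilobyte is read, this stays cheap even on the 100 MB+ files, which makes it usable for a batch pass over a whole collection before deciding which files to run through pdfopt.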

Thanks,

Ferran


Re: The Multivio project

2010-05-03 Thread Samuele Kaplun
On Monday 3 May 2010 16:23:06, Ferran Jorba wrote:
 Thanks for the suggestion.  I've tried it on one of our 100 MB+
 monsters and what I've seen is that the size barely changes.  But
 Xpdf's pdfinfo certainly notes the change in the «Optimized» field:
 
                before            after pdfopt
 
  Pages:        294               294
  Encrypted:    no                no
  File size:    112858067 bytes   112838493 bytes
  Optimized:    no                yes
  PDF version:  1.5               1.5
 
 Another task in our TODO list...

Yes, this is correct. The file is basically the same, and the size should
change only slightly (if anything, it tends to get a bit larger). What really
changes is that the PDF is reorganized so that a special table, stored at an
easily reachable position inside the PDF file, is filled with pointers to the
pages. This makes it possible to jump directly to any given page: over HTTP,
a client can request exactly the range of bytes needed to render that page,
while it continues pre-fetching the rest of the document in the background.
So if you access your optimized monster through Multivio (and if Multivio
takes advantage of this feature), you should definitely be able to jump to
each page quickly, regardless of the size.
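As a rough illustration of the HTTP side of this, here is a minimal Python sketch of a byte-range request using only the standard library. It is not how Multivio or any particular viewer does it. The function names are hypothetical, and the byte offsets would in practice come from the hint table of the linearized PDF, which this sketch does not parse.

```python
import urllib.request


def range_request(url, first_byte, last_byte):
    """Build a GET request for the inclusive byte range first_byte..last_byte."""
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=%d-%d" % (first_byte, last_byte))
    return request


def fetch_range(url, first_byte, last_byte, timeout=30):
    """Return (status, data) for the requested range.

    Status 206 means the server honoured the Range header; status 200
    means it ignored the header and sent the whole file instead.
    """
    request = range_request(url, first_byte, last_byte)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

For example, fetch_range(url, 0, 1023) asks for the first 1024 bytes only, which is where a linearized file announces itself. The server has to support range requests for any of this to pay off.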

Cheers,
Sam

P.S. The next release of Invenio will integrate a conversion library that
will, among other things, wrap pdfopt in the fulltext management operations,
so that in principle you get this optimization for free (though in the
current git master it is not yet fully integrated in WebSubmit & friends...).

-- 
Samuele Kaplun ** CERN Document Server ** http://cds.cern.ch/