On 7/25/05, David A. Desrosiers <[EMAIL PROTECTED]> wrote:
> 
> > I am looking for a way to speed up the generation of the pdb's.
> > Right now I am generating a bunch of html files, which are then
> > converted to the Plucker pdb format with plucker-build.  This
> > process takes about 30-40 minutes on my machine (AMD XP 2500, 512MB
> > ram, Mandrake Linux 10.0).
> 
>         How many thousands of html files are you converting? 10,000?
> 50,000? More? plucker-build (or cplucker, the C++ parser) should never
> take that long, unless you have an _enormous_ number of files, or your
> Python vm is running out of memory for some reason.

There are about 2000 files. The files are small, but there are indeed many
links (grep -i href *html found 3200 links).
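
A note on that count: grep reports matching lines, so a line holding two
links is counted once. A minimal sketch of a per-anchor count, assuming the
generated files are plain HTML in one directory (the function names here are
made up for illustration and are not part of Plucker):

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1


def count_links(html_text):
    """Return the number of <a href=...> anchors in one HTML string."""
    parser = LinkCounter()
    parser.feed(html_text)
    parser.close()
    return parser.links


def count_links_in_dir(directory):
    """Sum the anchor counts over every *.html file in a directory."""
    return sum(
        count_links(path.read_text(errors="replace"))
        for path in Path(directory).glob("*.html")
    )
```

Calling count_links_in_dir(".") in the output directory would give a figure
to compare against the grep result.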

Well, most of that time is spent when the html files are generated
(the data is read from a PostgreSQL database). The generation of the
pdb's is delayed until each sales agent synchronises his/her Palm.

Thanks for the tips. I'll start by trying to optimise the html generation,
and if that is not enough I'll see what can be done to speed up the pdb
generation, too.
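
To confirm where the 30-40 minutes actually go before optimising anything,
each stage can be timed separately. A rough sketch, assuming the job is
driven from a Python script; the stage names in the usage note are
placeholders, not real function names:

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage name.
timings = {}


@contextmanager
def timed(stage):
    """Add the elapsed time of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed
```

Usage would look like "with timed('html generation'): ..." around the
database export and "with timed('pdb build'): ..." around the plucker-build
call, followed by printing the timings dictionary.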


Best wishes,
Adrian Maier
 
>         The Python distiller has a few deficiencies in this area,
> specifically with regard to large numbers of files being parsed. I'd
> look into using the C++ parser from cvs if you need a bit more speed.
> 
>         I just checked some of my largest fetches this week, and a
> single website with 233 external links took 2 minutes 24 seconds to
> fetch and convert completely. The machine that .pdb was generated on
> was a paltry 2.1GHz machine with 1GB of RAM in it (it's a test box).
> Based on your results, if my machine was parsing content for 30-40
> minutes, that would be well over 6,000 separate links in total.
> 
>         That's a LOT of links for a Plucker .pdb.
> 
>         I should also note that I parse and convert Wikipedia and some
> other VERY large texts on a regular basis for testing/benchmarking,
> but those processes take several days at a time and probably are not
> good examples to compare to your source material. Most of the medium
> to large documents I create take under 15 minutes at the most, when
> using the Python distiller.
> 
> > Do you think that it might be a good idea to generate the pdb's
> > directly (instead of generating html files that have to be
> > processed by plucker-build)?
> 
>         If you can deal with the HTML'ized objects in memory as a
> stream, you can certainly do that. The Plucker document format is very
> well documented:
> 
>         http://cvs.plkr.org/index.cgi/*checkout*/docs/DBFormat.html?rev=HEAD
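
To give a feel for the effort involved, here is a minimal sketch of the
generic Palm .pdb container only: the 78-byte header, the record index, and
the raw record bytes. It assumes the standard Palm database layout and the
"Data"/"Plkr" type and creator codes; the contents of each Plucker record
(index record, compressed text, and so on) are described in DBFormat.html
and are not reproduced here. The function name build_pdb is made up for
illustration.

```python
import struct
import time

# Palm timestamps count seconds from 1904-01-01, not 1970-01-01.
PALM_EPOCH_OFFSET = 2082844800

# name, attributes, version, created, modified, backed up, modification
# number, appInfo offset, sortInfo offset, type, creator, unique ID seed,
# next record list, record count. Big-endian, 78 bytes in total.
PDB_HEADER = struct.Struct(">32sHHIIIIII4s4sIIH")

# Per-record index entry: data offset, attributes, 3-byte unique ID.
RECORD_ENTRY = struct.Struct(">IB3s")


def build_pdb(name, records, db_type=b"Data", creator=b"Plkr", now=None):
    """Wrap already-encoded record payloads in a Palm .pdb container."""
    unix_now = int(time.time() if now is None else now)
    stamp = (unix_now + PALM_EPOCH_OFFSET) & 0xFFFFFFFF
    header = PDB_HEADER.pack(
        name.encode("ascii")[:31],  # struct pads with NULs to 32 bytes
        0, 0, stamp, stamp, 0, 0, 0, 0,
        db_type, creator, 0, 0, len(records),
    )
    # Record data starts after the header, the index, and 2 pad bytes.
    offset = PDB_HEADER.size + RECORD_ENTRY.size * len(records) + 2
    entries = []
    for uid, payload in enumerate(records):
        entries.append(RECORD_ENTRY.pack(offset, 0, uid.to_bytes(3, "big")))
        offset += len(payload)
    return header + b"".join(entries) + b"\x00\x00" + b"".join(records)
```

The container is the easy part; most of the work would be in producing
record payloads that the Plucker viewer accepts.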
> 
> > I am asking this question because I am not familiar with the Plucker
> > pdb format, and I'm wondering whether the effort involved in
> > teaching my programs to generate pdb's is worth it.
> 
>         jSyncManager reportedly has some Java classes that can write
> Plucker documents (or at least the author _planned_ to write some, I'm
> not sure how far he got with that, whether they're complete or if they
> even work). You can see the classes here:
> 
>         http://www.jsyncmanager.org/javadoc/v32/index.html
> 
> > Is it possible to avoid the parsing of the html files? Maybe it's
> > possible to feed the contents directly to plucker-build ...
> 
>         In its current implementation, plucker-build (which is just a
> symlink to Spider.py) cannot be fed a stream of data directly. You
> might try using psyco to gain some speed in the distiller if
> your machine truly can't handle parsing your source files.
> 
>         http://psyco.sourceforge.net/
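
For reference, enabling Psyco is usually a few lines at program start-up.
The sketch below assumes it is placed near the top of the script that runs
the parser; where exactly that belongs in the Plucker distiller is not
covered here, and count_words is only a stand-in for a hot parsing loop,
not Plucker code. Psyco is only available for older 32-bit Python 2
interpreters, so the import is guarded.

```python
# Assumption: this runs once at start-up, before any parsing code is called.
try:
    import psyco            # only installable on older 32-bit Python 2
    psyco.full()            # ask Psyco to compile every function it can
    PSYCO_ENABLED = True
except ImportError:
    PSYCO_ENABLED = False   # fall back to the plain interpreter


def count_words(text):
    """Stand-in for a CPU-bound parsing loop (hypothetical example)."""
    total = 0
    for line in text.splitlines():
        total += len(line.split())
    return total
```

Because the import is guarded, the same script still runs unchanged on
machines where Psyco is not installed.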
> 
> 
> David A. Desrosiers
> [EMAIL PROTECTED]
> http://gnu-designs.com
> _______________________________________________
> plucker-list mailing list
> [email protected]
> http://lists.rubberchicken.org/mailman/listinfo/plucker-list
>