Fri, 04 Mar 2011 20:17:19 +0100, Platonides <platoni...@gmail.com> wrote:
> Seb35 wrote:
>> Krinkle wrote:
>>> How much is "too much memory"?
>>
>> We needed to transform and crop TIFF images, read an XML file
>> associated with a book containing the OCRized text of the digitized
>> book, and create a DjVu with the images and the text layer.
>>
>> For that we rented a server. I cannot remember exactly which hardware
>> we chose, but it was probably a 4-core (or 8-core) machine with 4 GB
>> (or 8 GB) of RAM and 200-300 GB of disk (and server bandwidth, useful
>> for downloading the files from the FTP of the BnF: about 500 files per
>> book (1 XML/page + multipage TIFF + some others) x 1416 books = 2-3
>> days of download on the server because of the many small files).
>>
>> From what I remember, "too much memory" means that my laptop (2-core
>> 2.8 GHz, 3 GB of RAM), on which I developed the (Python) program, had
>> difficulties loading the whole XML file (with DOM). Then I tried SAX
>> and the work was done in a few seconds without much memory (I hadn't
>> used SAX before, but I ♥ SAX now :-)
>>
>> We wrote a technical report about that, but haven't published it yet
>> (perhaps one day, I hope); you can see
>> <http://commons.wikimedia.org/wiki/Commons:Biblioth%E8que_nationale_de_France>
>> for an "outreach" document and
>> <https://fisheye.toolserver.org/browse/Seb35/BnF_import> for the Python
>> program.
>>
>> Seb35
>
> It is important to use the right tools. As you mention, such big XML
> files need to be processed on the fly, not by loading them into memory.
> You mention a server with 4 or 8 cores. Was your program multithreaded
> (or otherwise running several instances)? Or was that 24 hours
> single-threaded?
>
> Also, those instances happened once, and are quite different, so it's
> probably better to ask about the needed resources when you know what
> you will need next.
> What you mention doesn't seem too much for the Toolserver. You should
> be able to use enough disk space, and the task could be run in the
> background, so CPU wouldn't need to affect other users (especially
> given that there are no fixed time constraints). Memory could be a
> problem, though, depending on the amount used and for how long. SGE can
> probably show some memory-usage graphs from which to deduce the amount
> available for this kind of project.
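For readers who haven't made the DOM-to-SAX switch described above: the difference is that a SAX parser streams the document and fires callbacks per element, so memory use stays roughly constant regardless of file size. Below is a minimal sketch using Python's standard `xml.sax` module. The `<line>` element name and the input snippet are hypothetical, for illustration only; the BnF XML format is not described in this thread.

```python
import xml.sax

# Minimal SAX handler that collects the text of (hypothetical) <line>
# elements as they stream past, instead of building a full DOM tree.
class TextLayerHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.lines = []      # collected text lines
        self._buf = None     # accumulates character data inside a <line>

    def startElement(self, name, attrs):
        if name == "line":
            self._buf = []   # start buffering text for this line

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "line":
            self.lines.append("".join(self._buf))
            self._buf = None

handler = TextLayerHandler()
xml.sax.parseString(
    b"<page><line>Lorem ipsum</line><line>dolor sit amet</line></page>",
    handler,
)
print(handler.lines)  # ['Lorem ipsum', 'dolor sit amet']
```

For a real multi-megabyte file you would call `xml.sax.parse("page.xml", handler)` instead of `parseString`; the handler code is unchanged.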
Thanks for all these responses; we will ask next time before renting a
server for such a purpose. We did use multiple threads (easy with Python;
the program on FishEye uses 4 threads, so it was probably a 4-core
server), but most of the time was spent on disk access, so the equivalent
single-threaded time would be about 2x or 2.5x our 24-hour figure.

Seb35

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette