Fri, 04 Mar 2011 20:17:19 +0100, Platonides <platoni...@gmail.com> wrote:
> Seb35 wrote:
>> Krinkle wrote:
>>> How much is "too much memory" ?
>>
>> We needed to transform and crop TIFF images, read an XML associated with
>> a book containing the OCRized text of the digitized book, and create a
>> DjVu with the images and the text layer.
>>
>> For that we rented a server. I cannot remember exactly the hardware we
>> chose, but it was probably a 4-core (or 8-core) machine with 4GB (or
>> 8GB) of RAM and 200-300GB of disk, plus server-grade bandwidth, which
>> was useful for downloading the files from the FTP of the BnF: about 500
>> files per book (1 XML/page + multipage TIFF + some others) x 1416 books
>> = 2-3 days of download on the server because of the many small files.
>>
>> From what I remember, "too much memory" means that my laptop (2-core
>> 2.8GHz, 3GB of RAM), on which I developed the (Python) program, had
>> difficulties loading the whole XML file (with DOM). Then I tried with
>> SAX and the work was done in a few seconds without a lot of memory (I
>> hadn't used SAX before, but I ♥ SAX now :-)
>>
>> We wrote a technical report about that, but haven't published it yet
>> (perhaps one day, I hope); you can see
>> <http://commons.wikimedia.org/wiki/Commons:Biblioth%E8que_nationale_de_France>
>> for an "outreach" document and
>> <https://fisheye.toolserver.org/browse/Seb35/BnF_import> for the Python
>> program.
>>
>> Seb35
>
> It is important to use the right tools. As you mention, such big XML
> files need to be processed on-the-fly, not by loading them in memory.
> You mention a server with 4 or 8 cores. Was your program multithreaded
> (or otherwise running several instances)? Are those 24h single-threaded?
>
> Also, those instances happened once, and are quite different, so it's
> probably better to ask about the needed resources when you know what you
> will need next.
> What you mention doesn't seem too much for the toolserver. You should be
> able to use enough disk space, and the task could be run in the
> background, so its CPU use wouldn't need to affect other users
> (especially given that there are no fixed time constraints). Memory
> could be a problem, though, depending on the amount used and for how
> long. SGE can probably show some memory usage graphs from which to
> deduce the amount available for this kind of project.

Thanks for all these responses; we will ask next time before renting a
server for such a purpose.
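
To illustrate the DOM versus SAX point quoted above, here is a minimal
sketch of streaming a page's text with Python's standard xml.sax module.
It is not the actual BnF import code: the <word> element name and the
sample document are hypothetical, chosen only to show the technique.

```python
import io
import xml.sax


class WordCollector(xml.sax.ContentHandler):
    """Collect the text of every <word> element (hypothetical tag name)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "word":
            self._in_word = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._in_word:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "word":
            self.words.append("".join(self._buffer))
            self._in_word = False


def extract_words(source):
    """Stream-parse `source` (a file name or a binary file-like object)."""
    handler = WordCollector()
    xml.sax.parse(source, handler)
    return handler.words


# Hypothetical sample input standing in for one OCR page.
sample = b"<page><line><word>Bonjour</word> <word>monde</word></line></page>"
words = extract_words(io.BytesIO(sample))
```

The parser never builds a tree of the whole document; only the handler's
own state is kept. For very large files the handler would write each word
or page out as it goes rather than accumulate everything in a list.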

We used multiple threads (easy with Python; 4 threads according to the
program on FishEye, so it was probably a 4-core server), but most of the
time was spent on disk accesses, so the equivalent single-threaded time
should be about 2 to 2.5 times our 24 hours, i.e. roughly 48 to 60 hours.
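
A minimal sketch of running the per-book work on four worker threads,
assuming one independent task per book. It uses the standard
concurrent.futures module, which may well differ from what the program on
FishEye does; process_book and the output file names are hypothetical
placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def process_book(book_id):
    """Placeholder for the real per-book work (crop TIFFs, parse XML,
    build the DjVu). Here it only returns a hypothetical output name."""
    return "book-%04d.djvu" % book_id


def process_all(book_ids, workers=4):
    """Process every book on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields the results in the same order as the inputs.
        return list(pool.map(process_book, book_ids))


results = process_all(range(1, 6))
```

Threads suit this kind of job because the workers spend most of their
time waiting on the disk, so one thread can work while another waits.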

Seb35

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette
