The following issue has been SUBMITTED.
======================================================================
http://bugs.librdf.org/mantis/view.php?id=355
======================================================================
Reported By:                scudette
Assigned To:
======================================================================
Project:                    Raptor RDF Parsing and Serializing Library
Issue ID:                   355
Category:                   api
Reproducibility:            always
Severity:                   tweak
Priority:                   normal
Status:                     new
Syntax Name:                Turtle
======================================================================
Date Submitted:             2010-02-23 21:46
Last Modified:              2010-02-23 21:46
======================================================================
Summary: libraptor serializer very slow for large number of objects -
no flush API
Description:
When serializing to Turtle, the raptor_turtle_serialize_statement()
function maintains an AVL tree to group all statements by subject. This
is necessary to ensure that all statements about the same subject are
emitted together, even when the statements are serialized in random
order.
This behaviour is reasonable when statements are issued in random
order. In many applications, however, the statements are issued
essentially in the correct order already: raptor_turtle_serialize_statement()
is called for all the predicates of each subject in turn.

The problem is that maintaining the AVL tree slows
raptor_turtle_serialize_statement() down significantly for a large
number of subjects. Memory is consumed by the AVL tree until
serialize_end() is called, at which point the tree is walked and only
then is any output produced. For a very large number of subjects this
is very slow, and no output at all appears until the very end.

The API really needs a flush() function that can be called when you
know you are done serializing a subject. When flush() is called, the
tree can be freed and all buffered subjects dumped to the stream.

For now I have simulated a flush() function by allowing
serialize_end() to be called as many times as needed:

- I added a code block to free and rebuild the AVL tree in
  raptor_turtle_emit(), which is called from raptor_serialize_end().
- I removed the iostream freeing from raptor_serialize_end() (this
  might leak - maybe it should be moved to raptor_free_serializer()),
  and removed the context->written_header=0 reset in
  raptor_turtle_serialize_end().

http://code.google.com/p/aff4/source/browse/libraptor/raptor_serialize_turtle.c

The end result is that I can call raptor_serialize_end() as often as I
want without the serializer's state being reset. Each call flushes more
data into the iostream, which reduces memory consumption (and also
means progress is made in writing the file). I am calling it about
every 100 subjects.

To give you an idea of the speed improvement: my simple unit test
serialises about 22k objects. Before the change it took about 2 min to
write the file (during which time there was no writing at all until the
very end).
After the change it takes about 20 sec to do the same, and the file is
written progressively. Memory demand is obviously much more modest in
the new code.
======================================================================
Issue History
Date Modified     Username       Field          Change
======================================================================
2010-02-23 21:46  scudette       New Issue
2010-02-23 21:46  scudette       Syntax Name    => Turtle
======================================================================

_______________________________________________
redland-dev mailing list
[email protected]
http://lists.librdf.org/mailman/listinfo/redland-dev
