On 15/08/12 10:09, Richard Eckart de Castilho wrote:
> Hi,
> 
> I am looking for a way to improve loading times in an application, so I did a 
> little experiment with binary CAS serialization to see if it was superior to 
> XMI serialization. For serialization I used the CASCompleteSerializer to 
> serialize the type-system and heaps into the same file using Java object 
> serialization - at least that is what I understood it should do. To read in 
> these files, I would deserialize the CASCompleteSerializer and initialize a 
> CAS from it using CASImpl.reinit().
> 
> 96.400 files
> 
> plain text (uncompressed)      :                 581.865.593 Byte
> binary (serialized java, gzip) : 0:47:02.835   3.555.449.597 Byte 
> xmi (gzip)                     : 1:20:31.535   4.712.633.769 Byte
> 
> So binary takes about 60% of the time xmi serialization would need and uses 
> about 75% of the space.
> I haven't run a reading experiment yet, but I suppose the improvement should 
> be on a similar level, if not better.
> 
> I am also not sure yet about the drawbacks of binary serialization and in 
> which scenarios they apply. The drawbacks I have seen so far are:
> 
> - The type system is stored redundantly in every output file.
> - The type system configured with CASImpl.reinit() may be different from the 
> one which was used to initialize the pipeline, CAS-based annotators relying 
> on typeSystemInit() may not be configured with the correct types - this is a 
> hypothesis I didn't test.
> - Serialized Java objects may become unreadable due to refactoring within 
> the UIMA framework. However, there is yet another binary CAS serialization 
> in UIMA which uses the DataOutputStream and may be more stable.
> 
> Did anybody ever use any form of binary CAS serialization outside 
> Vinci/UIMA-AS?

Not sure this will help, but we originally implemented the binary
serialization to pass the CAS between Java and C++ (in-process).  Used
that way it's blindingly fast because Java and C++ use the same heap
layout for the CAS data.  We have also used it for communication via
SOAP, but I'm not sure I would do that today.  I might prefer a format
that's at least somewhat human readable.  I do not recall ever using the
binary format for serialization to disk, just because I never had that
use case.

The thing with the binary serialization is that type information is
encoded in a binary format as well.  So you need to be sure that when
you read it back in, every type and feature gets assigned the same code,
otherwise the heap is garbage.  That's why you need to be sure to use
the correct, encoded type system as well.
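
To illustrate the point about type codes, here is a minimal sketch in
plain Java.  It is a toy model only and does not reflect UIMA's real heap
layout: the class TypeCodeDemo, the record layout [typeCode, begin, end]
and the type names are all assumptions made up for the example.  It only
shows why the same heap reads differently once the type codes no longer
line up.

```java
/** Toy model only; this is NOT UIMA's actual heap layout. */
final class TypeCodeDemo {

    /** Looks up the type name of the record that starts at addr. */
    static String typeAt(int[] heap, int addr, String[] typeTable) {
        return typeTable[heap[addr]];
    }

    public static void main(String[] args) {
        // Writer's type system: code 0 = Token, 1 = Sentence, 2 = Lemma.
        String[] writerTypes = {"Token", "Sentence", "Lemma"};
        // Reader declares the same types, but in a different order.
        String[] readerTypes = {"Sentence", "Lemma", "Token"};

        // Two records of the (hypothetical) form [typeCode, begin, end].
        int[] heap = {0, 0, 5, 1, 0, 42};

        System.out.println(typeAt(heap, 0, writerTypes)); // prints "Token"
        System.out.println(typeAt(heap, 0, readerTypes)); // prints "Sentence"
    }
}
```

The heap bytes are identical in both calls; only the table used to
interpret the codes differs, and that is enough to turn a Token into a
Sentence.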

However, as I recall, there was a way you could serialize the CAS
without the type system if you were sure you didn't need it.  Isn't that
the difference between the CASCompleteSerializer and the
NotSoCompleteSerializer (making that up here)?  On the way back, you can
deserialize into an existing CAS that has the right type system.
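
The general idea (with or without the type system in the file) can be
sketched with nothing but the JDK.  This deliberately does not use any
UIMA classes: Snapshot, its fields, and the write/read helpers are
hypothetical stand-ins for the real serializer objects, assuming Java
object serialization wrapped in gzip as in Richard's experiment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class SnapshotDemo {

    /** Hypothetical payload: the heap plus an optional type table. */
    static final class Snapshot implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] typeTable; // null = rely on the reader's type system
        final int[] heap;

        Snapshot(String[] typeTable, int[] heap) {
            this.typeTable = typeTable;
            this.heap = heap;
        }
    }

    /** Java object serialization wrapped in gzip, into a byte array. */
    static byte[] write(Snapshot snapshot) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(snapshot);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Inverse of write(): gunzip, then deserialize the object. */
    static Snapshot read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return (Snapshot) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] types = {"Token", "Sentence", "Lemma"};
        int[] heap = {0, 0, 5, 1, 0, 42};

        byte[] complete = write(new Snapshot(types, heap)); // types + heap
        byte[] heapOnly = write(new Snapshot(null, heap));  // heap only

        System.out.println("complete : " + complete.length + " bytes");
        System.out.println("heap only: " + heapOnly.length + " bytes");
        // The reader of the heap-only form must supply the type table itself.
        System.out.println(read(heapOnly).typeTable == null);
    }
}
```

The heap-only variant avoids repeating the type table in every file, at
the price that the reader has to bring a type system with exactly
matching codes.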

Your times above, do they include time needed to do the compression?
I'm surprised binary serialization is not even twice as fast.  Or is
this gated by the disk I/O?
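
For reference, the ratios can be recomputed from the figures quoted
above.  The small sketch below (class and method names are mine) only
does the arithmetic: binary needs roughly 0.58 of the XMI time, which is
a speedup of about 1.7x, and roughly 0.75 of the XMI size.

```java
/** Recomputes the time and size ratios from the quoted measurements. */
final class SerializationRatios {

    /** Converts an h:mm:ss.SSS reading into seconds. */
    static double seconds(int hours, int minutes, double secs) {
        return hours * 3600 + minutes * 60 + secs;
    }

    /** Binary time divided by XMI time (0:47:02.835 vs 1:20:31.535). */
    static double timeRatio() {
        return seconds(0, 47, 2.835) / seconds(1, 20, 31.535);
    }

    /** How many times faster binary was than XMI. */
    static double speedup() {
        return 1.0 / timeRatio();
    }

    /** Binary size divided by XMI size, both gzip-compressed, in bytes. */
    static double sizeRatio() {
        return 3_555_449_597.0 / 4_712_633_769.0;
    }

    public static void main(String[] args) {
        System.out.println("time ratio (binary/xmi): " + timeRatio());
        System.out.println("speedup    (xmi/binary): " + speedup());
        System.out.println("size ratio (binary/xmi): " + sizeRatio());
    }
}
```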

--Thilo

> 
> Cheers,
> 
> -- Richard
> 
