Hmmm, this seems like unusual behavior. It would help a lot in diagnosing this if you could construct a small test case: one which perhaps creates a CAS, fills it with a bit of data, does the compressed serialization, resets the CAS, and loops, to see if that produces "expanding" serializations.
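Something along these lines might serve as a starting skeleton for such a test case. It is only a rough sketch: the class name SerializationGrowthCheck and its Iteration interface are made up for illustration and are not part of UIMA, and the loop body in main is a stand-in that writes a fixed payload so the sketch runs with the JDK alone. In the real test, the loop body would fill the CAS, call Serialization.serializeWithCompression(cas, out) and then cas.reset().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical harness, not part of UIMA: it only records serialized sizes.
class SerializationGrowthCheck {

    /** One loop iteration: fill the CAS, serialize it to 'out', reset the CAS. */
    interface Iteration {
        void run(OutputStream out) throws IOException;
    }

    /** Runs the iteration 'rounds' times and returns the bytes written in each round. */
    static long[] measure(int rounds, Iteration iteration) {
        long[] sizes = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            // A fresh in-memory stream per round, so file handling cannot affect the size.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try {
                iteration.run(out);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            sizes[i] = out.size();
        }
        return sizes;
    }

    /** True when every round produced strictly more bytes than the one before. */
    static boolean isGrowing(long[] sizes) {
        if (sizes.length < 2) {
            return false;
        }
        for (int i = 1; i < sizes.length; i++) {
            if (sizes[i] <= sizes[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in body (assumption) so the sketch runs without UIMA on the classpath.
        // In the real test this would be roughly:
        //   add the same small amount of data to the CAS;
        //   Serialization.serializeWithCompression(cas, out);
        //   cas.reset();
        byte[] payload = new byte[256];
        long[] sizes = measure(10, out -> out.write(payload));
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("round " + i + ": " + sizes[i] + " bytes");
        }
        System.out.println(isGrowing(sizes) ? "sizes keep growing" : "sizes are stable");
    }
}
```

Measuring the bytes actually written in each round, rather than logging cas.size(), keeps the serialized size and the in-memory CAS size apart, which is the comparison this thread is about.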
-- if it does, please post the test case to a Jira and we'll diagnose / fix this :-)

-- if it doesn't, then you have to get closer to your actual use case and iterate until you see what it is that you last added that starts making it serialize ever-expanding instances. That will be a big clue, I think.

-Marshall

On 1/12/2016 10:54 AM, D. Heinze wrote:
> The CAS.size() starts as larger than the serializedWithCompression version,
> but eventually the serializedWithCompression version grows to be larger than
> the CAS.size().
> The overall process is:
> * Create a new CAS
> * Read in an xml document and store the structure and content in the cas.
> * Tokenize and parse the document and store that info in the cas.
> * Run a number of lexical engines and ConceptMapper engines on the data and
>   store that info in the cas
> * Produce an xml document with the content of the original input document
>   marked up with the analysis results and both write that out to a file and
>   also store it in the cas
> * serializeWithCompression to a FileOutputStream
> * cas.reset()
> * iterate on the next input document
> All the work other than creating and cas.reset() is done using the JCas.
> Even though the output CASes keep getting larger, they seem to deserialize
> just fine and are usable.
> Thanks/Dan
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Tuesday, January 12, 2016 2:45 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> Is the CAS.size() larger than the serialized version or smaller?
> What are you actually doing to the CAS? Just serializing/deserializing a
> couple of times in a row, or do you actually add feature structures?
> The sample code you show doesn't give any hint about where the CAS comes
> from and what is being done with it.
>
> -- Richard
>
>> On 12.01.2016, at 03:06, D. Heinze <dhei...@gnoetics.com> wrote:
>>
>> I'm having a problem with CAS serializationWithCompression. I am
>> processing a few million text documents on an IBM P8 with 16 physical
>> SMTP 8 cpus, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>
>> I run 55 UIMA pipelines concurrently. I'm using UIMA 2.6.0.
>>
>> I use serializeWithCompression to save the final state of the
>> processing on each document to a file for later processing.
>>
>> However, the size of the serialized CAS just keeps growing. The size
>> of the CAS is stable, but the serialized CASes just keep getting
>> bigger. I even went to creating a new CAS for each process instead of
>> using cas.reset(). I have also tried writing the serialized CAS to a
>> byte array output stream first and then to a file, but it is the
>> serializeWithCompression that caused the size problem, not writing the
>> file.
>>
>> Here's what the code looks like. Flushing or not flushing does not
>> make a difference. Closing or not closing the file output stream does
>> not make a difference (other than leaking memory). I've also tried
>> doing serializeWithCompression with type filtering. Wanted to try
>> using a Marker, but cannot see how to do that. The problem exists
>> regardless of doing 1 or 55 pipelines concurrently.
>>
>> File fout = new File(documentPath);
>>
>> fos = new FileOutputStream(fout);
>>
>> org.apache.uima.cas.impl.Serialization.serializeWithCompression(
>> cas, fos);
>>
>> fos.flush();
>>
>> fos.close();
>>
>> logger.info( "serializedCas size " + cas.size() + " ToFile " +
>> documentPath);
>>
>> Suggestions will be appreciated.
>>
>> Thanks / Dan