Yes, I know Java serialization.

I think it depends on the perspective and the use case. I added a header
to the serialized outputs since I see them as binary formats and I
thought that all binary formats should get the same header. Then I
removed it again, then I added it again. I will remove it again now.
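
To make the header idea concrete, here is a minimal sketch of how a flag
bit such as the 0x08 "type system included" bit (mentioned in my earlier
mail quoted below) could be set and tested. The class name, the method
names and the example value 0x01 are assumptions for illustration only;
this is not the actual CommonSerDes code.

```java
public class HeaderFlagsSketch {
    // Only the value 0x08 is taken from this thread; everything else
    // in this class is a hypothetical illustration.
    static final int FLAG_TYPE_SYSTEM_INCLUDED = 0x08;

    /** Returns true if the "type system included" bit is set in the version word. */
    static boolean hasTypeSystem(int versionWord) {
        return (versionWord & FLAG_TYPE_SYSTEM_INCLUDED) != 0;
    }

    /** Returns a copy of the version word with the "type system included" bit set. */
    static int withTypeSystem(int versionWord) {
        return versionWord | FLAG_TYPE_SYSTEM_INCLUDED;
    }

    public static void main(String[] args) {
        int plain = 0x01;                      // assumed example value of a version word
        int flagged = withTypeSystem(plain);   // same word with the extra bit switched on
        System.out.println(hasTypeSystem(plain));
        System.out.println(hasTypeSystem(flagged));
    }
}
```

A reader that does not know the bit would simply ignore it, which is why
an extra bit is attractive from a compatibility point of view, provided
nothing else in the stream changes.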


I don't think that we will get an optimal solution right away: for
example, the header is read twice, and the previous uimaj method should
ideally return the format. We should get this up and running for the
release without breaking backwards compatibility, and then think about
what it should look like and whether further functionality or
refactoring is required.
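
As an illustration of the "header is read twice" point, the following
sketch peeks at a header and then rewinds the stream, so that an existing
deserialization method can read the same header again from the start.
The 4-byte magic value and the layout (magic followed by one int) are
assumptions made for this sketch, and the class and method names are
hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.OptionalInt;

public class HeaderPeekSketch {
    // Assumed magic number ("UIMA" in ASCII); an assumption for this sketch.
    static final int MAGIC = 0x55494D41;

    /**
     * Reads the assumed magic and the int that follows it, then resets the
     * stream to where it was. Returns empty if the magic does not match or
     * the stream is shorter than 8 bytes.
     */
    static OptionalInt peekVersionWord(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8); // we read at most 8 bytes before resetting
        try {
            DataInputStream dis = new DataInputStream(in);
            if (dis.readInt() != MAGIC) {
                return OptionalInt.empty();
            }
            return OptionalInt.of(dis.readInt());
        } catch (EOFException e) {
            return OptionalInt.empty();
        } finally {
            in.reset(); // a later reader sees the header again
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x55, 0x49, 0x4D, 0x41, 0, 0, 0, 0x09, 42};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(peekVersionWord(in)); // header inspected once here
        System.out.println(in.read() == 0x55);   // stream is back at the first byte
    }
}
```

The cost of this approach is exactly the double read; avoiding it would
need the older method to hand back what it parsed.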


I used uimaj-core 2.8.1. Here are some errors:

simpleCas.bins0
org.apache.uima.cas.CASRuntimeException: No sofaFS for specified sofaRef
found.simpleCas.bins4
    at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:806)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(FSIndexRepositoryImpl.java:2781)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS(FSIndexRepositoryImpl.java:2763)
    at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(FSIndexRepositoryImpl.java:2068)
    at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(CASImpl.java:1765)
    at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1488)
    at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
    at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
    at tutorial.entity.LoadCas.main(LoadCas.java:55)
org.apache.uima.cas.CASRuntimeException: Error trying to read BLOB data from an input stream and deserialize into a CAS.
    at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1591)
    at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
    at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
    at tutorial.entity.LoadCas.main(LoadCas.java:39)

simpleCas.bins6
java.io.EOFException
    at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
    at org.apache.uima.util.impl.DataIO.readVlong(DataIO.java:355)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.readVlong(BinaryCasSerDes6.java:2193)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.readDiff(BinaryCasSerDes6.java:2102)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.readLongOrDouble(BinaryCasSerDes6.java:2128)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.readByKind(BinaryCasSerDes6.java:1920)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1748)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.deserialize(BinaryCasSerDes6.java:1596)
    at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:270)
    at tutorial.entity.LoadCas.main(LoadCas.java:47)



On 22.07.2016 at 21:17, Marshall Schor wrote:
> I think the model for these two formats is more general than what you are
> imagining.  These are formats that follow the standard Java serialization
> standard, see for example,
> https://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html
>
> The bytes corresponding to the serialized form are expected to (in general) be
> written anywhere in a data output stream, perhaps preceded or followed by
> (maybe many) other serialized objects; the overall format of that stream is up
> to the user designing it, including any headers the user might decide on.
>
> In the data output stream, each data object, including one representing the
> CAS, for example, has a format dictated by the Java standard for object
> serialization.
>
> What error do you get when you try to deserialize a CAS object in a data
> stream with an older version of UIMA?
>
> -Marshall
>
> On 7/22/2016 9:31 AM, Peter Klügl wrote:
>> So SERIALIZED and SERIALIZED_TS get no header?
>>
>>
>> Can you try to deserialize the CAS files created by the unit test with
>> an older version of uima? I cannot get it to work.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>> On 22.07.2016 at 15:18, Marshall Schor wrote:
>>> Re: The java-serialized formats now have also a binary header
>>>
>>> Not sure what you mean by java-serialized formats.  Perhaps this means the
>>> formats created by using standard Java Object serialization on the special
>>> objects in UIMA built for this.
>>>
>>> If so, then it seems this would break backwards compatibility, in that a
>>> user serializing with UIMA 2.9.0, but not using any new features, could not
>>> have that "read" by an older version of UIMA.
>>>
>>>
>>> -Marshall
>>>
>>> On 7/22/2016 7:43 AM, Peter Klügl wrote:
>>>> Hi,
>>>>
>>>>
>>>> I changed CasIOUtils to use the Header and I extended the header with a
>>>> bit (0x08) indicating an included type system. There is no information
>>>> about the serialization of the type system yet. The java-serialized formats
>>>> now also have a binary header, as I did not want to make the header
>>>> serializable; it should be read/written by the same functionality.
>>>>
>>>> I had thought that old UIMA versions (e.g., 2.8.1) should be able to
>>>> load new CAS files, but my tests failed.  No idea yet why. I am overall
>>>> not very happy with the current solution, but I could live with it.
>>>>
>>>> Maybe someone wants to take a look at it?
>>>>
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> On 20.07.2016 at 14:30, Peter Klügl wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> I'll try to find the time to do these changes this week, next week at the
>>>>> latest.
>>>>>
>>>>>
>>>>> btw, input stream sniffing in order to distinguish XMI and XCAS is
>>>>> currently not supported. There could be a lot of text before the
>>>>> relevant element occurs, e.g., license text.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 20.07.2016 at 14:19, Marshall Schor wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We can change the header, but:
>>>>>>
>>>>>> The changed header ought to be "readable" by previous versions of UIMA.  
>>>>>>
>>>>>> For XMI and XCAS, these do not currently have special headers, and if we
>>>>>> added these, those formats could not be read by older versions of UIMA.
>>>>>> Those formats contain sufficient distinguishing initial strings to
>>>>>> distinguish them, though.
>>>>>>
>>>>>> The XMI format is specified, also, in an OASIS standard which the UIMA
>>>>>> project is said to (mostly) follow:
>>>>>> http://uima.apache.org/uima-specification.html
>>>>>>
>>>>>> For binary serializations, I think there's room in the header for an
>>>>>> extra bit, which if on, could indicate that a type system was included.
>>>>>> I think it would be good to have a header extension, when type systems
>>>>>> are included, to specify the format and version of the type system
>>>>>> serialization.
>>>>>>
>>>>>> Most serializations in core UIMA have not included the type system.  The
>>>>>> one which does is CASCompleteSerializer.  This is a "serializable" (using
>>>>>> standard Java serializations) object containing serializable forms of the
>>>>>> CAS and Type System.
>>>>>>
>>>>>> Regarding making methods in CommonSerDes public:
>>>>>>
>>>>>> It is fine to make them public in the sense that they are accessible
>>>>>> from other packages, not in a sub-type hierarchy.  But I think it is best
>>>>>> to not include CommonSerDes in a package which is intended for end-users,
>>>>>> because the end user UIMA APIs should be (as much as possible) stable
>>>>>> over a long time period.  Details of how we evolve headers, etc., should
>>>>>> not disturb end users, if possible; keeping these as public but in
>>>>>> packages with names like xxx.impl or xyz.internal.abc etc. is the way
>>>>>> this has been traditionally done.  It allows us to evolve these without
>>>>>> affecting end-user APIs.
>>>>>>
>>>>>> Just to be clear: I would not consider uimaFIT and Ruta to be
>>>>>> "end-users", as they are developed within the UIMA project, and we are
>>>>>> willing to evolve them together with UIMA core changes.
>>>>>>
>>>>>> We don't have a deadline for the next release, but it's mostly ready to
>>>>>> go, and will solve a significant issue for people wanting to upgrade
>>>>>> their Eclipse to Neon :-).
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>> On 7/20/2016 5:03 AM, Peter Klügl wrote:
>>>>>>> Ok, after looking at the code I must admit that there is much more to do
>>>>>>> than I expected. We first need to discuss several things:
>>>>>>>
>>>>>>> - can we change the header at all?
>>>>>>>
>>>>>>> - do we support type system inclusion in the header?
>>>>>>>
>>>>>>> - do we support type system inclusion in the serialized files?
>>>>>>>
>>>>>>> - which serial format is which?
>>>>>>>
>>>>>>> - can we make the methods in CommonSerDes public?
>>>>>>>
>>>>>>>
>>>>>>> What is the deadline for the release? I am now quite loaded with work
>>>>>>> until next Wednesday :-(
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>> On 19.07.2016 at 22:39, Marshall Schor wrote:
>>>>>>>> Great.
>>>>>>>>
>>>>>>>> There's now also common code for writing / reading UIMA serialization
>>>>>>>> headers, in
>>>>>>>>
>>>>>>>> CommonSerDes (in org.apache.uima.cas.impl )
>>>>>>>>
>>>>>>>> This includes the extensions to support versioning the serializations,
>>>>>>>> which start to be needed in the next release because a bug fix is
>>>>>>>> slightly changing the serialized form for **delta binary** CAS.
>>>>>>>>
>>>>>>>> So, it would be good to use that rather than have another separate
>>>>>>>> header reader/writer to maintain.
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/19/2016 4:13 PM, Peter Klügl wrote:
>>>>>>>>> Ah, I didn't know that enum. I'll adapt the code and enum.
>>>>>>>>>
>>>>>>>>> On 19.07.2016 at 20:09, Marshall Schor wrote:
>>>>>>>>>> We already have an enum in the core for various serial formats.  The
>>>>>>>>>> class is
>>>>>>>>>>
>>>>>>>>>> public enum SerialFormat {
>>>>>>>>>>    UNKNOWN,
>>>>>>>>>>    XCAS,                  // with reachability filtering
>>>>>>>>>>    XMI,                   // with reachability filtering
>>>>>>>>>>    BINARY,                // no filtering
>>>>>>>>>>    COMPRESSED,            // no filtering  (form 4)
>>>>>>>>>>    COMPRESSED_FILTERED,   // with reachability and type and feature filtering (form 6)
>>>>>>>>>>    COMPRESSED_PROJECTION, // with subset of views
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> (I don't think COMPRESSED_PROJECTION is in use...)
>>>>>>>>>>
>>>>>>>>>> This has been around for maybe 3 years.  I would be in favor of
>>>>>>>>>> considering using and/or extending this as needed, rather than having
>>>>>>>>>> two formats (that is, the proposed SerializationFormat class).
>>>>>>>>>>
>>>>>>>>>> -Marshall
>>>>>>>>>>
>>>>>>>>>> On 7/19/2016 2:49 AM, Peter Klügl wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> yes, the class should be officially available to external code. I
>>>>>>>>>>> already included it in the CAS Editor and in Ruta. I also plan to
>>>>>>>>>>> use it in our inhouse code. I'll change the enforcer rule.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I can write the docs but any help is welcome since I do not know how
>>>>>>>>>>> much spare time I have for the rest of the week for this. I'll take
>>>>>>>>>>> a look at where the documentation should be added. I haven't looked
>>>>>>>>>>> at it for some time ;-)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I just chose the name of the class Richard contributed since I
>>>>>>>>>>> thought it is really suitable. Then, I also noticed the uimaFIT
>>>>>>>>>>> class. This is not a really good situation, but I would not change
>>>>>>>>>>> the name because of it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I would not split the API from the implementation. I do not see any
>>>>>>>>>>> advantages right now. The class is just a simple utils class with
>>>>>>>>>>> only static methods like CasCreationUtils (which is also not
>>>>>>>>>>> separated).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> On 18.07.2016 at 22:26, Marshall Schor wrote:
>>>>>>>>>>>> This is OK with me.  I can even volunteer to write the docs (but
>>>>>>>>>>>> am happy to let others do it :-) ).
>>>>>>>>>>>>
>>>>>>>>>>>> I'll wait to hear about the split (if any) between the public API
>>>>>>>>>>>> and the impl.
>>>>>>>>>>>>
>>>>>>>>>>>> And, we'll need to change the next version # to 2.9.0, from 2.8.2,
>>>>>>>>>>>> due to this being that kind of a change.
>>>>>>>>>>>>
>>>>>>>>>>>> Is everyone OK with all of this?
>>>>>>>>>>>>
>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/18/2016 2:39 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>> I believe the intention is that this class becomes part of the
>>>>>>>>>>>>> public API.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, my understanding is that it would do a superset of what the
>>>>>>>>>>>>> uimaFIT class by the same name does. We could then probably
>>>>>>>>>>>>> deprecate the respective uimaFIT class and suggest using the core
>>>>>>>>>>>>> class instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Richard
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 18.07.2016, at 20:30, Marshall Schor wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a new class added to uimaj-core project, in
>>>>>>>>>>>>>> org.apache.uima.util package.  This is fine if this is to be part
>>>>>>>>>>>>>> of the official public APIs supported by UIMA going forward; but
>>>>>>>>>>>>>> if that is the case, it should probably be documented in the UIMA
>>>>>>>>>>>>>> docs, and we'd have to change the version number (due to enforcer
>>>>>>>>>>>>>> rules).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If this is more of an internal-use utility, then it should be in
>>>>>>>>>>>>>> one of the internal use packages, such as
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    org.apache.uima.internal.util
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This class is similarly named to a uimaFIT class; are these
>>>>>>>>>>>>>> related?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If some of the APIs are to be permanent and public and part of
>>>>>>>>>>>>>> the official public APIs, but some are internal implementation
>>>>>>>>>>>>>> details, please consider using an interface and an ".impl" (or
>>>>>>>>>>>>>> equivalent) approach; packages which support these are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    org.apache.uima.util  and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    org.apache.uima.util.impl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If this is only an internal kind of change, not intending to
>>>>>>>>>>>>>> affect the official UIMA APIs, then moving to the internal.util
>>>>>>>>>>>>>> package will fix the "enforcer" error the build is currently
>>>>>>>>>>>>>> getting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>>
