Hi,

the errors were on my side. Reading the CASes created by the unit test of
CasIOUtils with UIMA 2.8.1 works fine now.

Can I do something else for this ticket?

Best,

Peter

On 25.07.2016 at 08:43, Peter Klügl wrote:
> Yeah, I know Java serialization.
>
> I think it depends on the perspective and the use case. I added a header
> to the serialized outputs since I see them as binary formats and I
> thought that all binary formats should get the same header. Then, I
> removed it again, then I added it again. I will remove it again now.
>
> I don't think that we will get an optimal solution, e.g., the header is
> read twice, the previous uimaj method should return the format and so
> on. We should get this up and running for the release without breaking
> backwards compatibility and then think what it should look like, and if
> further functionality/refactoring is required.
>
> I used uimaj-core 2.8.1. Here are some errors:
>
> simpleCas.bins0
> org.apache.uima.cas.CASRuntimeException: No sofaFS for specified sofaRef found.simpleCas.bins4
>     at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:806)
>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(FSIndexRepositoryImpl.java:2781)
>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS(FSIndexRepositoryImpl.java:2763)
>     at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(FSIndexRepositoryImpl.java:2068)
>     at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(CASImpl.java:1765)
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1488)
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>     at tutorial.entity.LoadCas.main(LoadCas.java:55)
> org.apache.uima.cas.CASRuntimeException: Error trying to read BLOB data from an input stream and deserialize into a CAS.
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1591)
>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>     at tutorial.entity.LoadCas.main(LoadCas.java:39)
>
> simpleCas.bins6
> java.io.EOFException
>     at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
>     at org.apache.uima.util.impl.DataIO.readVlong(DataIO.java:355)
>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readVlong(BinaryCasSerDes6.java:2193)
>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readDiff(BinaryCasSerDes6.java:2102)
>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readLongOrDouble(BinaryCasSerDes6.java:2128)
>     at org.apache.uima.cas.impl.BinaryCasSerDes6.readByKind(BinaryCasSerDes6.java:1920)
>     at org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1748)
>     at org.apache.uima.cas.impl.BinaryCasSerDes6.deserialize(BinaryCasSerDes6.java:1596)
>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:270)
>     at tutorial.entity.LoadCas.main(LoadCas.java:47)
>
> On 22.07.2016 at 21:17, Marshall Schor wrote:
>> I think the model for these two formats is more general than what you are
>> imagining. These are formats that follow the Java serialization standard,
>> see for example,
>> https://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html
>>
>> The bytes corresponding to the serialized form are expected to (in general)
>> be written anywhere in a data output stream, perhaps preceded or followed by
>> (maybe many) other serialized objects; the overall format of that stream is
>> up to the user designing it, including any headers the user might decide on.
>>
>> In the data output stream, each data object, including one representing the
>> CAS, for example, has a format dictated by the Java standard for object
>> serialization.
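To illustrate the point about serialized objects sitting anywhere in a user-designed stream, here is a minimal sketch using only the JDK. The names `SerializedStreamSketch` and `Payload` and the integer header are hypothetical stand-ins, not UIMA classes; in UIMA, a serializable CAS object would presumably take the place of `Payload`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializedStreamSketch {

    // Hypothetical stand-in for a serializable CAS-like object.
    static final class Payload implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;

        Payload(String text) {
            this.text = text;
        }
    }

    // Writes a user-chosen header followed by two serialized objects.
    // The header is part of the stream layout, not of either object.
    static byte[] write(int userHeader, Payload first, Payload second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(userHeader);
            out.writeObject(first);
            out.writeObject(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the stream back in the same order it was written.
    static String read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            int header = in.readInt();
            Payload first = (Payload) in.readObject();
            Payload second = (Payload) in.readObject();
            return header + ":" + first.text + "," + second.text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The reader has to know the stream layout (header first, then two objects). That is why prepending a new header to an existing serialized form can break older readers that expect the object to come first.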
>>
>> What error do you get when you try to deserialize a CAS object in a data
>> stream with an older version of UIMA?
>>
>> -Marshall
>>
>> On 7/22/2016 9:31 AM, Peter Klügl wrote:
>>> So SERIALIZED and SERIALIZED_TS get no header?
>>>
>>> Can you try to deserialize the CAS files created by the unit test with
>>> an older version of UIMA? I cannot get it to work.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 22.07.2016 at 15:18, Marshall Schor wrote:
>>>> Re: The java-serialized formats now also have a binary header
>>>>
>>>> Not sure what you mean by java-serialized formats. Perhaps this means the
>>>> formats created by using standard Java Object serialization on the special
>>>> objects in UIMA built for this.
>>>>
>>>> If so, then it seems this would break backwards compatibility, in that a
>>>> user serializing with UIMA 2.9.0, but not using any new features, could not
>>>> have that "read" by an older version of UIMA.
>>>>
>>>> -Marshall
>>>>
>>>> On 7/22/2016 7:43 AM, Peter Klügl wrote:
>>>>> Hi,
>>>>>
>>>>> I changed CasIOUtils to use the Header and I extended the header with a
>>>>> bit (0x08) indicating an included type system. No information about the
>>>>> serialization of the type system yet. The java-serialized formats now
>>>>> also have a binary header as I did not want to make the header
>>>>> serializable as it should be read/written by the same functionality.
>>>>>
>>>>> I had thought that old UIMA versions (e.g., 2.8.1) should be able to
>>>>> load new CAS files, but my tests failed. No idea yet why. I am overall
>>>>> not very happy with the current solution, but I could live with it.
>>>>>
>>>>> Maybe someone wants to take a look at it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> On 20.07.2016 at 14:30, Peter Klügl wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'll try to find the time to do these changes this week, next week at the
>>>>>> latest.
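The type-system flag mentioned in this thread (bit 0x08 in the binary header) can be sketched as follows. This assumes the header flags fit in a single int; the actual header layout in CommonSerDes is not shown in this thread, and the class and method names below are hypothetical.

```java
final class HeaderFlagsSketch {

    // Bit value taken from the thread: set when a type system is included.
    // Assumption: the flags live in one int of the header.
    static final int TYPE_SYSTEM_INCLUDED = 0x08;

    // Returns the header with the type-system bit switched on,
    // leaving all other bits untouched.
    static int withTypeSystem(int header) {
        return header | TYPE_SYSTEM_INCLUDED;
    }

    // Tests whether the type-system bit is set.
    static boolean hasTypeSystem(int header) {
        return (header & TYPE_SYSTEM_INCLUDED) != 0;
    }
}
```

A reader that does not know about the bit would simply ignore it, which is only safe if the bytes following the header are unchanged. That is the backwards-compatibility concern raised above.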
>>>>>>
>>>>>> btw, input stream sniffing in order to distinguish XMI and XCAS is
>>>>>> currently not supported. There could be a lot of text before the
>>>>>> relevant element occurs, e.g., license text.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 20.07.2016 at 14:19, Marshall Schor wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We can change the header, but:
>>>>>>>
>>>>>>> The changed header ought to be "readable" by previous versions of UIMA.
>>>>>>>
>>>>>>> For XMI and XCAS, these do not currently have special headers, and if we
>>>>>>> added these, those formats could not be read by older versions of UIMA.
>>>>>>> Those formats contain sufficient distinguishing initial strings to
>>>>>>> distinguish them, though.
>>>>>>>
>>>>>>> The XMI format is specified, also, in an OASIS standard which the UIMA
>>>>>>> project is said to (mostly) follow:
>>>>>>> http://uima.apache.org/uima-specification.html
>>>>>>>
>>>>>>> For binary serializations, I think there's room in the header for an
>>>>>>> extra bit, which if on, could indicate that a type system was included.
>>>>>>> I think it would be good to have a header extension, when type systems
>>>>>>> are included, to specify the format and version of the type system
>>>>>>> serialization.
>>>>>>>
>>>>>>> Most serializations in core UIMA have not included the type system. The
>>>>>>> one which does is CASCompleteSerializer. This is a "serializable" (using
>>>>>>> standard Java serializations) object containing serializable forms of
>>>>>>> the CAS and Type System.
>>>>>>>
>>>>>>> Regarding making methods in CommonSerDes public:
>>>>>>>
>>>>>>> It is fine to make them public in the sense that they are accessible from
>>>>>>> other packages, not in a sub-type hierarchy.
>>>>>>> But I think it is best not to include CommonSerDes in a package which is
>>>>>>> intended for end-users, because the end-user UIMA APIs should be (as much
>>>>>>> as possible) stable over a long time period. Details of how we evolve
>>>>>>> headers, etc., should not disturb end users, if possible; keeping these
>>>>>>> as public but in packages with names like xxx.impl or xyz.internal.abc
>>>>>>> etc. is the way this has been traditionally done. It allows us to evolve
>>>>>>> these without affecting end-user APIs.
>>>>>>>
>>>>>>> Just to be clear: I would not consider uimaFIT and Ruta to be
>>>>>>> "end-users", as they are developed within the UIMA project, and we are
>>>>>>> willing to evolve them together with UIMA core changes.
>>>>>>>
>>>>>>> We don't have a deadline for the next release, but it's mostly ready to
>>>>>>> go, and will solve a significant issue for people wanting to upgrade
>>>>>>> their Eclipse to Neon :-).
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 7/20/2016 5:03 AM, Peter Klügl wrote:
>>>>>>>> Ok, after looking at the code I must admit that there is much more to do
>>>>>>>> than I expected. We first need to discuss several things:
>>>>>>>>
>>>>>>>> - can we change the header at all?
>>>>>>>>
>>>>>>>> - do we support type system inclusion in the header?
>>>>>>>>
>>>>>>>> - do we support type system inclusion in the serialized files?
>>>>>>>>
>>>>>>>> - which serial format are which ones?
>>>>>>>>
>>>>>>>> - can we make the methods in CommonSerDes public?
>>>>>>>>
>>>>>>>> What is the deadline for the release? I am now quite loaded with work
>>>>>>>> until next Wednesday :-(
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On 19.07.2016 at 22:39, Marshall Schor wrote:
>>>>>>>>> Great.
>>>>>>>>>
>>>>>>>>> There's now also common code for writing / reading UIMA serialization
>>>>>>>>> headers, in
>>>>>>>>>
>>>>>>>>> CommonSerDes (in org.apache.uima.cas.impl)
>>>>>>>>>
>>>>>>>>> This includes the extensions to support versioning the serializations,
>>>>>>>>> which start to be needed in the next release because a bug fix is
>>>>>>>>> slightly changing the serialized form for **delta binary** CAS.
>>>>>>>>>
>>>>>>>>> So, it would be good to use that rather than have another separate
>>>>>>>>> header reader/writer to maintain.
>>>>>>>>>
>>>>>>>>> -Marshall
>>>>>>>>>
>>>>>>>>> On 7/19/2016 4:13 PM, Peter Klügl wrote:
>>>>>>>>>> Ah, I didn't know that enum. I'll adapt the code and enum.
>>>>>>>>>>
>>>>>>>>>> On 19.07.2016 at 20:09, Marshall Schor wrote:
>>>>>>>>>>> We already have an enum in the core for various serial formats.
>>>>>>>>>>> The class is
>>>>>>>>>>>
>>>>>>>>>>> public enum SerialFormat {
>>>>>>>>>>>   UNKNOWN,
>>>>>>>>>>>   XCAS,                  // with reachability filtering
>>>>>>>>>>>   XMI,                   // with reachability filtering
>>>>>>>>>>>   BINARY,                // no filtering
>>>>>>>>>>>   COMPRESSED,            // no filtering (form 4)
>>>>>>>>>>>   COMPRESSED_FILTERED,   // with reachability and type and feature filtering (form 6)
>>>>>>>>>>>   COMPRESSED_PROJECTION, // with subset of views
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> (I don't think COMPRESSED_PROJECTION is in use...)
>>>>>>>>>>>
>>>>>>>>>>> This has been around for maybe 3 years. I would be in favor of
>>>>>>>>>>> considering using and/or extending this as needed, rather than having
>>>>>>>>>>> two formats (that is, the proposed SerializationFormat class).
>>>>>>>>>>>
>>>>>>>>>>> -Marshall
>>>>>>>>>>>
>>>>>>>>>>> On 7/19/2016 2:49 AM, Peter Klügl wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> yes, the class should be officially available to external code.
>>>>>>>>>>>> I already included it in the CAS Editor and in Ruta. I also plan to
>>>>>>>>>>>> use it in our in-house code. I'll change the enforcer rule.
>>>>>>>>>>>>
>>>>>>>>>>>> I can write the docs but any help is welcome since I do not know how
>>>>>>>>>>>> much spare time I have for the rest of the week for this. I'll take a
>>>>>>>>>>>> look where the documentation should be added. Haven't looked at it
>>>>>>>>>>>> for some time ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> I just chose the name of the class Richard contributed since I
>>>>>>>>>>>> thought it is really suitable. Then, I also noticed the uimaFIT
>>>>>>>>>>>> class. This is not a really good situation, but I would not change
>>>>>>>>>>>> the name because of it.
>>>>>>>>>>>>
>>>>>>>>>>>> I would not split the API from the implementation. I do not see any
>>>>>>>>>>>> advantages right now. The class is just a simple utils class with
>>>>>>>>>>>> only static methods like CasCreationUtils (which is also not
>>>>>>>>>>>> separated).
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On 18.07.2016 at 22:26, Marshall Schor wrote:
>>>>>>>>>>>>> This is OK with me. I can even volunteer to write the docs (but am
>>>>>>>>>>>>> happy to let others do it :-) ).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll wait to hear about the split (if any) between the public API
>>>>>>>>>>>>> and the impl.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And, we'll need to change the next version # to 2.9.0, from 2.8.2,
>>>>>>>>>>>>> due to this being that kind of a change.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is everyone OK with all of this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/18/2016 2:39 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>> I believe the intention is that this class becomes part of the
>>>>>>>>>>>>>> public API.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, my understanding is that it would do a superset of what the
>>>>>>>>>>>>>> uimaFIT class by the same name does. We could then probably
>>>>>>>>>>>>>> deprecate the respective uimaFIT class and suggest using the core
>>>>>>>>>>>>>> class instead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Richard
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 18.07.2016, at 20:30, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is a new class added to uimaj-core project, in
>>>>>>>>>>>>>>> org.apache.uima.util package. This is fine if this is to be part
>>>>>>>>>>>>>>> of the official public APIs supported by UIMA going forward; but
>>>>>>>>>>>>>>> if that is the case, it should probably be documented in the UIMA
>>>>>>>>>>>>>>> docs, and we'd have to change the version number (due to enforcer
>>>>>>>>>>>>>>> rules).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this is more of an internal use utilities, then it should be
>>>>>>>>>>>>>>> in one of the internal use packages, such as
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> org.apache.uima.internal.util
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This class is similarly named to a uimaFIT class; are these
>>>>>>>>>>>>>>> related?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If some of the APIs are to be permanent and public and part of
>>>>>>>>>>>>>>> the official public APIs, but some are internal implementation
>>>>>>>>>>>>>>> details, please consider using an interface and an ".impl" (or
>>>>>>>>>>>>>>> equivalent) approach; packages which support these are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> org.apache.uima.util and
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> org.apache.uima.util.impl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this is only an internal kind of change, not intending to
>>>>>>>>>>>>>>> affect the official UIMA APIs, then moving to the internal.util
>>>>>>>>>>>>>>> package will fix the "enforcer" error the build is currently
>>>>>>>>>>>>>>> getting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>>>
