I'll take a look now; thanks for the work! -Marshall
On 8/2/2016 7:40 AM, Peter Klügl wrote:
> Hi,
>
> the errors were on my side. Reading the CASes created by the unit test
> of CasIOUtils with uima 2.8.1 works fine now.
>
> Can I do something else for this ticket?
>
> Best,
>
> Peter
>
> On 25.07.2016 at 08:43, Peter Klügl wrote:
>> Yeah, I know java serialization.
>>
>> I think it depends on the perspective and the use case. I added a header
>> to the serialized outputs since I see them as binary formats and I
>> thought that all binary formats should get the same header. Then, I
>> removed it again, then I added it again. I will remove it again now.
>>
>> I don't think that we will get an optimal solution, e.g., the header is
>> read twice, the previous uimaj method should return the format and so
>> on. We should get this up and running for the release without breaking
>> backwards compatibility and then think what it should look like, and if
>> further functionality/refactoring is required.
>>
>> I used uimaj-core 2.8.1. Here are some errors:
>>
>> simpleCas.bins0
>> org.apache.uima.cas.CASRuntimeException: No sofaFS for specified sofaRef found.simpleCas.bins4
>>   at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:806)
>>   at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(FSIndexRepositoryImpl.java:2781)
>>   at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS(FSIndexRepositoryImpl.java:2763)
>>   at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(FSIndexRepositoryImpl.java:2068)
>>   at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(CASImpl.java:1765)
>>   at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1488)
>>   at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>>   at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>>   at tutorial.entity.LoadCas.main(LoadCas.java:55)
>> org.apache.uima.cas.CASRuntimeException: Error trying to read BLOB data from an input stream and deserialize into a CAS.
>>   at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1591)
>>   at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1344)
>>   at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>>   at tutorial.entity.LoadCas.main(LoadCas.java:39)
>>
>> simpleCas.bins6
>> java.io.EOFException
>>   at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
>>   at org.apache.uima.util.impl.DataIO.readVlong(DataIO.java:355)
>>   at org.apache.uima.cas.impl.BinaryCasSerDes6.readVlong(BinaryCasSerDes6.java:2193)
>>   at org.apache.uima.cas.impl.BinaryCasSerDes6.readDiff(BinaryCasSerDes6.java:2102)
>>   at org.apache.uima.cas.impl.BinaryCasSerDes6.readLongOrDouble(BinaryCasSerDes6.java:2128)
>>   at org.apache.uima.cas.impl.BinaryCasSerDes6.readByKind(BinaryCasSerDes6.java:1920)
>>   at org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1748)
>>   at org.apache.uima.cas.impl.BinaryCasSerDes6.deserialize(BinaryCasSerDes6.java:1596)
>>   at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:270)
>>   at tutorial.entity.LoadCas.main(LoadCas.java:47)
>>
>> On 22.07.2016 at 21:17, Marshall Schor wrote:
>>> I think the model for these two formats is more general than what you are
>>> imagining. These are formats that follow the standard Java serialization
>>> standard, see for example,
>>> https://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html
>>>
>>> The bytes corresponding to the serialized form are expected to (in general)
>>> be written anywhere in a data output stream, perhaps preceded or followed by
>>> (maybe many) other serialized objects; the overall format of that stream is
>>> up to the user designing it, including any headers the user might decide on.
>>>
>>> In the data output stream, each data object, including one representing the
>>> CAS, for example, has a format dictated by the Java standard for object
>>> serialization.
>>>
>>> What error do you get when you try to deserialize a CAS object in a data
>>> stream with an older version of UIMA?
>>>
>>> -Marshall
>>>
>>> On 7/22/2016 9:31 AM, Peter Klügl wrote:
>>>> So SERIALIZED and SERIALIZED_TS get no header?
>>>>
>>>> Can you try to deserialize the CAS files created by the unit test with
>>>> an older version of uima? I cannot get it to work.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> On 22.07.2016 at 15:18, Marshall Schor wrote:
>>>>> Re: The java-serialized formats now have also a binary header
>>>>>
>>>>> Not sure what you mean by java-serialized formats. Perhaps this means the
>>>>> formats created by using standard Java Object serialization on the special
>>>>> objects in UIMA built for this.
>>>>>
>>>>> If so, then it seems this would break backwards compatibility, in that a
>>>>> user serializing with UIMA 2.9.0, but not using any new features, could not
>>>>> have that "read" by an older version of UIMA.
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 7/22/2016 7:43 AM, Peter Klügl wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I changed CasIOUtils to use the Header and I extended the header with a
>>>>>> bit (0x08) indicating an included type system. No information about the
>>>>>> serialization of the type system yet. The java-serialized formats now
>>>>>> have also a binary header as I did not want to make the header
>>>>>> serializable as it should be read/written by the same functionality.
>>>>>>
>>>>>> I have thought that old UIMA versions (e.g., 2.8.1) should be able to
>>>>>> load new CAS files, but my tests failed. No idea yet why. I am overall
>>>>>> not very happy with the current solution, but I could live with it.
>>>>>>
>>>>>> Maybe someone wants to take a look at it?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 20.07.2016 at 14:30, Peter Klügl wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'll try to find the time to do these changes this week, next week
>>>>>>> latest.
>>>>>>>
>>>>>>> btw, input stream sniffing in order to distinguish XMI and XCAS is
>>>>>>> currently not supported. There could be a lot of text before the
>>>>>>> relevant element occurs, e.g., license text.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On 20.07.2016 at 14:19, Marshall Schor wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We can change the header, but:
>>>>>>>>
>>>>>>>> The changed header ought to be "readable" by previous versions of UIMA.
>>>>>>>>
>>>>>>>> For XMI and XCAS, these do not currently have special headers, and if we
>>>>>>>> added these, those formats could not be read by older versions of UIMA.
>>>>>>>> Those formats contain sufficient distinguishing initial strings to
>>>>>>>> distinguish them, though.
>>>>>>>>
>>>>>>>> The XMI format is specified, also, in an OASIS standard which the UIMA
>>>>>>>> project is said to (mostly) follow:
>>>>>>>> http://uima.apache.org/uima-specification.html
>>>>>>>>
>>>>>>>> For binary serializations, I think there's room in the header for an
>>>>>>>> extra bit, which if on, could indicate that a type system was included.
>>>>>>>> I think it would be good to have a header extension, when type systems
>>>>>>>> are included, to specify the format and version of the type system
>>>>>>>> serialization.
>>>>>>>>
>>>>>>>> Most serializations in core UIMA have not included the type system. The
>>>>>>>> one which does is CASCompleteSerializer. This is a "serializable" (using
>>>>>>>> standard Java serializations) object containing serializable forms of
>>>>>>>> the CAS and Type System.
>>>>>>>>
>>>>>>>> Regarding making methods in CommonSerDes public:
>>>>>>>>
>>>>>>>> It is fine to make them public in the sense that they are accessible from
>>>>>>>> other packages, not in a sub-type hierarchy. But I think it is best to not
>>>>>>>> include CommonSerDes in a package which is intended for end-users, because
>>>>>>>> the end user UIMA APIs should be (as much as possible) stable over a long
>>>>>>>> time period. Details of how we evolve headers, etc., should not disturb
>>>>>>>> end users, if possible; keeping these as public but in packages with names
>>>>>>>> like xxx.impl or xyz.internal.abc etc. is the way this has been
>>>>>>>> traditionally done. It allows us to evolve these without affecting
>>>>>>>> end-user APIs.
>>>>>>>>
>>>>>>>> Just to be clear: I would not consider uimaFIT and Ruta to be "end-users",
>>>>>>>> as they are developed within the UIMA project, and we are willing to
>>>>>>>> evolve them together with UIMA core changes.
>>>>>>>>
>>>>>>>> We don't have a deadline for the next release, but it's mostly ready to
>>>>>>>> go, and will solve a significant issue for people wanting to upgrade their
>>>>>>>> Eclipse to Neon :-).
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>> On 7/20/2016 5:03 AM, Peter Klügl wrote:
>>>>>>>>> Ok, after looking at the code I must admit that there is much more to do
>>>>>>>>> than I expected. We first need to discuss several things:
>>>>>>>>>
>>>>>>>>> - can we change the header at all?
>>>>>>>>>
>>>>>>>>> - do we support type system inclusion in the header?
>>>>>>>>>
>>>>>>>>> - do we support type system inclusion in the serialized files?
>>>>>>>>>
>>>>>>>>> - which serial format are which ones?
>>>>>>>>>
>>>>>>>>> - can we make the methods in CommonSerDes public?
>>>>>>>>>
>>>>>>>>> What is the deadline for the release?
>>>>>>>>> I am now quite loaded with work until next Wednesday :-(
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On 19.07.2016 at 22:39, Marshall Schor wrote:
>>>>>>>>>> Great.
>>>>>>>>>>
>>>>>>>>>> There's now also common code for writing / reading UIMA
>>>>>>>>>> serialization headers, in
>>>>>>>>>>
>>>>>>>>>> CommonSerDes (in org.apache.uima.cas.impl )
>>>>>>>>>>
>>>>>>>>>> This includes the extensions to support versioning the serializations,
>>>>>>>>>> which start to be needed in the next release because a bug fix is
>>>>>>>>>> slightly changing the serialized form for **delta binary** CAS.
>>>>>>>>>>
>>>>>>>>>> So, it would be good to use that rather than have another separate
>>>>>>>>>> header reader/writer to maintain.
>>>>>>>>>>
>>>>>>>>>> -Marshall
>>>>>>>>>>
>>>>>>>>>> On 7/19/2016 4:13 PM, Peter Klügl wrote:
>>>>>>>>>>> Ah, I didn't know that enum. I'll adapt the code and enum.
>>>>>>>>>>>
>>>>>>>>>>> On 19.07.2016 at 20:09, Marshall Schor wrote:
>>>>>>>>>>>> We already have an enum in the core for various serial formats.
>>>>>>>>>>>> The class is
>>>>>>>>>>>>
>>>>>>>>>>>> public enum SerialFormat {
>>>>>>>>>>>>   UNKNOWN,
>>>>>>>>>>>>   XCAS,                  // with reachability filtering
>>>>>>>>>>>>   XMI,                   // with reachability filtering
>>>>>>>>>>>>   BINARY,                // no filtering
>>>>>>>>>>>>   COMPRESSED,            // no filtering (form 4)
>>>>>>>>>>>>   COMPRESSED_FILTERED,   // with reachability and type and feature filtering (form 6)
>>>>>>>>>>>>   COMPRESSED_PROJECTION, // with subset of views
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> (I don't think COMPRESSED_PROJECTION is in use...)
>>>>>>>>>>>>
>>>>>>>>>>>> This has been around for maybe 3 years. I would be in favor of
>>>>>>>>>>>> considering using and/or extending this as needed, rather than having
>>>>>>>>>>>> two formats (that is, the proposed SerializationFormat class).
>>>>>>>>>>>>
>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/19/2016 2:49 AM, Peter Klügl wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> yes, the class should be officially available to external code. I
>>>>>>>>>>>>> already included it in the CAS Editor and in Ruta. I also plan to use
>>>>>>>>>>>>> it in our inhouse code. I'll change the enforcer rule.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can write the docs but any help is welcome since I do not know how
>>>>>>>>>>>>> much spare time I have for the rest of the week for this. I'll take a
>>>>>>>>>>>>> look where the documentation should be added. Haven't looked at it for
>>>>>>>>>>>>> some time ;-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just chose the name of the class Richard contributed since I thought
>>>>>>>>>>>>> it is really suitable. Then, I also noticed the uimaFIT class. This is
>>>>>>>>>>>>> not a really good situation, but I would not change the name because
>>>>>>>>>>>>> of it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would not split the API from the implementation. I do not see any
>>>>>>>>>>>>> advantages right now. The class is just a simple utils class with only
>>>>>>>>>>>>> static methods like CasCreationUtils (which is also not separated).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 18.07.2016 at 22:26, Marshall Schor wrote:
>>>>>>>>>>>>>> This is OK with me. I can even volunteer to write the docs (but am
>>>>>>>>>>>>>> happy to let others do it :-) ).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll wait to hear about the split (if any) between the public API and
>>>>>>>>>>>>>> the impl.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, we'll need to change the next version # to 2.9.0, from 2.8.2, due
>>>>>>>>>>>>>> to this being that kind of a change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is everyone OK with all of this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Marshall
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/18/2016 2:39 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>>> I believe the intention is that this class becomes part of the
>>>>>>>>>>>>>>> public API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, my understanding is that it would do a superset of what the
>>>>>>>>>>>>>>> uimaFIT class by the same name does. We could then probably deprecate
>>>>>>>>>>>>>>> the respective uimaFIT class and suggest using the core class instead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Richard
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 18.07.2016, at 20:30, Marshall Schor wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is a new class added to uimaj-core project, in
>>>>>>>>>>>>>>>> org.apache.uima.util package. This is fine if this is to be part of
>>>>>>>>>>>>>>>> the official public APIs supported by UIMA going forward; but if that
>>>>>>>>>>>>>>>> is the case, it should probably be documented in the UIMA docs, and
>>>>>>>>>>>>>>>> we'd have to change the version number (due to enforcer rules).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If this is more of an internal-use utility, then it should be in one
>>>>>>>>>>>>>>>> of the internal use packages, such as
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> org.apache.uima.internal.util
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This class is similarly named to a UIMAFit class; are these related?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If some of the APIs are to be permanent and public and part of the
>>>>>>>>>>>>>>>> official public APIs, but some are internal implementation details,
>>>>>>>>>>>>>>>> please consider using an interface and an ".impl" (or equivalent)
>>>>>>>>>>>>>>>> approach; packages which support these are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> org.apache.uima.util and
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> org.apache.uima.util.impl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If this is only an internal kind of change, not intending to affect
>>>>>>>>>>>>>>>> the official UIMA APIs, then moving to the internal.util package will
>>>>>>>>>>>>>>>> fix the "enforcer" error the build is currently getting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Marshall
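
For readers following the header discussion in this thread: a minimal Java sketch of how a single flag bit such as the 0x08 "type system included" bit could be set and tested in a header word. This is not the actual CommonSerDes layout. The class name, the second flag, and the choice to store the flags in one 4-byte big-endian int are all assumptions made only for illustration; only the value 0x08 comes from the thread.

```java
import java.nio.ByteBuffer;

/** Illustrative only: NOT the real UIMA / CommonSerDes header layout. */
final class HeaderFlagSketch {

  // 0x08 is the bit mentioned in the thread; where it sits inside a real
  // UIMA header word is an assumption of this sketch.
  static final int TYPE_SYSTEM_INCLUDED = 0x08;

  // Hypothetical flag, present only so the example has a second bit to combine.
  static final int HYPOTHETICAL_DELTA = 0x02;

  /** Returns true if the "type system included" bit is set. */
  static boolean hasTypeSystem(int flags) {
    return (flags & TYPE_SYSTEM_INCLUDED) != 0;
  }

  /** Returns the flags with the "type system included" bit switched on. */
  static int withTypeSystem(int flags) {
    return flags | TYPE_SYSTEM_INCLUDED;
  }

  /** Encodes the flag word as 4 big-endian bytes (hypothetical layout). */
  static byte[] writeHeader(int flags) {
    return ByteBuffer.allocate(4).putInt(flags).array();
  }

  /** Decodes the flag word written by writeHeader. */
  static int readHeader(byte[] header) {
    return ByteBuffer.wrap(header).getInt();
  }

  public static void main(String[] args) {
    int flags = withTypeSystem(HYPOTHETICAL_DELTA);
    int roundTripped = readHeader(writeHeader(flags));
    System.out.println("type system included: " + hasTypeSystem(roundTripped));
    System.out.println("other bit preserved:  " + ((roundTripped & HYPOTHETICAL_DELTA) != 0));
  }
}
```

A reader that only masks the bits it knows about would simply not see the new bit, which is one way an added flag can stay out of the way of older code. Whether older UIMA versions actually behave that way is not established by this sketch.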
