Re: deserializing the same CAS twice shouldn't change the addresses; if you have a case where it's doing that, I'll investigate (need a small test case...).
-Marshall

On 9/2/2016 5:36 AM, Peter Klügl wrote:
> Same here.
>
> It looks like we are now also starting to use the address, and I am
> also thinking of using it more in Ruta (internal indexing).
>
> Btw, I did some simple experiments lately concerning the stability of
> the addresses when using CasIOUtils. Can it happen that the addresses
> change if you just deserialize the same CAS twice without serializing it
> in between?
>
> Best,
>
> Peter
>
> On 01.09.2016 at 19:29, Richard Eckart de Castilho wrote:
>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e.
>> out-of-type-system) unique identifiers for feature structures facilitates
>> handling them, e.g. in editors. We use that quite a bit in WebAnno.
>>
>> In WebAnno, we do not rely on any heap arithmetic - an ID is just expected
>> to be a unique identifier. However, I could imagine cases where people might
>> rely on the ID to increment monotonically for new FSes.
>>
>> Most binary formats do not preserve the ID across a save/load cycle.
>> However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno
>> makes use of that. It allows keeping references to FSes without having to
>> keep the CAS in memory all the time.
>>
>> There should continue to be a V3 serialization format which preserves IDs
>> across a load/save cycle.
>>
>> I do not presently see a case where a strong similarity between V2 and V3
>> IDs would be important. It would be nice if deserializing a V2 SERIALIZED or
>> SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an easy
>> thing to do.
>>
>> Cheers,
>>
>> -- Richard
>>
>>> On 01.09.2016, at 16:09, Marshall Schor wrote:
>>>
>>> The UIMA V3 implementation includes, in many places, extra code (which
>>> takes time / space) whose goal is to make things look closer to version 2.
>>> Some of this is for interoperability with version 2 artifacts, like
>>> serialized forms.
>>>
>>> An example: in v2, many serialization forms include "references" to other
>>> Feature Structures (FSs), and for those, the encoding is the "address" in
>>> the heap of the FS.
>>>
>>> In v3, there is no heap, but the FSs have "ids", which are (at the moment)
>>> an int which increments by 1. This mismatches the "address" in v2, so many
>>> parts of the serialization code build a map at serialization time from the
>>> v3 ids to v2 "addresses", and use the latter in the serialization form.
>>>
>>> Currently, this is done for various binary serializations, so that these
>>> can be read back in by v2 code.
>>>
>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't
>>> checked). So the serialized forms for these differ between v2 and v3, in
>>> that the numbers used to represent references to other FSs are different.
>>>
>>> The deserialization code for XMI and JSON doesn't depend on these numbers
>>> being anything other than unique per FS, so there's no issue in
>>> deserializing. But the UIMA community may have built other things that
>>> depend on these identifiers not changing.
>>>
>>> What's your opinion: should the XMI and JSON etc. serialization in V3 be
>>> changed to reproduce (approximately) the same reference numbers as v2? I
>>> say approximately, because other factors might affect these, such as the
>>> ordering for things not in "ordered" indexes, etc. between v2 and v3.
>>>
>>> -Marshall
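
For anyone following the thread who wants a concrete picture of the id-to-"address" map described in the quoted message, here is a minimal, self-contained sketch. It does not use any UIMA classes and is not the actual v3 serialization code: the class `IdToAddressSketch`, the `Fs` stand-in and the method `buildIdToAddressMap` are all hypothetical names. It also assumes a simplified heap layout in which address 0 is reserved for the null reference and each FS occupies one cell for its type code plus one cell per feature.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdToAddressSketch {

    /** Hypothetical stand-in for a v3 feature structure: an id plus a feature count. */
    public static final class Fs {
        final int id;            // v3-style id (an int that increments by 1)
        final int featureCount;  // number of feature slots the FS would need on a heap

        public Fs(int id, int featureCount) {
            this.id = id;
            this.featureCount = featureCount;
        }
    }

    /**
     * Walks the FSs in serialization order and assigns each one the heap
     * address it would have under the simplified layout assumed here.
     */
    public static Map<Integer, Integer> buildIdToAddressMap(List<Fs> fssInSerializationOrder) {
        Map<Integer, Integer> idToAddress = new LinkedHashMap<>();
        int nextAddress = 1; // assumption: address 0 is reserved for "null"
        for (Fs fs : fssInSerializationOrder) {
            idToAddress.put(fs.id, nextAddress);
            // assumption: one cell for the type code plus one per feature
            nextAddress += 1 + fs.featureCount;
        }
        return idToAddress;
    }

    public static void main(String[] args) {
        List<Fs> fss = Arrays.asList(new Fs(1, 2), new Fs(2, 0), new Fs(3, 4));
        // References would be written using the mapped addresses, not the ids.
        System.out.println(buildIdToAddressMap(fss));
    }
}
```

The point of the sketch is only that the ids and the addresses diverge as soon as an FS has features, which is why a serialized form that writes raw v3 ids produces different reference numbers than v2 would.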
