whew! -M
On 9/2/2016 9:27 AM, Peter Klügl wrote:
> Tested all formats; it did not happen for a reasonably complex CAS.
>
> On 02.09.2016 at 15:26, Marshall Schor wrote:
>> Re: deserializing the same CAS twice shouldn't change the addresses; if you
>> have a case where it's doing that, I'll investigate (need a small test
>> case...).
>>
>> -Marshall
>>
>> On 9/2/2016 5:36 AM, Peter Klügl wrote:
>>> Same here.
>>>
>>> It looks like we are now also starting to use the address, and I am
>>> also thinking of using it more in Ruta (internal indexing).
>>>
>>> Btw, I did some simple experiments lately concerning the stability of
>>> the addresses when using CasIOUtils. Can it happen that the addresses
>>> change if you just deserialize the same CAS twice without serializing it
>>> in between?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 01.09.2016 at 19:29, Richard Eckart de Castilho wrote:
>>>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e.
>>>> out-of-type-system) unique identifiers for feature structures facilitates
>>>> handling them, e.g. in editors. We use that quite a bit in WebAnno.
>>>>
>>>> In WebAnno, we do not rely on any heap arithmetic - an ID is just
>>>> expected to be a unique identifier. However, I could imagine cases where
>>>> people might rely on the ID to increment monotonically for new FSes.
>>>>
>>>> Most binary formats do not preserve the ID across a save/load cycle.
>>>> However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno
>>>> makes use of that. It allows keeping references to FSes without having to
>>>> keep the CAS in memory all the time.
>>>>
>>>> There should continue to be a V3 serialization format which preserves IDs
>>>> across a load/save cycle.
>>>>
>>>> I do not presently see a case where a strong similarity between V2 and V3
>>>> IDs would be important. It would be nice if deserializing a V2 SERIALIZED
>>>> or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an
>>>> easy thing to do.
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>>>
>>>>> On 01.09.2016, at 16:09, Marshall Schor <[email protected]> wrote:
>>>>>
>>>>> The UIMA V3 implementation includes in many places extra code (which
>>>>> takes time / space) whose goal is to make things look closer to
>>>>> version 2. Some of this is for interoperability with version 2
>>>>> artifacts, like serialized forms.
>>>>>
>>>>> An example: in v2, many serialization forms include "references" to other
>>>>> Feature Structures (FSs), and for those, the encoding is the "address" in
>>>>> the heap of the FS.
>>>>>
>>>>> In v3, there is no heap, but the FSs have "ids", which are (at the
>>>>> moment) an int which increments by 1. This does not match the "address"
>>>>> in v2, so many parts of the serialization code build a map at
>>>>> serialization time from the v3 ids to v2 "addresses", and use the
>>>>> latter in the serialized form.
>>>>>
>>>>> Currently, this is done for various binary serializations, so that these
>>>>> can be read back in by v2 code.
>>>>>
>>>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't
>>>>> checked). So the serialized forms for these differ between v2 and v3,
>>>>> in that the numbers used to represent references to other FSs are
>>>>> different.
>>>>>
>>>>> The deserialization code for XMI and JSON doesn't depend on these numbers
>>>>> being anything other than unique per FS, so there's no issue in
>>>>> deserializing. But the UIMA community may have built other things that
>>>>> depend on these identifiers not changing.
>>>>>
>>>>> What's your opinion: should the XMI and JSON etc. serialization in V3 be
>>>>> changed to reproduce (approximately) the same reference numbers as v2?
>>>>> I say approximately, because other factors might affect these, such as
>>>>> the ordering for things not in "ordered" indexes, etc., between v2 and v3.
>>>>>
>>>>> -Marshall
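
For readers following the thread, the id-to-address mapping described in the original message might look roughly like the sketch below. This is not the UIMA implementation: the `Fs` class, the `buildAddressMap` method, and the layout rule (one slot for the type code plus one slot per feature, with offset 0 reserved for the null reference) are all simplifying assumptions made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdToAddressSketch {

    /** Hypothetical stand-in for a feature structure: a v3-style id plus a feature count. */
    static final class Fs {
        final int id;
        final int featureCount;

        Fs(int id, int featureCount) {
            this.id = id;
            this.featureCount = featureCount;
        }
    }

    /**
     * Walks the FSs in serialization order and assigns each one the offset it
     * would occupy if the FSs were laid out back to back in a v2-style heap.
     * Assumption: each FS takes 1 slot for its type code plus 1 slot per
     * feature, and offset 0 is reserved for the null reference.
     */
    static Map<Integer, Integer> buildAddressMap(List<Fs> fsInOrder) {
        Map<Integer, Integer> idToAddress = new LinkedHashMap<>();
        int nextAddress = 1; // 0 stands for "null" in this sketch
        for (Fs fs : fsInOrder) {
            idToAddress.put(fs.id, nextAddress);
            nextAddress += 1 + fs.featureCount;
        }
        return idToAddress;
    }

    public static void main(String[] args) {
        // Three FSs with consecutive v3-style ids and differing sizes.
        List<Fs> fss = List.of(new Fs(1, 2), new Fs(2, 0), new Fs(3, 4));
        System.out.println(buildAddressMap(fss)); // prints {1=1, 2=4, 3=5}
    }
}
```

The point of the sketch is only that consecutive ids (1, 2, 3) and heap-style addresses (1, 4, 5) diverge as soon as FSs have different sizes, which is why a map has to be built at serialization time if the v2 numbers are to be reproduced.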
