Yes, good idea :-) I'll change this in v3 so that the id is more likely to correspond to the v2 one. I suspect the performance impact will be unnoticeable.
-Marshall

On 9/2/2016 8:17 AM, Burn Lewis wrote:
> Could the id assigned in V3 be the same as the V2 address, as if the offset
> in a heap? Unique and monotonically increasing.
>
> Burn
>
> On Fri, Sep 2, 2016 at 5:36 AM, Peter Klügl <[email protected]>
> wrote:
>
>> Same here.
>>
>> It looks like we are now also starting to use the address, and I am
>> also thinking of using it more in Ruta (internal indexing).
>>
>> Btw, I did some simple experiments lately concerning the stability of
>> the addresses when using CasIOUtils. Can it happen that the addresses
>> change if you just deserialize the same CAS twice without serializing it
>> in between?
>>
>> Best,
>>
>> Peter
>>
>> On 01.09.2016 at 19:29, Richard Eckart de Castilho wrote:
>>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e.
>>> out-of-type-system) unique identifiers for feature structures facilitates
>>> handling them, e.g. in editors. We use that quite a bit in WebAnno.
>>>
>>> In WebAnno, we do not rely on any heap arithmetic - an ID is just
>>> expected to be a unique identifier. However, I could imagine cases where
>>> people might rely on the ID to increment monotonically for new FSes.
>>>
>>> Most binary formats do not preserve the ID across a save/load cycle.
>>> However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno
>>> makes use of that. It allows keeping references to FSes without having to
>>> keep the CAS in memory all the time.
>>>
>>> There should continue to be a V3 serialization format which preserves
>>> IDs across a load/save cycle.
>>>
>>> I do not presently see a case where a strong similarity between V2 and
>>> V3 IDs would be important. It would be nice if deserializing a V2
>>> SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it
>>> to be an easy thing to do.
>>>
>>> Cheers,
>>>
>>> -- Richard
>>>
>>>> On 01.09.2016, at 16:09, Marshall Schor <[email protected]> wrote:
>>>>
>>>> The UIMA V3 implementation includes, in many places, extra code (which
>>>> takes time / space) whose goal is to make things look closer to version 2.
>>>> Some of this is for interoperability with version 2 artifacts, like
>>>> serialized forms.
>>>>
>>>> An example: in v2, many serialization forms include "references" to other
>>>> Feature Structures (FSs), and for those, the encoding is the "address" in
>>>> the heap of the FS.
>>>>
>>>> In v3, there is no heap, but the FSs have "ids", which are (at the
>>>> moment) an int which increments by 1. This does not match the "address"
>>>> in v2, so many parts of the serialization code build a map at
>>>> serialization time from the v3 ids to v2 "addresses", and use the latter
>>>> in the serialized form.
>>>>
>>>> Currently, this is done for various binary serializations, so that these
>>>> can be read back in by v2 code.
>>>>
>>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't
>>>> checked). So the serialized forms for these differ between v2 and v3, in
>>>> that the numbers used to represent references to other FSs are different.
>>>>
>>>> The deserialization code for XMI and JSON doesn't depend on these numbers
>>>> being anything other than unique per FS, so there's no issue in
>>>> deserializing. But the UIMA community may have built other things that
>>>> depend on these identifiers not changing.
>>>>
>>>> What's your opinion: should the XMI and JSON etc. serialization in V3 be
>>>> changed to reproduce (approximately) the same reference numbers as v2? I
>>>> say approximately, because other factors might affect these, such as the
>>>> ordering for things not in "ordered" indexes, etc., between v2 and v3.
>>>>
>>>> -Marshall
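
For readers following the thread, here is a minimal sketch of the two ideas discussed above: the serialization-time map from sequential v3 ids to v2-style heap "addresses", and the alternative of assigning ids as if they were heap offsets in the first place. It is plain Java and does not use any UIMA classes. The class and method names, the per-FS slot counts, and the convention that slot 0 is reserved for a null reference are all assumptions made for illustration, not the actual UIMA implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these names are UIMA API. */
final class IdToAddressSketch {

    /**
     * Maps sequential v3-style ids (listed in creation order) to v2-style
     * heap offsets. sizesInSlots[i] is the assumed number of heap slots the
     * i-th feature structure would occupy in a v2 heap.
     */
    static Map<Integer, Integer> buildAddressMap(int[] ids, int[] sizesInSlots) {
        if (ids.length != sizesInSlots.length) {
            throw new IllegalArgumentException("ids and sizes must have the same length");
        }
        Map<Integer, Integer> idToAddress = new LinkedHashMap<>();
        int nextAddress = 1; // assumption: slot 0 is reserved to mean "null reference"
        for (int i = 0; i < ids.length; i++) {
            idToAddress.put(ids[i], nextAddress);
            nextAddress += sizesInSlots[i]; // the next FS starts after this one
        }
        return idToAddress;
    }

    /**
     * The alternative suggested in the thread: hand out ids that already
     * look like heap offsets, so no map is needed when serializing. The ids
     * stay unique and monotonically increasing, but are no longer dense.
     */
    static final class HeapLikeIdAllocator {
        private int next = 1; // same assumption about slot 0 as above

        int allocate(int sizeInSlots) {
            if (sizeInSlots <= 0) {
                throw new IllegalArgumentException("size must be positive");
            }
            int id = next;
            next += sizeInSlots;
            return id;
        }
    }

    public static void main(String[] args) {
        int[] ids = {1, 2, 3};
        int[] sizes = {3, 5, 2};
        System.out.println(buildAddressMap(ids, sizes)); // prints {1=1, 2=4, 3=9}

        HeapLikeIdAllocator allocator = new HeapLikeIdAllocator();
        for (int size : sizes) {
            System.out.println(allocator.allocate(size)); // prints 1, then 4, then 9
        }
    }
}
```

With the same creation order and the same slot counts, both approaches yield the same numbers. The allocator does the arithmetic once per FS at creation time instead of building a map on every serialization, which is consistent with the expectation at the top of the thread that the performance impact would be small.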
