[ https://issues.apache.org/jira/browse/AVRO-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068023#comment-14068023 ]
Sachin Goyal commented on AVRO-695:
-----------------------------------

{quote}
The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each record it reads to an array, and when a ref is read, return the corresponding element of the array.
{quote}

The current fix does use an IdentityHashMap to do this. Reference code in patch:
# GenericDatumWriter.java, line 40 and
# GenericDatumReader.java, line 46

Please correct me if I am wrong, but it appears the schema generated for a circular list should look somewhat like this:
{code:javascript}
{
  "type" : "record",
  "name" : "CircularList",
  "namespace" : "org.apache.avro.generic",
  "fields" : [ {
    "name" : "__crefId",
    "type" : "string"
  }, {
    "name" : "nodeData",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "next",
    "type" : [ "null", "CircularList", "string" ],
    "default" : null
  } ],
  "circularRefIdPrefix" : "__crefId"
}
{code}
(This is generated using the current patch)
\\
\\
Circular references could appear just about anywhere in the code. For example, in a family tree involving grandparents, uncles, aunts, cousins, children, grandchildren etc., circular references could be encountered on many branches going out from a single node.
Since we do not know which outgoing link will turn out to point to an already-traversed node, the *__crefId* field needs to be written in advance for each and every record.
Hence the need for a separate field in *each* record.
{code:javascript}
"fields" : [ {
    "name" : "__crefId",
    "type" : "string"
  },
  ....
{code}

Now, when we do encounter an already-traversed node, the node must be written as an ID.
Hence every record's type must be a union with string:
{code:javascript}
"type" : [ "null", "CircularList", "string" ]
{code}

I would be happy to consider other options if the above seems incorrect.
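For illustration, the quoted writer/reader idea can be sketched in plain Java as below. This is only a minimal sketch and not the code from the patch: the class {{CycleRefSketch}}, its {{Node}} type, the {{write}}/{{read}} methods and the "data:"/"ref:" token format are all hypothetical names made up for this example, and it assumes a singly linked list with non-null {{nodeData}} rather than Avro records.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these names come from the patch or from Avro. */
final class CycleRefSketch {

    /** A node of a possibly circular singly linked list. */
    static final class Node {
        final String nodeData;
        Node next;

        Node(String nodeData) {
            this.nodeData = nodeData;
        }
    }

    /**
     * Writer side: emits "data:<value>" the first time a node is seen and
     * "ref:<index>" when the next link points at a node written earlier.
     */
    static List<String> write(Node head) {
        // Identity-based map, so two equal-looking nodes are never confused.
        Map<Object, Integer> written = new IdentityHashMap<>();
        List<String> out = new ArrayList<>();
        for (Node current = head; current != null; current = current.next) {
            Integer index = written.get(current);
            if (index != null) {
                out.add("ref:" + index); // already written: emit a reference and stop
                return out;
            }
            written.put(current, written.size());
            out.add("data:" + current.nodeData);
        }
        out.add("null"); // the list ended without a cycle
        return out;
    }

    /** Reader side: keeps every node it creates in an array and resolves refs by index. */
    static Node read(List<String> tokens) {
        List<Node> readSoFar = new ArrayList<>();
        Node head = null;
        Node previous = null;
        for (String token : tokens) {
            Node target;
            if (token.startsWith("ref:")) {
                target = readSoFar.get(Integer.parseInt(token.substring(4)));
            } else if (token.equals("null")) {
                target = null;
            } else {
                target = new Node(token.substring(5)); // strip the "data:" prefix
                readSoFar.add(target);
            }
            if (previous == null) {
                head = target;
            } else {
                previous.next = target;
            }
            previous = target;
        }
        return head;
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        Node c = new Node("c");
        a.next = b;
        b.next = c;
        c.next = a; // close the cycle

        List<String> tokens = write(a);
        System.out.println(tokens); // [data:a, data:b, data:c, ref:0]

        Node copy = read(tokens);
        System.out.println(copy.next.next.next == copy); // true
    }
}
```

The sketch uses positional indexes as references. The schema shown earlier instead carries an explicit string ID ({{__crefId}}) in every record, which is what makes the union with "string" necessary on each record-typed field.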
If the above schema seems correct, +I will submit a patch without non-string map-keys+.
\\
\\
\\
[~martinkl],
Avro currently supports circular references in schemas, so supporting circular references in data should be a natural extension of that.
Also, circular references are very common in ORMs (like Hibernate/JPA) and in Java-based programs in general.
http://stackoverflow.com/questions/11007247/are-circular-references-in-jpa-an-antipattern
And parsers like Gson and Jackson support this feature too.

The serialized data from the above patch should work with all language implementations and also with Hive/Pig (because we are breaking the circular reference by changing it to an ID).
Please share if you think otherwise.


> Cycle Reference Support
> -----------------------
>
>                 Key: AVRO-695
>                 URL: https://issues.apache.org/jira/browse/AVRO-695
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>    Affects Versions: 1.7.6
>            Reporter: Moustapha Cherri
>         Attachments: avro-1.4.1-cycle.patch.gz, avro-1.4.1-cycle.patch.gz,
> avro_circular_references.zip, avro_circular_refs_2014_06_14.zip,
> circular_refs_and_nonstring_map_keys_2014_06_25.zip
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This is a proposed implementation to add cycle reference support to Avro. It
> basically introduces a new type named Cycle. A Cycle contains a string
> representing the path to the other reference.
> For example, suppose we have an object of type Message that has a member named
> previous, also of type Message, and we have this hierarchy:
> message
>   previous : message2
> message2
>   previous : message2
> When serializing, the cycle path for "message2.previous" will be "previous".
> The implementation depends on ANTLR to evaluate those cycles at read time to
> resolve them. I used ANTLR 3.2. This dependency is not mandated; I just used
> ANTLR to speed things up.
> I kept the ANTLR-generated code in this implementation, though this should
> not be the case, as it should be generated during the build. I only updated
> the Java code.
> I did not do full unit testing, but you can find the "avrotest.Main" class,
> which can be used as a preliminary test.
> Please do not hesitate to contact me for further clarification if this seems
> interesting.
> Best regards,
> Moustapha Cherri



--
This message was sent by Atlassian JIRA
(v6.2#6252)