[ 
https://issues.apache.org/jira/browse/AVRO-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068023#comment-14068023
 ] 

Sachin Goyal commented on AVRO-695:
-----------------------------------

{quote}
The writer would add an entry to an IdentityHashMap<Object,Integer> for every 
sub-record it writes. Whenever it encounters a previously-written record, it 
writes a ref instead. Similarly, the reader would add each record it reads to 
an array, and when a ref is read, return the corresponding element of the array.
{quote}
The current fix does use an IdentityHashMap to do this. Reference code in patch:
# GenericDatumWriter.java, line 40 and 
# GenericDatumReader.java, line 46
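
As a rough illustration of the approach quoted above (this is not the patch code), the sketch below tracks written records in an IdentityHashMap and restores refs from a list on the reader side. The Node class and the "REC:" / "REF:" / "NULL" text tokens are hypothetical stand-ins for Avro records and the binary encoding, and it assumes a singly linked structure to stay short.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class RefTrackingSketch {
    /** Hypothetical stand-in for a record; not an Avro class. */
    static final class Node {
        final String data;
        Node next;
        Node(String data) { this.data = data; }
    }

    /** Writer: "REC:<data>" on first visit, "REF:<index>" on a repeat, "NULL" at the end of a non-circular list. */
    static List<String> write(Node start) {
        // Identity semantics: two distinct nodes with equal data are still distinct records.
        Map<Node, Integer> seen = new IdentityHashMap<>();
        List<String> out = new ArrayList<>();
        Node current = start;
        while (current != null) {
            Integer index = seen.get(current);
            if (index != null) {
                out.add("REF:" + index);   // already written: emit a ref and stop
                return out;
            }
            seen.put(current, seen.size());
            out.add("REC:" + current.data);
            current = current.next;
        }
        out.add("NULL");
        return out;
    }

    /** Reader: every record read is appended to a list; a ref returns the element at that index. */
    static Node read(List<String> tokens) {
        List<Node> byIndex = new ArrayList<>();
        Node previous = null;
        for (String token : tokens) {
            boolean isRecord = token.startsWith("REC:");
            Node target = null;
            if (isRecord) {
                target = new Node(token.substring(4));
                byIndex.add(target);
            } else if (token.startsWith("REF:")) {
                target = byIndex.get(Integer.parseInt(token.substring(4)));
            }
            if (previous != null) {
                previous.next = target;
            }
            previous = isRecord ? target : null;
        }
        return byIndex.isEmpty() ? null : byIndex.get(0);
    }
}
```

For a two-node cycle a -> b -> a, write produces the tokens REC:a, REC:b, REF:0, and read rebuilds a graph in which the second node points back to the first.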

Please correct me if I am wrong, but it appears the schema generated for a 
circular list should look somewhat like this:
{code:javascript}
{
  "type" : "record",
  "name" : "CircularList",
  "namespace" : "org.apache.avro.generic",
  "fields" : [ {
    "name" : "__crefId",
    "type" : "string"
  }, {
    "name" : "nodeData",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "next",
    "type" : [ "null", "CircularList", "string" ],
    "default" : null
  } ],
  "circularRefIdPrefix" : "__crefId"
}
{code}
(This was generated using the current patch.)

\\
\\
Circular references can occur anywhere in the data.
For example, in a family tree involving grandparents, uncles, aunts, cousins, 
children, grandchildren, etc., circular references could be encountered along 
many of the branches going out from a single node.

Since we do not know which outgoing link would reveal itself as an 
already-traversed-node, the *__crefId* field needs to be written in advance for 
each and every record. Hence the need for a separate field in *each* record.
{code:javascript}
"fields" : [ {
    "name" : "__crefId",
    "type" : "string"
  }, ....
{code}

Now, when we do encounter an already-traversed node, that node must be written 
as an ID. Hence the type of every field that holds a record must be a union 
that includes string:
{code:javascript}
    "type" : [ "null", "CircularList", "string" ]
{code}
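
To make the two points above concrete, here is a minimal sketch (again, not the patch code) that turns a circular list into nested map-based records. Each record gets its "__crefId" before its outgoing link is followed, and "next" is null, a nested record, or the string ID of an already-traversed node. The CircularList class, the use of plain maps instead of Avro generic records, and the counter-based ID values are all assumptions for illustration; the ID format used by the patch may differ.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class CrefIdSketch {
    /** Hypothetical in-memory node mirroring the CircularList schema fields. */
    static final class CircularList {
        String nodeData;
        CircularList next;
        CircularList(String nodeData) { this.nodeData = nodeData; }
    }

    /** Converts the list starting at head into nested records keyed by field name. */
    static Map<String, Object> toRecord(CircularList head) {
        return toRecord(head, new IdentityHashMap<>());
    }

    private static Map<String, Object> toRecord(CircularList node, Map<CircularList, String> ids) {
        Map<String, Object> record = new LinkedHashMap<>();
        // The ID is assigned up front, because we cannot know yet whether some
        // later link will lead back to this node. (Assumed ID format: a counter.)
        String id = String.valueOf(ids.size() + 1);
        ids.put(node, id);
        record.put("__crefId", id);
        record.put("nodeData", node.nodeData);

        Object next;
        if (node.next == null) {
            next = null;                        // union branch "null"
        } else if (ids.containsKey(node.next)) {
            next = ids.get(node.next);          // union branch "string": ID of a traversed node
        } else {
            next = toRecord(node.next, ids);    // union branch "CircularList": nested record
        }
        record.put("next", next);
        return record;
    }
}
```

For a two-node cycle, the outer record carries ID "1", its "next" is a nested record with ID "2", and that record's "next" is the string "1" rather than another nested record, which is what breaks the cycle.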

I would be happy to consider other options if the above seems incorrect.
If it seems correct, +I will submit a patch without non-string map-keys+.


\\
\\
\\
[~martinkl], Avro already supports circular references in schemas, so 
supporting circular references in data should be a natural extension of that.

Also, circular references are very common in ORMs (like Hibernate/JPA) and in 
Java-based programs in general:
http://stackoverflow.com/questions/11007247/are-circular-references-in-jpa-an-antipattern

And parsers like Gson and Jackson support this feature too.

The serialized data from the above patch should work with all language 
implementations and also with Hive/Pig (because we are breaking the circular 
reference by changing it to an ID).
Please share if you think otherwise.


> Cycle Reference Support
> -----------------------
>
>                 Key: AVRO-695
>                 URL: https://issues.apache.org/jira/browse/AVRO-695
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>    Affects Versions: 1.7.6
>            Reporter: Moustapha Cherri
>         Attachments: avro-1.4.1-cycle.patch.gz, avro-1.4.1-cycle.patch.gz, 
> avro_circular_references.zip, avro_circular_refs_2014_06_14.zip, 
> circular_refs_and_nonstring_map_keys_2014_06_25.zip
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This is a proposed implementation to add cycle reference support to Avro. It 
> basically introduces a new type named Cycle. A Cycle contains a string 
> representing the path to the other reference.
> For example, suppose we have an object of type Message that has a member named 
> previous, also of type Message, and we have this hierarchy:
> message
>   previous : message2
> message2
>   previous : message2
> When serializing, the cycle path for "message2.previous" will be "previous".
> The implementation depends on ANTLR to evaluate those cycles at read time in 
> order to resolve them. I used ANTLR 3.2. This dependency is not mandatory; I 
> just used ANTLR to speed things up. I kept the code generated by ANTLR in this 
> implementation, though it should really be generated during the build. I only 
> updated the Java code.
> I did not do full unit testing, but you can find the "avrotest.Main" class, 
> which can be used as a preliminary test.
> Please do not hesitate to contact me for further clarification if this seems 
> interesting.
> Best regards,
> Moustapha Cherri



--
This message was sent by Atlassian JIRA
(v6.2#6252)
