[ 
https://issues.apache.org/jira/browse/AVRO-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Kulp updated AVRO-1922:
------------------------------
    Fix Version/s:     (was: 1.9.0)

> Fixed dimension for array
> -------------------------
>
>                 Key: AVRO-1922
>                 URL: https://issues.apache.org/jira/browse/AVRO-1922
>             Project: Apache Avro
>          Issue Type: New Feature
>          Components: spec
>    Affects Versions: 1.8.1
>            Reporter: Jim Pivarski
>            Priority: Major
>
> This is a feature request for future versions of the Avro specification.
> We have found one kind of data structure that is hard to express in Avro: 
> tensors. Although we can (and do) build matrices as {"type": "array", 
> "items": {"type": "array", "items": "double"}}, this type does not specify 
> that the grid of numbers is rectangular. We believe that rectangular arrays 
> of numbers (or other nested types) would be a strong addition to Avro, both 
> as a type system and as a serialization format. With the size of every 
> dimension fixed in the schema, those sizes would not need to be repeated in 
> each serialized datum.
> For instance, suppose there were an extension of the "array" type that 
> specifies dimensions:
>     {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}
> This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann 
> curvature tensor in 3-space) specifies that 81 double-precision numbers 
> (3*3*3*3) are expected for each datum. With nested arrays, the size 3 would 
> have to be encoded separately 40 times (1 + 3*(1 + 3*(1 + 3))), once per 
> array at every nesting level, for each datum, even though it would never 
> change in a dataset of Riemann tensors. 
> With a "dimensions" attribute in the schema, only the content needs to be 
> serialized. Moreover, this extension can clearly be used with any other 
> "items" type, to make dense tables of strings, for instance.
> Avro has been extended in a similar way in the past. The "fixed" type is a 
> "bytes" whose length is declared in the schema, so the number of bytes does 
> not need to be written with each datum. Our proposal provides a similar 
> packing for structured objects, and the savings can be significant for large 
> numbers of dimensions, as shown above. The advantage to 
> consumers of Avro data is that we can write functions that do not need to 
> check all array sizes at runtime (for operations like tensor contractions and 
> products).
> We have searched the web and the Avro JIRA site for similar proposals and 
> found none, so we're adding this proposal to JIRA in addition to this e-mail. 
> Please let us know if you have any comments, or if we can provide any more 
> information.
> Thank you!
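
The counting argument in the description can be checked with a short sketch. This is only an illustration of the arithmetic, not part of the proposal and not an Avro API; the function names `nested_length_prefixes` and `item_count` are hypothetical, and the "dimensions" list mirrors the attribute suggested above.

```python
from math import prod


def nested_length_prefixes(dimensions):
    """Count the arrays (and so the encoded sizes) that nested arrays need.

    One outer array, then one array per item at every deeper nesting level.
    """
    count = 0
    arrays_at_level = 1  # a single outermost array
    for size in dimensions:
        count += arrays_at_level   # each array at this level carries its size
        arrays_at_level *= size    # each of them holds `size` arrays below it
    return count


def item_count(dimensions):
    """Number of leaf items a fixed-dimension array would hold."""
    return prod(dimensions)


dims = [3, 3, 3, 3]  # the 3-by-3-by-3-by-3 example from the description
print(nested_length_prefixes(dims))  # 40
print(item_count(dims))              # 81
```

With the proposed "dimensions" attribute, only the 81 items would be written for each datum. Note that in Avro's binary encoding every array is also terminated by a zero-count block, so the real overhead of the nested form is somewhat larger than the 40 sizes counted here.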



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
