[ https://issues.apache.org/jira/browse/AVRO-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Kulp updated AVRO-1922:
------------------------------
    Fix Version/s:     (was: 1.9.0)

> Fixed dimension for array
> -------------------------
>
>                 Key: AVRO-1922
>                 URL: https://issues.apache.org/jira/browse/AVRO-1922
>             Project: Apache Avro
>          Issue Type: New Feature
>          Components: spec
>    Affects Versions: 1.8.1
>            Reporter: Jim Pivarski
>            Priority: Major
>
> This is a feature request for future versions of the Avro specification.
> We have found one kind of data structure that is hard to express in Avro:
> tensors. Although we can (and do) build matrices as {"type": "array",
> "items": {"type": "array", "items": "double"}}, this type does not specify
> that the grid of numbers is rectangular. We believe that rectangular arrays
> of numbers (or other nested types) would be a strong addition to Avro, both
> as a type system and as a serialization format. With the size of each
> dimension fixed in the schema, the sizes would not need to be repeated in
> each serialized datum.
> For instance, suppose there was an extension of type "array" to specify
> dimensions:
> {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}
> This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann
> curvature tensor in 3-space) specifies that 81 double-precision numbers
> (3*3*3*3) are expected for each datum. With nested arrays, the size "3"
> would have to be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for each
> datum, even though it would never change in a dataset of Riemann tensors.
> With a "dimensions" attribute in the schema, only the content needs to be
> serialized. Moreover, this extension can clearly be used with any other
> "items" type, to make dense tables of strings, for instance.
> Avro has been extended in a similar way in the past. The "fixed" type is a
> "bytes" whose length is declared once in the schema rather than encoded
> with each datum. Our proposal provides a similar packing for structured
> objects that can be significant for large numbers of dimensions, as shown
> above. The advantage to consumers of Avro data is that we can write
> functions that do not need to check all array sizes at runtime (for
> operations like tensor contractions and products).
> We have searched the web and the Avro JIRA site for similar proposals and
> found none, so we're adding this proposal to JIRA in addition to this e-mail.
> Please let us know if you have any comments, or if we can provide any more
> information.
> Thank you!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
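
The size arithmetic in the issue description can be checked with a short Python sketch. It is illustrative only and not part of Avro or of the proposal text. It assumes the standard Avro binary encoding for arrays (a zigzag varint item count, the items, then a zero count that ends the array) and for doubles (8 bytes, little-endian). The function encode_fixed_dims is hypothetical: it shows what an items-only encoding under a "dimensions" attribute might look like, since the issue does not define a wire format. All function names here are made up for the example.

```python
import struct
from math import prod


def encode_long(n: int) -> bytes:
    """Avro 'long': zigzag-encode, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_nested(value) -> bytes:
    """Encode nested lists of floats as nested Avro arrays of double."""
    if isinstance(value, list):
        body = b"".join(encode_nested(v) for v in value)
        # One block: item count, the items, then a zero count ending the array.
        head = encode_long(len(value)) if value else b""
        return head + body + encode_long(0)
    return struct.pack("<d", value)


def encode_fixed_dims(flat) -> bytes:
    """Hypothetical items-only encoding for the proposed 'dimensions' attribute."""
    return b"".join(struct.pack("<d", x) for x in flat)


def nested_array_count(dims) -> int:
    """Number of arrays (and so encoded sizes) in the nested representation."""
    return sum(prod(dims[:k]) for k in range(len(dims)))


def build(dims, fill=0.0):
    """Build a rectangular nested list with the given dimensions."""
    if not dims:
        return fill
    return [build(dims[1:], fill) for _ in range(dims[0])]


dims = [3, 3, 3, 3]
nested = encode_nested(build(dims))
flat = encode_fixed_dims([0.0] * prod(dims))

print(nested_array_count(dims))  # 40
print(len(nested), len(flat))    # 728 648
```

Under these assumptions the nested form spends 80 bytes per datum on structure (40 one-byte counts plus 40 one-byte end markers) on top of the 648 bytes of content, which is the overhead a schema-level "dimensions" attribute would remove.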