[GitHub] [arrow-julia] NHDaly opened a new issue #282: Allow constructing an Arrow stream/file from columnar data with no column names

GitBox Tue, 08 Feb 2022 21:10:29 -0800


NHDaly opened a new issue #282:
URL: https://github.com/apache/arrow-julia/issues/282

We have a data source (Relations from our database engine at RelationalAI)
that has _columnar data_ but no column names. (We represent a Relation
as a Set of Tuples, e.g. `movie_title` relates movie IDs to titles, so the
positions are meaningful but they do not have names.)

We would like to encode this in Arrow as essentially a Vector of columns. In
JSON, we would encode this as:
```json
[
[1001, 2232, 3582, 4030],
["The Matrix", "50 First Dates", "I Am Legend", "The Notebook"]
]
```

From what I can tell, this _is_ supported by the Arrow spec, but isn't
currently supported by the Arrow.jl package?

This is the understanding my colleague and I have come to of the current
situation:

- Looking at the Arrow spec, each RecordBatch message, containing the actual
data, is preceded by a Schema message, defining the logical schema of the
former. The Schema contains an array of Field types that define the columns of
the RecordBatch in proper order. The name property appears to be optional. That
would mean we could serialize columns without a name.
- https://github.com/apache/arrow/blob/56d060ca197352f575edced64e6a1fbc9331b336/format/Schema.fbs#L463
- The fields in the Schema message are flattened, see:
https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message
- Arrow.jl does support writing unnamed columns, but only if we supply the
data row-wise. On loading, the resulting Arrow schema then contains column
names such as `Symbol("1")`, which is a bit cumbersome to work with
in Julia.
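
For reference, here is a minimal sketch of the column-wise workaround that seems to be available today, assuming Arrow.jl and Tables.jl are both installed: attach placeholder names to the columns, write them as a column table, and read them back by position. The names `col1`, `col2` are hypothetical and chosen only for illustration. This does not produce a schema with truly unnamed fields, which is what this issue asks for.

```julia
using Arrow, Tables

# Columnar data with no column names, matching the JSON example above.
columns = Any[
    [1001, 2232, 3582, 4030],
    ["The Matrix", "50 First Dates", "I Am Legend", "The Notebook"],
]

# Hypothetical placeholder names (:col1, :col2, ...); not part of the data.
colnames = Tuple(Symbol("col", i) for i in eachindex(columns))

# A NamedTuple of vectors is a valid Tables.jl column table.
table = NamedTuple{colnames}(Tuple(columns))

# Write to an in-memory buffer, then read the Arrow data back.
io = IOBuffer()
Arrow.write(io, table)
seekstart(io)
loaded = Arrow.Table(io)

# Access columns by position, since the names carry no meaning here.
ids = Tables.getcolumn(loaded, 1)
titles = Tables.getcolumn(loaded, 2)
```

Reading by position keeps the placeholder names out of downstream code, but they are still stored in the Arrow schema.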

Can we work to expose this ability through the Arrow.jl package as well, in
the code to construct an Arrow stream from a column-wise data source?

Thanks!

