Currently, figuring out which types may be inferred, and under which
circumstances, involves digging through code. I think it would be useful
to have an API for expressing type inference rules.
Ideally this would be provided as utility functions alongside
StringConverter and used by anything which does type inference while
parsing/unboxing. In addition to simplifying implementation, this would
simplify documentation by providing a single inference mechanism to
summarize them all.

For purposes of discussion, type inference rules can be expressed as a
directed graph with vertices representing types and the edges indicating
fallback on failed conversion. For example, in the case of Arrow's CSV
reader, the graph is very simple:

  NULL -> INT64 -> DOUBLE -> TIMESTAMP -> STRING -> BINARY

This indicates that a column containing only values which can be converted
to null (NULL, null, N/A, and a few other strings are currently recognized)
will be an array of NullType. If the column contains values which can't be
converted to null then conversion to int64 is attempted. If that succeeds
then the column is an array of Int64Type, otherwise conversion to double is
attempted and so on.
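
To make the fallback behaviour concrete, here is a rough Python sketch of
the chain described above. It is not Arrow's implementation: the names
(NULL_SPELLINGS, CSV_CHAIN, infer_column_type) are hypothetical, the null
spellings are only an illustrative subset, and timestamps are assumed to be
whatever datetime.fromisoformat accepts. I also assume a null cell is
acceptable in a column of any type.

```python
from datetime import datetime

# Assumption: an illustrative subset of null spellings, not Arrow's actual list.
NULL_SPELLINGS = {b"", b"NULL", b"null", b"N/A"}


def _parse_int64(text):
    """Parse an integer and reject values outside the int64 range."""
    value = int(text)
    if not -2**63 <= value < 2**63:
        raise ValueError("out of int64 range")
    return value


def _accepts(parse):
    """Build a predicate: True if the raw cell is null or `parse` succeeds."""
    def accepts(raw):
        if raw in NULL_SPELLINGS:
            return True
        try:
            parse(raw.decode("utf-8"))
            return True
        except ValueError:  # also covers UnicodeDecodeError
            return False
    return accepts


# The fallback chain, in order. Each entry is (type name, predicate on a cell).
CSV_CHAIN = [
    ("null", lambda raw: raw in NULL_SPELLINGS),
    ("int64", _accepts(_parse_int64)),
    ("double", _accepts(float)),
    ("timestamp", _accepts(datetime.fromisoformat)),
    ("string", _accepts(str)),   # fails only if the bytes are not valid UTF-8
    ("binary", lambda raw: True),
]


def infer_column_type(cells, chain=CSV_CHAIN):
    """Return the first type in the chain that every cell converts to."""
    for name, accepts in chain:
        if all(accepts(cell) for cell in cells):
            return name
    raise ValueError("no type in the chain accepts every cell")
```

Each candidate type is checked against the whole column rather than
advancing cell by cell, so a column mixing "1.5" and "2020-01-01" ends up
as string instead of being wrongly labelled timestamp.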

By contrast, when reading JSON (which is explicit about numbers vs
strings), the graph would be:

  NULL -> BOOL
  NULL -> INT64 -> DOUBLE
  NULL -> TIMESTAMP -> STRING -> BINARY
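
A branching graph like this one can be sketched as an adjacency map plus
one predicate per vertex. Again this is only an illustration under my own
assumptions: JSON_GRAPH, ACCEPTS and infer_type are hypothetical names, the
values are taken to be already-decoded JSON values (None, bool, int, float,
str), branches are tried in the order listed, and a fallback vertex is only
visited after its parent has failed.

```python
from datetime import datetime


def _is_number(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def _is_int64(v):
    return isinstance(v, int) and not isinstance(v, bool) and -2**63 <= v < 2**63


def _is_timestamp(v):
    # Assumption: a timestamp is any string datetime.fromisoformat accepts.
    if not isinstance(v, str):
        return False
    try:
        datetime.fromisoformat(v)
        return True
    except ValueError:
        return False


# One predicate per vertex; nulls are handled separately in infer_type.
ACCEPTS = {
    "null": lambda v: v is None,
    "bool": lambda v: isinstance(v, bool),
    "int64": _is_int64,
    "double": _is_number,
    "timestamp": _is_timestamp,
    "string": lambda v: isinstance(v, str),
    "binary": lambda v: isinstance(v, (str, bytes)),
}

# Edges point from a type to the types to fall back on when it fails.
JSON_GRAPH = {
    "null": ["bool", "int64", "timestamp"],
    "int64": ["double"],
    "timestamp": ["string"],
    "string": ["binary"],
}


def infer_type(values, graph=JSON_GRAPH, accepts=ACCEPTS, root="null"):
    """Depth-first walk: return the first vertex accepting every value."""
    stack = [root]
    while stack:
        vertex = stack.pop()
        if all(v is None or accepts[vertex](v) for v in values):
            return vertex
        # Push fallbacks in reverse so they are tried in listed order.
        stack.extend(reversed(graph.get(vertex, [])))
    return None  # no vertex accepts every value
```

With this graph a column such as [1, "abc"] yields None rather than falling
back to string, since there is no edge from the numeric branch to STRING.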

Seem reasonable?
Is there a case which isn't covered by a fallback graph as above?
