Hello,

With JSON and other "typed" formats (msgpack, protobuf, ...) you need to
take unions into account, e.g.

{"a": "herp", "b": 10}
{"a": true, "c": "derp"}

The type for `a` would be union<string, bool>.
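
To make that concrete, here is a rough sketch in Python of inferring a
per-field type across records and widening to a union when the observed
types disagree. All names here (json_type, infer_field_types, describe)
are made up for illustration and are not an existing Arrow API:

```python
from typing import Any, Dict, Iterable, List


def json_type(value: Any) -> str:
    """Map a decoded JSON value to a (hypothetical) logical type name."""
    if value is None:
        return "null"
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "struct"
    raise TypeError(f"not a JSON value: {value!r}")


def infer_field_types(records: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect the distinct types seen for each field, in first-seen order."""
    seen: Dict[str, List[str]] = {}
    for record in records:
        for name, value in record.items():
            types = seen.setdefault(name, [])
            observed = json_type(value)
            if observed not in types:
                types.append(observed)
    return seen


def describe(types: List[str]) -> str:
    """A single observed type stays as-is; several become a union."""
    if len(types) == 1:
        return types[0]
    return "union<" + ", ".join(types) + ">"


records = [
    {"a": "herp", "b": 10},
    {"a": True, "c": "derp"},
]
schema = {name: describe(types) for name, types in infer_field_types(records).items()}
print(schema)
```

This ignores nullability for fields missing from some records (`b` and
`c` above), which a real implementation would also have to decide on.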

I think we should also evaluate ingesting different schema DSLs
(protobuf IDL, JSON Schema) to avoid inference entirely.
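
As a sketch of what I mean, assuming only a tiny subset of JSON Schema
(flat "properties" with a "type" keyword that is either a string or a
list of strings), the column types could be read straight from the
schema. The mapping table and the function name are hypothetical, not
an existing API:

```python
from typing import Any, Dict

# Hypothetical mapping from JSON Schema primitive type names to the
# logical type names used in this thread.
JSON_SCHEMA_TO_TYPE = {
    "null": "null",
    "boolean": "bool",
    "integer": "int64",
    "number": "double",
    "string": "string",
    "array": "list",
    "object": "struct",
}


def types_from_json_schema(schema: Dict[str, Any]) -> Dict[str, str]:
    """Derive field types from a flat JSON Schema object, with no inference."""
    fields: Dict[str, str] = {}
    for name, prop in schema.get("properties", {}).items():
        declared = prop["type"]
        # In JSON Schema, "type" may be one name or a list of allowed names.
        names = declared if isinstance(declared, list) else [declared]
        mapped = [JSON_SCHEMA_TO_TYPE[n] for n in names]
        if len(mapped) == 1:
            fields[name] = mapped[0]
        else:
            fields[name] = "union<" + ", ".join(mapped) + ">"
    return fields


example_schema = {
    "type": "object",
    "properties": {
        "a": {"type": ["string", "boolean"]},
        "b": {"type": "integer"},
        "c": {"type": "string"},
    },
}
declared_types = types_from_json_schema(example_schema)
print(declared_types)
```

Nested objects, "anyOf"/"oneOf" and formats such as "date-time" are left
out here; they would need their own mapping decisions.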

On Fri, Nov 30, 2018 at 9:43 AM Ben Kietzman <ben.kietz...@rstudio.com>
wrote:

> Hi Antoine,
>
> The conversion of previous blocks is part of the fall back mechanism I'm
> trying to describe. When type inference fails (even in a different block),
> conversion of all blocks of the column is attempted to the next type in the
> fallback graph.
>
> If there is no problem with the fallback graph model, the API would
> probably look like a reusable LoosenType: something which simplifies
> querying for the loosened type when inference fails.
>
> Unrelated: I forgot to include some edges in the json graph
>
> NULL -> BOOL
> NULL -> INT64 -> DOUBLE
> NULL -> TIMESTAMP -> STRING -> BINARY
> NULL -> STRUCT
> NULL -> LIST
>
> On Fri, Nov 30, 2018, 04:52 Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi Ben,
> >
> > On 30/11/2018 at 02:19, Ben Kietzman wrote:
> > > Currently, to figure out which types may be inferred and under which
> > > circumstances they will be inferred involves digging through code. I
> > > think it would be useful to have an API for expressing type inference
> > > rules.
> > > Ideally this would be provided as utility functions alongside
> > > StringConverter and used by anything which does type inference while
> > > parsing/unboxing.
> >
> > It may be a bit more complicated.  For example, a CSV file is parsed by
> > blocks, and each block produces an array chunk.  But when the type of a
> > later block changes due to type inference failing on the current type,
> > all previous blocks must be parsed again.
> >
> > So I'm curious what you would make the API look like.
> >
> > > By contrast, when reading JSON (which is explicit about numbers vs
> > > strings), the graph would be:
> > >
> > >   NULL -> BOOL
> > >   NULL -> INT64 -> DOUBLE
> > >   NULL -> TIMESTAMP -> STRING -> BINARY
> > >
> > > Seem reasonable?
> > > Is there a case which isn't covered by a fallback graph as above?
> >
> > I have no idea.  Someone else may be able to answer your question.
> >
> > Regards
> >
> > Antoine.
> >
>


-- 
Sent from my jetpack.
