Well, the restart strategy still works for your examples, and you only pay
once. From then on you look at the cached type information and use an
upper-bound data type as you read the data. Since reading the values in the
right order works, it is obviously possible to push typing information down
even into the JSON reader.
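
To make that concrete, here is a minimal sketch in Python rather than
anything resembling Drill's actual reader. Every name in it (WIDENING,
type_of, infer_schema, read_with_schema) is hypothetical, the
null < bigint < float8 < varchar widening order is an assumption, and the
keys are quoted so that the records are valid JSON.

```python
import json

# Assumed widening order: each type can represent every value of the ones before it.
WIDENING = ["null", "bigint", "float8", "varchar"]

# How to coerce a raw JSON value into each (hypothetical) column type.
CASTS = {"bigint": int, "float8": float, "varchar": str}


def type_of(value):
    """Map one JSON value to the narrowest type name that holds it."""
    if value is None:
        return "null"
    if isinstance(value, int) and not isinstance(value, bool):
        return "bigint"
    if isinstance(value, float):
        return "float8"
    return "varchar"  # strings and everything else use the widest type


def upper_bound(a, b):
    """Return whichever of two type names is wider."""
    return max(a, b, key=WIDENING.index)


def infer_schema(lines):
    """First pass (paid once): compute the upper-bound type of every column."""
    schema = {}
    for line in lines:
        for col, value in json.loads(line).items():
            schema[col] = upper_bound(schema.get(col, "null"), type_of(value))
    return schema


def read_with_schema(lines, schema):
    """Later passes: read with the cached types, so no mid-scan type change."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for col, col_type in schema.items():
            value = record.get(col)
            if value is None or col_type == "null":
                row[col] = None  # missing or null values stay null
            else:
                row[col] = CASTS[col_type](value)
        yield row


records = ['{"x": 10}', '{"x": 10.1}']
cached = infer_schema(records)                 # x is widened to float8
rows = list(read_with_schema(records, cached))
```

The first scan is the restart cost. After that the cached dict is the
schema, so x is read as float8 from the very first record, and a column
missing from one file comes back as a null of the type seen in the others.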



On Tue, Apr 3, 2018, 21:42 Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Subtle point. I can provide schema with Parquet, as you note. (Actually,
> for Parquet, Drill is schema-required: I can't not provide a schema due to
> the nature of Parquet...)
>
> But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> forbids the user from providing a schema; only the file format itself can
> provide the schema (or not, in the case of JSON). This is the very heart of
> the problem.
>
> The root cause of our schema change exception is that vectors are, indeed,
> strongly typed. But, file columns are not. Here is my favorite:
>
> {x: 10} {x: 10.1}
>
> Blam! Query fails because the vector is chosen as BigInt, then we discover
> it really should have been Float8. (If the answer is: go back and rebuild
> the vector with the new type, consider the case that 100K records separate
> the two above, so that the first batch is long gone by the time we see the
> offending record.) If only I could tell Drill to use Float8 (or Decimal) up
> front...
>
> Views won't help here because the failure occurs before a view can kick
> in. However, presumably, I could write a view to handle a different classic
> case:
>
> myDir /
> |- File 1: {a: 10, b: "foo"}
> |- File 2: {a: 20}
>
> With query: SELECT a, b FROM myDir
>
> For File 2, Drill will guess that b is a Nullable Int, but it is really
> VarChar. I think I could write clever SQL that says:
>
> If b is of type Nullable Int, return NULL cast to nullable VarChar, else
> return b
>
> The irony is that I must write procedural code to declare a static
> attribute of the data. Yet SQL is otherwise declarative: I state what I
> want, not how to implement it.
>
> Life would be so much easier if I could just say, "trust me, when you read
> column b, it is a VarChar."
>
> Thanks,
> - Paul
>
>
>
> On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning
> <ted.dunn...@gmail.com> wrote:
>
> I don't see why you say that Drill is schema-forbidden.
>
> The Parquet reader, for instance, makes strong use of the implied schema to
> facilitate reading of typed data.
>
> Likewise, the vectorized internal format is strongly typed and, as such,
> uses schema information.
>
> Views are another way to communicate schema information.
>
> It is true that you can't, say, view comments on fields from the command
> line. But I don't understand saying "schema-forbidden".
>
>
> On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Here is another way to think about it. Today, Drill is
> > "schema-forbidden": even if I know the schema, I can't communicate that
> > to Drill; Drill must figure it out on its own, making the same mistakes
> > every time on ambiguous schemas.
> >
>
