Hi Stefan
  This is the same old issue: Drill does an initial scan to determine the
type of a field. In cases where Drill encounters nulls in the data it
defaults to using a Nullable Int as the type (not a good choice perhaps).
  This leads to all sorts of issues (most of which you're hitting).
  There is an effort to improve this (DRILL-3228) but it will be a while
before this work is completed.

  In the meantime, I can only suggest a workaround : use as cast around
your columns -

 select p.type, coalesce( cast(p.dimensions.dim_type as varchar(20)),
cast(p.dimensions.type as varchar(20))) dimensions_type, count(*) from
`test.json` as p where occurred_at > '2015-07-26' and p.type in
('plan.item.added','plan.item.removed') group by p.type,
coalesce(cast(p.dimensions.dim_type as varchar(20)), cast(p.dimensions.type
as varchar(20)));





On Mon, Jul 27, 2015 at 4:59 AM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi,
>
> It seems that null values can trigger a column to be treated as a numeric
> one, in expressions evaluation, regardless of content or other indicators
> and that fields in substructures can affect same-named-fields in parent
> structure.
> (1.2-SNAPSHOT, parquet files)
>
> I have JSON data that can be reduced to to this:
>
>    - {"occurred_at":"2015-07-26
>
>  
> 08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}}
>    - {"occurred_at":"2015-07-26
>
>  
> 08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}}
>    - {"occurred_at":"2015-07-26
>    08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To
>    See","category":"Nature","sub_category":"Waterfalls"}}
>
> * notice the discrepancy in the dimensions structure that the type field is
> either called type or dim_type (slightly relevant for the rest of this
> case)
>
>
> *1. Query where dimensions are not involved*
>
> select p.type, count(*) from
> dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where occurred_at
> > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
> by p.type;
> +--------------------+---------+
> |        type        | EXPR$1  |
> +--------------------+---------+
> | plan.item.removed  | 947     |
> | plan.item.added    | 40342   |
> +--------------------+---------+
> 2 rows selected (0.508 seconds)
>
>
> *2. Same query but involves dimension.type as well*
>
> select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type)
> dimensions_type, count(*) from
> dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where occurred_at
> > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
> by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type);
>
> Error: SYSTEM ERROR: NumberFormatException: To See
> Fragment 2:0
> [Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010]
> (state=,code=0)
>
>
> I can provide test data if this is not enough to reproduce this bug.
>
> Regards,
>  -Stefán
>

Reply via email to