Based on my limited understanding of Drill's KuduRecordReader, the problem
seems to be in the next() method [1]. When the RowResult iterator returns
false from hasNext() (the case where the filter prunes everything), the code
skips the call to addRowResult(). That means no columns/data are added to
the scan's batch, and a nullable int will be injected in the downstream
operator.

1.
https://github.com/apache/drill/blob/master/contrib/storage-kudu/src/main/java/org/apache/drill/exec/store/kudu/KuduRecordReader.java#L149-L163
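
To make the failure mode above concrete, here is a small self-contained sketch of the control flow. This is not Drill's actual code; the class, method, and column names are all invented for illustration. The point is that when column allocation happens only inside the per-row loop, an empty scan produces a batch with no schema at all, whereas allocating from the scanner's schema up front would preserve it even when no rows survive pruning:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of the KuduRecordReader.next() issue: in the buggy shape,
// column vectors are only created inside the per-row loop, so an empty
// scan yields a schema-less batch. All names here are invented.
public class EmptyBatchSketch {

    // Stand-in for the scanner's schema, which is always available,
    // even when the predicate prunes every row.
    static final List<String> SCHEMA = List.of("campaign_id", "clicks");

    // Buggy shape: columns appear only when addRowResult() runs.
    static List<String> nextBuggy(Iterator<int[]> rows) {
        List<String> batchColumns = new ArrayList<>();
        while (rows.hasNext()) {
            rows.next();
            batchColumns = new ArrayList<>(SCHEMA); // addRowResult() side effect
        }
        // Empty scan: no columns, so downstream guesses nullable int.
        return batchColumns;
    }

    // Possible fix shape: declare columns from the scanner schema first.
    static List<String> nextFixed(Iterator<int[]> rows) {
        List<String> batchColumns = new ArrayList<>(SCHEMA);
        while (rows.hasNext()) {
            rows.next(); // addRowResult() fills the already-declared vectors
        }
        return batchColumns;
    }

    public static void main(String[] args) {
        Iterator<int[]> empty = List.<int[]>of().iterator();
        System.out.println(nextBuggy(empty).size());  // 0: batch has no schema
        Iterator<int[]> empty2 = List.<int[]>of().iterator();
        System.out.println(nextFixed(empty2).size()); // 2: schema survives pruning
    }
}
```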


On Mon, Jul 24, 2017 at 1:35 PM, Cliff Resnick <cre...@gmail.com> wrote:

> Jinfeng,
>
> I'm wondering if there's a way to push schema info to Drill even if there
> is no result. KuduScanner always has schema, and RecordReader always has
> scanner. But I can't seem to find the disconnect. Any idea if this is
> possible even if it's Kudu-specific hack?
>
> -Cliff
>
> On Mon, Jul 24, 2017 at 2:46 PM, Cliff Resnick <cre...@gmail.com> wrote:
>
>> Jinfeng,
>>
>> Thanks, that confirms my thoughts as well. If I query using full range
>> bounds and all hash keys, then Kudu prunes to the exact tablets and there
>> is no error. I'll watch that jira expectantly because Kudu + Drill would be
>> an awesome combo. But without the pruning it's useless to us.
>>
>> -Cliff
>>
>> On Mon, Jul 24, 2017 at 2:17 PM, Jinfeng Ni <j...@apache.org> wrote:
>>
>>> If you see such errors only when you enable predicate pushdown, it might
>>> be related to a known issue: schema change failure caused by an empty
>>> batch [1]. This happens when the predicate prunes everything and the Kudu
>>> reader does not return a RowResult with a schema. In that case, Drill
>>> interprets the requested column (such as a) as a nullable int, which
>>> conflicts with other minor fragments that may have the data/schema.
>>>
>>> The reason you hit such failures randomly is that there is a race
>>> condition for the conflict to happen. If the minor fragment with the
>>> empty batch is executed after the one with data, the empty batch is
>>> ignored. In the reverse order, it causes a conflict, hence the query
>>> failure.
>>>
>>> 1. https://issues.apache.org/jira/browse/DRILL-5546
>>>
>>>
>>>
>>> On Mon, Jul 24, 2017 at 10:56 AM, Cliff Resnick <cre...@gmail.com>
>>> wrote:
>>>
>>> > I spent some time over the weekend altering Drill's storage-kudu to use
>>> > Kudu's predicate pushdown API. Everything worked great as long as I
>>> > performed flat filtered selects (e.g. SELECT .. FROM .. WHERE ..), but
>>> > whenever I tested aggregate queries, they would succeed sometimes, then
>>> > fail other times -- using the exact same queries.
>>> >
>>> > The failures were always like below. After searching around, I came
>>> > across a number of jiras, like
>>> > https://issues.apache.org/jira/browse/DRILL-2602, that imply Drill
>>> > can't handle sorts/aggregate queries on "changing schemas". This was
>>> > confusing to me because I was testing with a single table/single
>>> > schema, which leaves me wondering if "changing schema" means the
>>> > unknown type of the aggregate itself? Meaning, in SELECT SUM(a), b
>>> > FROM t GROUP BY b; where field a is an INT64, Drill can't figure out
>>> > how to deal with SUM(a) because it may exceed the scale of INT64?
>>> >
>>> > If someone could clarify this for me I'd really appreciate it. I'm
>>> > really hoping my above understanding is not correct and it's just a
>>> > problem with the Vector handling in storage-kudu, because otherwise it
>>> > seems that Drill's aggregation capabilities are rather limited.
>>> >
>>> > Errors:
>>> >
>>> > java.lang.IllegalStateException: Failure while reading vector.
>>> > Expected vector class of org.apache.drill.exec.vector.NullableIntVector
>>> > but was holding vector class org.apache.drill.exec.vector.BigIntVector,
>>> > field=campaign_id(BIGINT:REQUIRED)
>>> > at org.apache.drill.exec.record.VectorContainer.getValueAccessorById(VectorContainer.java:321)
>>> > at org.apache.drill.exec.record.RecordBatchLoader.getValueAccessorById(RecordBatchLoader.java:179)
>>> >
>>> > OR
>>> >
>>> > Error: UNSUPPORTED_OPERATION ERROR: Sort doesn't currently support
>>> > sorts with changing schemas.
>>> >
>>>
>>
>>
>
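
To make the race Jinfeng describes concrete, here is a toy model, not Drill's actual batch-merging code; the Batch record, type names, and ignore-late-empty-batch rule are all invented to illustrate the order dependence. A downstream operator pins its schema on the first batch it sees; a later empty batch with a guessed type can be ignored, but an empty batch that arrives first pins the wrong type and the real data then conflicts:

```java
import java.util.List;

// Toy model of the schema race: the first arriving batch pins the column
// type. A late empty batch with a mismatched (guessed) type is ignored,
// but an early one poisons the schema. All names are invented.
public class SchemaRaceSketch {

    public record Batch(String type, int rows) {}

    static String merge(List<Batch> arrivalOrder) {
        String pinned = null;
        for (Batch b : arrivalOrder) {
            if (pinned == null) {
                pinned = b.type();            // first batch pins the schema
            } else if (!pinned.equals(b.type())) {
                if (b.rows() == 0) continue;  // late empty batch: ignored
                return "schema conflict";     // real data vs. wrong pinned type
            }
        }
        return "ok: " + pinned;
    }

    public static void main(String[] args) {
        // Data batch first: the empty nullable-int guess arrives late and is ignored.
        List<Batch> dataFirst = List.of(new Batch("BigInt", 5), new Batch("NullableInt", 0));
        // Empty batch first: it pins NullableInt, so the real BigInt data conflicts.
        List<Batch> emptyFirst = List.of(new Batch("NullableInt", 0), new Batch("BigInt", 5));
        System.out.println(merge(dataFirst));  // ok: BigInt
        System.out.println(merge(emptyFirst)); // schema conflict
    }
}
```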
