Github user dsbos commented on a diff in the pull request:
https://github.com/apache/drill/pull/228#discussion_r43796603
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
---
@@ -325,10 +325,13 @@ public AggOutcome doWork() {
if (EXTRA_DEBUG_1) {
logger.debug("Received new schema. Batch has {}
records.", incoming.getRecordCount());
}
- newSchema = true;
- this.cleanup();
- // TODO: new schema case needs to be handled appropriately
- return AggOutcome.UPDATE_AGGREGATOR;
+ final BatchSchema newIncomingSchema = incoming.getSchema();
+ if ((! newIncomingSchema.equals(schema)) && schema !=
null) {
--- End diff --
> You shouldn't put a hack in to ignore a valid schema change.
Yeah, I'm reverting this HashAggTemplate change.
> I don't know all the details but if this is related to HBase, my thought
is you probably need
> to resolve 4010 to avoid spurious schema changes.
It's not clear whether it's related to HBase or not. (The relevant test
failure did occur in an HBase test, but it was in a Jenkins-run cluster test
that has been hard to reproduce elsewhere. At one point (I thought) I had a
reproduction using just JSON files, but now that doesn't seem to work.)
> You should make sure to avoid propagating schema changes that are not
real/required.
Related to that, ExternalSortBatch seems to return two OK_NEW_SCHEMA
batches with schemas differing only in their selection vectors:
11:23:50.149 [main] TRACE o.a.d.e.p.i.v.IteratorValidatorBatchIterator -
[#3; on ExternalSortBatch]: incoming next() return: #records = 204,
schema:
BatchSchema [fields=[`key1`(INT:REQUIRED), `alt`(BIGINT:REQUIRED)],
selectionVector=FOUR_BYTE],
prev. new (not equal):
BatchSchema [fields=[`key1`(INT:REQUIRED), `alt`(BIGINT:REQUIRED)],
selectionVector=NONE]
(The earlier schema ("prev. new:") seems to come from setting up the sort,
and the latter one ("schema:") seems to come from completing the sort (it was
triggered by a NONE from upstream).)
Is that correct? (Should the selection vectors have been the same? Should
the second return have been OK instead of OK_NEW_SCHEMA?)
If that is correct (e.g., vectors were changed and should have been), how
should downstream operators, or at least aggregation, handle that?
Should they be able to handle a "technical" schema change (OK_NEW_SCHEMA
with possibly-changed vectors or certain details of the schema) that is not a
logical schema change (not a change in the logical fields or types in the
aggregated value), even if arbitrary logical schema changes can't be handled
(e.g., because some don't even make sense)?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---