[
https://issues.apache.org/jira/browse/DRILL-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187208#comment-16187208
]
ASF GitHub Bot commented on DRILL-5830:
---------------------------------------
GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/968
DRILL-5830: Resolve regressions to MapR DB from DRILL-5546
DRILL-5546 fixed a wide variety of "empty batch" problems, but it also
introduced regressions in the HBase and MapR-DB binary storage plugins. This
PR refines the fix to resolve those regressions.
Prior to DRILL-5546, HBase provided a project push-down rule to expand
wildcard columns. However, a bug in the push-down rule prevented it from
working properly. DRILL-5546 fixed that rule, but it also added explicit
wildcard expansion for HBase, which turned out to be redundant and so is
backed out in this PR.
Wildcard expansion in the HBase storage plugin was meant to overcome the
schema change conflict that occurs with empty regions. In such regions, we get
the row key and the column family as an empty map. In regions with data, we get
the row key and a non-empty map for the column family. Examples:
* Empty: (row_key, cf{})
* Non-empty: (row_key, cf{col1, col2})
Where cf is a column family and col1, col2 are columns.
It turns out that the receivers were getting confused: the
`RecordBatchLoader` class treated empty and non-empty maps as the same
schema. This was mentioned in DRILL-5546:
> In HBase a column family always has map type, and a non-rowkey column
> always has nullable varbinary type, this ensures that HBaseRecordReader across
> different HBase regions will have the same top level schema, even if the region
> is empty or prune all the rows due to filter pushdown optimization. In other
> words, we will not see different top level schema from different
> HBaseRecordReader for the same table.
The problem is, a difference in map content really is a schema change, so
we need to detect and report it. This PR makes that change.
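The difference between the two comparisons can be sketched as follows. This is only an illustration of the idea, not the code in `RecordBatchLoader`: the `ColumnSchema` and `SchemaCompare` classes, the type names, and the positional comparison are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified column description; not Drill's own schema classes.
final class ColumnSchema {
  final String name;
  final String type;                 // e.g. "VARBINARY" or "MAP"
  final List<ColumnSchema> children; // members of a MAP column, empty otherwise

  ColumnSchema(String name, String type, ColumnSchema... children) {
    this.name = name;
    this.type = type;
    this.children = Collections.unmodifiableList(Arrays.asList(children));
  }
}

final class SchemaCompare {
  // Top-level only: a map column matches any other map column of the same name.
  static boolean sameTopLevel(List<ColumnSchema> a, List<ColumnSchema> b) {
    if (a.size() != b.size()) return false;
    for (int i = 0; i < a.size(); i++) {
      ColumnSchema x = a.get(i), y = b.get(i);
      if (!x.name.equals(y.name) || !x.type.equals(y.type)) return false;
    }
    return true;
  }

  // Deep: map members are compared recursively, so cf{} differs from cf{col1, col2}.
  static boolean sameDeep(List<ColumnSchema> a, List<ColumnSchema> b) {
    if (a.size() != b.size()) return false;
    for (int i = 0; i < a.size(); i++) {
      ColumnSchema x = a.get(i), y = b.get(i);
      if (!x.name.equals(y.name) || !x.type.equals(y.type)) return false;
      if (!sameDeep(x.children, y.children)) return false;
    }
    return true;
  }

  // (row_key, cf{}) as produced by an empty region.
  static List<ColumnSchema> emptyRegion() {
    return Arrays.asList(
        new ColumnSchema("row_key", "VARBINARY"),
        new ColumnSchema("cf", "MAP"));
  }

  // (row_key, cf{col1, col2}) as produced by a region with data.
  static List<ColumnSchema> dataRegion() {
    return Arrays.asList(
        new ColumnSchema("row_key", "VARBINARY"),
        new ColumnSchema("cf", "MAP",
            new ColumnSchema("col1", "VARBINARY"),
            new ColumnSchema("col2", "VARBINARY")));
  }

  public static void main(String[] args) {
    System.out.println(sameTopLevel(emptyRegion(), dataRegion())); // true: change is missed
    System.out.println(sameDeep(emptyRegion(), dataRegion()));     // false: change is detected
  }
}
```

With the top-level comparison, a receiver that sees the empty-region batch first keeps the empty map and has nowhere to put col1 and col2 when they arrive; the deep comparison reports the second batch as a schema change instead.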
Now, as it turns out, the changes made by DRILL-5546 to the top-level project
operator gracefully remove the empty batches (with empty maps), passing along
just the non-empty batches (with non-empty maps).
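As a rough illustration of that effect (not the project operator's actual code), dropping zero-row batches leaves downstream operators with a single schema. The `Batch` and `EmptyBatchFilter` classes below are hypothetical, and the schema is reduced to a descriptive string for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a record batch: a schema description plus a row count.
final class Batch {
  final String schema;
  final int rowCount;

  Batch(String schema, int rowCount) {
    this.schema = schema;
    this.rowCount = rowCount;
  }
}

final class EmptyBatchFilter {
  // Pass along only batches that carry rows; zero-row batches are dropped.
  static List<Batch> dropEmpty(List<Batch> incoming) {
    List<Batch> kept = new ArrayList<>();
    for (Batch b : incoming) {
      if (b.rowCount > 0) {
        kept.add(b);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Batches from empty regions and from regions with data, interleaved.
    List<Batch> fromRegions = Arrays.asList(
        new Batch("(row_key, cf{})", 0),
        new Batch("(row_key, cf{col1, col2})", 3),
        new Batch("(row_key, cf{})", 0),
        new Batch("(row_key, cf{col1, col2})", 5));

    for (Batch b : dropEmpty(fromRegions)) {
      System.out.println(b.schema + " rows=" + b.rowCount);
    }
  }
}
```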
In short, this PR:
* Backs out the HBase-specific changes made in DRILL-5546,
* Fixes the schema change issue in `RecordBatchLoader`,
* Adds unit tests for the fixes to `RecordBatchLoader`,
* Does a number of minor code cleanups.
The result is that the HBase problems that DRILL-5546 solved are still
solved, but the regressions to MapR DB binary are also fixed.
We propose to modify the MapR DB binary storage plugin to do projection
push-down the same way as is done in HBase, but that will be a separate PR.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-5830
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/968.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #968
----
commit 6cb5e98c6cccfc519cd7413248685ddc42af96b4
Author: Paul Rogers <[email protected]>
Date: 2017-09-28T16:49:38Z
Back out HBase changes
commit 333bd1b36e72950d926c899b9050ddfae09fc817
Author: Paul Rogers <[email protected]>
Date: 2017-09-30T01:14:55Z
Code cleanup
commit 8baca87708af2a38c5748a1a0435312e34e90903
Author: Paul Rogers <[email protected]>
Date: 2017-09-30T01:18:16Z
Test utilities
commit 2ce7bf76dc37393b0326e23db99706f2abea7f5c
Author: Paul Rogers <[email protected]>
Date: 2017-09-30T01:18:48Z
Fix for DRILL-5829
commit f660731df0168304456976e22687459b41d35546
Author: Paul Rogers <[email protected]>
Date: 2017-09-30T20:58:22Z
Code cleanup
----
> Resolve regressions to MapR DB from DRILL-5546
> ----------------------------------------------
>
> Key: DRILL-5830
> URL: https://issues.apache.org/jira/browse/DRILL-5830
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.12.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> DRILL-5546 added a number of fixes for empty batches. One part of the fix was
> for HBase. Key changes:
> * Add code to expand wildcards (i.e. SELECT *) in the planner.
> * Remove support for wildcards in the HBase record reader.
> As noted in DRILL-5775, this change had the effect of breaking support for
> MapR-DB binary (which is API compatible with HBase). DRILL-5775 addresses this
> by expanding wildcards in the planner for MapR DB as was done for HBase in
> DRILL-5546.
> Unfortunately, this change introduced other regressions into the code as
> described by DRILL-5706.
> Investigation of those issues revealed that we should back out the original
> DRILL-5546 changes and go down a different route.
> As it turns out, HBase already had a project push-down rule that expanded
> wildcards. However, that rule didn't work correctly some of the time.
> DRILL-5546 fixed that bug, ensuring that wildcards are expanded (at least in
> the cases tested for this ticket.)
> The actual issue turned out to be a bug in the {{RecordBatchLoader}} class
> which did not consider map contents when detecting schema change. As a
> result, batches like (row_key, cf\{}) were treated the same as (row_key,
> cf\{mycol}) and the actual data columns were discarded, intermittently,
> depending on batch arrival order.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)