Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/906#discussion_r135382803
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java
---
@@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
}
}
- private void releaseAssets() {
- container.zeroVectors();
- }
-
- private void clearFieldVectorMap() {
- for (final ValueVector v : mutator.fieldVectorMap().values()) {
- v.clear();
- }
- }
-
@Override
public IterOutcome next() {
if (done) {
return IterOutcome.NONE;
}
oContext.getStats().startProcessing();
try {
- try {
- injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
-
- currentReader.allocate(mutator.fieldVectorMap());
- } catch (OutOfMemoryException e) {
- clearFieldVectorMap();
- throw UserException.memoryError(e).build(logger);
- }
- while ((recordCount = currentReader.next()) == 0) {
+ while (true) {
try {
- if (!readers.hasNext()) {
- // We're on the last reader, and it has no (more) rows.
- currentReader.close();
- releaseAssets();
- done = true; // have any future call to next() return NONE
-
- if (mutator.isNewSchema()) {
- // This last reader has a new schema (e.g., we have a zero-row
- // file or other source). (Note that some sources have a
- // non-null/non-trivial schema even when there are no rows.)
injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
--- End diff --
This patch tries to decouple the logic of the record reader and ScanBatch:
- The record reader is responsible for adding vectors to the batch (via Mutator) and for populating data.
- ScanBatch is responsible for interpreting the record reader's output, checking rowCount and Mutator.isNewSchema() to decide whether to return OK_NEW_SCHEMA, OK, or NONE.
> What happens on the first reader? There is no schema, so any schema is a new schema. Suppose the file is JSON and the schema is built on the fly. Does the code handle the case that we have no schema (first reader), and that reader adds no columns?
It's not true that "any schema is a new schema". If the first reader has no schema and adds no columns, then Mutator.isNewSchema() should return false. Mutator.isNewSchema() returns true only if, since the last call, one or more of the following happened:
- a new top-level field was added,
- a field was added inside a nested field, or
- an existing field's type was changed.
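The tracking described above can be sketched as follows. This is a minimal, hypothetical illustration of the isNewSchema() contract, not Drill's actual Mutator (which works on ValueVectors and nested maps rather than plain name/type strings):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of schema-change tracking; Drill's real Mutator
// in ScanBatch.java is more involved (vectors, nested fields, etc.).
class SchemaTracker {
  private final Map<String, String> fields = new LinkedHashMap<>(); // name -> type
  private boolean schemaChanged = false;

  // Adding a brand-new field, or changing an existing field's type,
  // flags a schema change.
  void addField(String name, String type) {
    String previous = fields.put(name, type);
    if (previous == null || !previous.equals(type)) {
      schemaChanged = true;
    }
  }

  // Reports whether the schema changed since the last call, then resets
  // the flag, matching the "only after the last call" semantics above.
  boolean isNewSchema() {
    boolean result = schemaChanged;
    schemaChanged = false;
    return result;
  }
}
```

With this shape, a reader that adds no columns never flips the flag, so an empty first reader reports isNewSchema() == false.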
You may argue that a more appropriate way to represent an empty JSON file is an empty schema. However, that would lead to various schema conflicts in downstream operators if another scan thread has non-empty JSON files. This is exactly what happened before this patch.
The proposal in this patch is to **ignore** empty JSON files, since 1) rowCount = 0, and 2) no new columns were added to the batch.
- If all the record readers for a scan thread return rowCount = 0 and produce no new schema, then the scan thread returns NONE directly, without first returning OK_NEW_SCHEMA.
- If at least one reader returns either >0 rows or a new schema, then the scan thread returns a batch with the new schema.
- If all scan threads return NONE directly, implying the entire table has no data/schema, that is the case Project.handleFastNone() deals with.
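The three cases above can be condensed into a small decision sketch. This is a simplified, hypothetical model (ReaderResult and firstOutcome are invented names; the real ScanBatch.next() loops over live RecordReaders and a Mutator):

```java
import java.util.List;

// Hedged sketch of the scan thread's first-outcome decision described above.
class ScanSketch {
  enum IterOutcome { OK, OK_NEW_SCHEMA, NONE }

  // Stand-in for what one record reader reported for its first batch.
  static final class ReaderResult {
    final int rowCount;
    final boolean newSchema;
    ReaderResult(int rowCount, boolean newSchema) {
      this.rowCount = rowCount;
      this.newSchema = newSchema;
    }
  }

  static IterOutcome firstOutcome(List<ReaderResult> readers) {
    for (ReaderResult r : readers) {
      // Any rows or any new columns: the scan's first batch carries
      // its schema, so it goes out as OK_NEW_SCHEMA.
      if (r.rowCount > 0 || r.newSchema) {
        return IterOutcome.OK_NEW_SCHEMA;
      }
      // rowCount == 0 and no new columns (e.g. an empty JSON file):
      // ignore this reader and try the next one.
    }
    // Every reader was empty and schema-less: return NONE directly.
    return IterOutcome.NONE;
  }
}
```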
> But, if the input is CSV, then we always have a schema. If the file has column headers, then we know that the schema is, say, (a, b, c) because those are the headers. Or, if the file has no headers, the schema is always the columns array. So, should we send that schema downstream? If so, should it include the implicit columns?
If CSV always adds columns (either _a, b, c_ or _columns_), then ScanBatch will produce a batch with (a, b, c), or with columns. It does not make sense to ignore that schema.
- In the case of files with headers, a file with _a,b,c_ leads to a batch with (a,b,c), while a file with _a,b,c,d_ leads to a batch with (a,b,c,d). Those two files cause a schema change, which is the expected behavior.
- In the case of files without headers, all files produce a batch with columns, which means there is no schema change across different files, regardless of whether they have 0 rows or more.
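The two CSV cases above can be illustrated with a toy schema function. This is a hypothetical example (CsvSchemaExample and schemaOf are invented names, not Drill's text reader code):

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the CSV schema behavior described above.
class CsvSchemaExample {
  // With headers, the schema is the header row's field names; without
  // headers, every file exposes the single `columns` array column.
  static List<String> schemaOf(String headerLine, boolean hasHeaders) {
    return hasHeaders
        ? Arrays.asList(headerLine.split(","))
        : Arrays.asList("columns");
  }
}
```

Two header files with different headers yield different schemas (a schema change downstream), while header-less files always agree on the single `columns` schema.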