[jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch

ASF GitHub Bot (JIRA) Mon, 21 Aug 2017 12:54:27 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135710#comment-16135710
 ]


ASF GitHub Bot commented on DRILL-5546:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/906#discussion_r134305681
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java 
---
    @@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
         }
       }
     
    -  private void releaseAssets() {
    -    container.zeroVectors();
    -  }
    -
    -  private void clearFieldVectorMap() {
    -    for (final ValueVector v : mutator.fieldVectorMap().values()) {
    -      v.clear();
    -    }
    -  }
    -
       @Override
       public IterOutcome next() {
         if (done) {
           return IterOutcome.NONE;
         }
         oContext.getStats().startProcessing();
         try {
    -      try {
    -        injector.injectChecked(context.getExecutionControls(), 
"next-allocate", OutOfMemoryException.class);
    -
    -        currentReader.allocate(mutator.fieldVectorMap());
    -      } catch (OutOfMemoryException e) {
    -        clearFieldVectorMap();
    -        throw UserException.memoryError(e).build(logger);
    -      }
    -      while ((recordCount = currentReader.next()) == 0) {
    +      while (true) {
             try {
    -          if (!readers.hasNext()) {
    -            // We're on the last reader, and it has no (more) rows.
    -            currentReader.close();
    -            releaseAssets();
    -            done = true;  // have any future call to next() return NONE
    -
    -            if (mutator.isNewSchema()) {
    -              // This last reader has a new schema (e.g., we have a 
zero-row
    -              // file or other source).  (Note that some sources have a 
non-
    -              // null/non-trivial schema even when there are no rows.)
    +          injector.injectChecked(context.getExecutionControls(), 
"next-allocate", OutOfMemoryException.class);
    --- End diff --
    
    From the PR description:
    
    > 1. Skip RecordReader if it returns 0 row && present same schema. A new 
schema (by calling Mutator.isNewSchema() ) means either a new top level field 
is added, or a field in a nested field is added, or an existing field type is 
changed.
    
    The code, however, adds an additional condition: if implicit fields change. 
(But, as noted below, that should never occur in practice.)
    
    What happens on the first reader? There is no schema, so *any* schema is a 
new schema. Suppose the file is JSON and the schema is built on the fly. Does 
the code handle the case that we have no schema (first reader), and that reader 
adds no columns?
    
    Or, according to the logic that the downstream wants to know the schema, 
even if there are no records, do we send an empty schema (schema with no 
columns) downstream, because that is an accurate representation of an empty 
JSON file?
    
    What happens in the case of an empty JSON file following a non-empty file? 
In this case, do we consider the empty schema as a schema change relative to a 
non-empty change?
    
    In short, can we generalize this first rule a bit?
    
    > 2. Implicit columns are added and populated only when the input is not 
empty, i.e. the batch contains > 0 row or rowCount == 0 && new schema.
    
    How does this interact with a scan batch that has only one file, and that 
file is empty? Would we return the empty schema downstream? With the implicit 
columns?
    
    > 3. ScanBatch will return NONE directly (called as "fast NONE"), if all 
its RecordReaders haver empty input and thus are skipped, in stead of returing 
OK_NEW_SCHEMA first.
    
    
    This is just a bit ambiguous. If the reader is JSON, then an empty file has 
an empty schema (for reasons cited above.)
    
    But, if the input is CSV, then we *always* have a schema. If the file has 
column headers, then we know that the schema is, say, (a, b, c) because those 
are the headers. Or, if the file has no headers, the schema is always the 
`columns` array. So, should we send that schema downstream? If so, should it 
include the implicit columns?
    
    This, in fact, raises another issue (out of scope for this PR): if we 
return an empty batch with non-empty schema, we have no place to attach the 
implicit columns that will allow the user to figure out that, say, "foo.csv" is 
empty.
    
    On the other hand, if we say that an empty CSV file has no schema, then we 
can skip that file. The same might be true of JSON. What about Parquet? We'd 
have a schema even if there are no rows. Same with JDBC. Should we return this 
schema, even if the data is empty?
    
    Finally, do we need special handling for "null" files: a file that no 
longer exists on disk and so has a completely undefined schema? An undefined 
schema is different than an empty schema. Undefined = "we just don't know what 
the schema might be." Empty = "we know there is a schema and that schema is 
empty." A missing CSV file has an undefined schema. We could argue that a 
0-length (or, a one-byte file with the only content being a newline) is a valid 
file, but it is a file with no columns, hence an empty schema.
    
    Or, should we have a rule that says something like:
    
    1. If a file is missing, or has an undefined schema, skip it.
    2. If a file has a schema (empty or not) and no rows, then:
      * If this is the first file, remember the schema and skip to the next.
      * If this file follows a non-empty file, trigger a schema change.
    4. If a file has a rows, then:
      * If the previous schema was empty, ignore that empty schema.
      * If the previous schema was non-empty, and different, return 
OK_NEW_SCHEMA
      * If the previous schema was the same, return OK.
    5. At EOF:
      * If the previous schema was empty, and the previous file had no rows, 
return that schema.
      * If the previous schema was empty, or no non-empty files, return NONE.
    
    We can reduce these to a simpler set of rules:
    
    * We ignore all empty or missing schemas.
    * We ignore empty files unless they are the only file(s) or they have a 
schema different than the next non-empty file.
    
    Other rules are possible. One is simply to ignore all 0-length files 
regardless of whether the schema is the same or different than surrounding 
files. If all files return 0 records, just return NONE. (These are the rules 
adopted, as of now, by the new scan operator; though we will change them to 
match the rules here one we pin them down.)


> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by 
> empty batch. This JIRA is opened as an umbrella for all those related JIRAS ( 
> such as DRILL-4686, DRILL-4734, DRILL4476, DRILL-4255, etc).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch

Reply via email to