[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002056#comment-16002056
 ] 

Paul Rogers commented on DRILL-5490:


More bizarre. In actual use, {{recordStart}} is, in fact, 0 during the record. 
But, when extracting headers, the header is read, then {{finishRecord()}} is 
called:

{code}
  public void finishRecord() {
recordStart = characterData;
{code}

At this point, {{recordStart}} is set to "record end"...

So, this line actually works:

{code}
if (this.recordStart != characterData) {
  throw new ExecutionSetupException("record text was requested before 
finishing record");
}
{code}

But, it works in the sense that {{recordStart}} and {{characterData}} both 
point to the *end* of the record. If it were not for the issue described in 
DRILL-5491, the logic would be that {{recordStart}} would be zero if we didn't 
finish the record, but we we always do, so the check is a complete no-op.


> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --
>
> Key: DRILL-5490
> URL: https://issues.apache.org/jira/browse/DRILL-5490
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the 
> {{columns}} array, has a member called {{recordStart}} which is supposed to 
> track the memory offset of the start of the record data in the {{columns}} 
> VarChar array. However, the logic for the member is very wrong. First, it is 
> initialized based on an uninitialized variable:
> {{code}}
>   public void startBatch() {
> this.recordStart = characterDataOriginal;
> ...
> loadVarCharDataAddress();
>   }
>   private void loadVarCharDataAddress(){
> ...
> this.characterDataOriginal = buf.memoryAddress();
> ...
>   }
> {{code}}
> Second, the class keeps track of actual memory addresses for the VarChar 
> data. When data would exceed the buffer, memory is reallocated using the 
> following method. But, this method *does not* adjust the {{recordStart}} 
> member, which now points to a bogus memory address:
> {code}
>   private void expandVarCharData(){
> vector.getDataVector().reAlloc();
> long diff = characterData - characterDataOriginal;
> loadVarCharDataAddress();
> characterData += diff;
>   }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the 
> fact that the variable is never used in a way that cause obvious problems 
> (else these issues would have been found sooner.) The only real use is:
> {code}
>   @Override
>   public boolean rowHasData() {
> return recordStart < characterData;
>   }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch(StreamFinishedPseudoException e) {
>   // if we've written part of a field or all of a field, we should send 
> this row.
>   if (fieldsWritten == 0 && !output.rowHasData()) {
> throw e;
>   }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to 
> Minor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016
 ] 

Paul Rogers commented on DRILL-5490:


The other place that {{recordStart}} is used is to get the first row of data 
for DRILL-951, to extract the header. Here again, we are fortunate that one bug 
masks another. The {{RepeatedVarCharOutput}} is never used with headers. 
Consider {{CompliantTextReader.setup()}}:

{code}
  if (settings.isHeaderExtractionEnabled()){
//extract header and use that to setup a set of VarCharVectors
String [] fieldNames = extractHeader();
output = new FieldVarCharOutput(outputMutator, fieldNames, 
getColumns(), isStarQuery());
  } else {
//simply use RepeatedVarCharVector
output = new RepeatedVarCharOutput(outputMutator, getColumns(), 
isStarQuery());
  }
{code}

> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --
>
> Key: DRILL-5490
> URL: https://issues.apache.org/jira/browse/DRILL-5490
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the 
> {{columns}} array, has a member called {{recordStart}} which is supposed to 
> track the memory offset of the start of the record data in the {{columns}} 
> VarChar array. However, the logic for the member is very wrong. First, it is 
> initialized based on an uninitialized variable:
> {{code}}
>   public void startBatch() {
> this.recordStart = characterDataOriginal;
> ...
> loadVarCharDataAddress();
>   }
>   private void loadVarCharDataAddress(){
> ...
> this.characterDataOriginal = buf.memoryAddress();
> ...
>   }
> {{code}}
> Second, the class keeps track of actual memory addresses for the VarChar 
> data. When data would exceed the buffer, memory is reallocated using the 
> following method. But, this method *does not* adjust the {{recordStart}} 
> member, which now points to a bogus memory address:
> {code}
>   private void expandVarCharData(){
> vector.getDataVector().reAlloc();
> long diff = characterData - characterDataOriginal;
> loadVarCharDataAddress();
> characterData += diff;
>   }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the 
> fact that the variable is never used in a way that cause obvious problems 
> (else these issues would have been found sooner.) The only real use is:
> {code}
>   @Override
>   public boolean rowHasData() {
> return recordStart < characterData;
>   }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch(StreamFinishedPseudoException e) {
>   // if we've written part of a field or all of a field, we should send 
> this row.
>   if (fieldsWritten == 0 && !output.rowHasData()) {
> throw e;
>   }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to 
> Minor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)