[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
[ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002056#comment-16002056 ] Paul Rogers commented on DRILL-5490: More bizarre. In actual use, {{recordStart}} is, in fact, 0 during the record. But, when extracting headers, the header is read, then {{finishRecord()}} is called: {code} public void finishRecord() { recordStart = characterData; {code} At this point, {{recordStart}} is set to "record end"... So, this line actually works: {code} if (this.recordStart != characterData) { throw new ExecutionSetupException("record text was requested before finishing record"); } {code} But, it works in the sense that {{recordStart}} and {{characterData}} both point to the *end* of the record. If it were not for the issue described in DRILL-5491, the logic would be that {{recordStart}} would be zero if we didn't finish the record, but we we always do, so the check is a complete no-op. > RepeatedVarCharOutput.recordStart becomes corrupted on realloc > -- > > Key: DRILL-5490 > URL: https://issues.apache.org/jira/browse/DRILL-5490 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The class {{RepeatedVarCharOutput}}, used to read text data into the > {{columns}} array, has a member called {{recordStart}} which is supposed to > track the memory offset of the start of the record data in the {{columns}} > VarChar array. However, the logic for the member is very wrong. First, it is > initialized based on an uninitialized variable: > {{code}} > public void startBatch() { > this.recordStart = characterDataOriginal; > ... > loadVarCharDataAddress(); > } > private void loadVarCharDataAddress(){ > ... > this.characterDataOriginal = buf.memoryAddress(); > ... > } > {{code}} > Second, the class keeps track of actual memory addresses for the VarChar > data. When data would exceed the buffer, memory is reallocated using the > following method. But, this method *does not* adjust the {{recordStart}} > member, which now points to a bogus memory address: > {code} > private void expandVarCharData(){ > vector.getDataVector().reAlloc(); > long diff = characterData - characterDataOriginal; > loadVarCharDataAddress(); > characterData += diff; > } > {code} > Fortunately, like so many bugs in this class, these bugs are offset by the > fact that the variable is never used in a way that cause obvious problems > (else these issues would have been found sooner.) The only real use is: > {code} > @Override > public boolean rowHasData() { > return recordStart < characterData; > } > {code} > Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}: > {code} > } catch(StreamFinishedPseudoException e) { > // if we've written part of a field or all of a field, we should send > this row. > if (fieldsWritten == 0 && !output.rowHasData()) { > throw e; > } > } > {code} > Thus, the bug will cause a failure only if: > * A batch is resized, and > * The new address is lower than the original address, and > * It is the last batch in the file. > Because of the low probability of hitting the bug, the priority is set to > Minor. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
[ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016 ] Paul Rogers commented on DRILL-5490: The other place that {{recordStart}} is used is to get the first row of data for DRILL-951, to extract the header. Here again, we are fortunate that one bug masks another. The {{RepeatedVarCharOutput}} is never used with headers. Consider {{CompliantTextReader.setup()}}: {code} if (settings.isHeaderExtractionEnabled()){ //extract header and use that to setup a set of VarCharVectors String [] fieldNames = extractHeader(); output = new FieldVarCharOutput(outputMutator, fieldNames, getColumns(), isStarQuery()); } else { //simply use RepeatedVarCharVector output = new RepeatedVarCharOutput(outputMutator, getColumns(), isStarQuery()); } {code} > RepeatedVarCharOutput.recordStart becomes corrupted on realloc > -- > > Key: DRILL-5490 > URL: https://issues.apache.org/jira/browse/DRILL-5490 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The class {{RepeatedVarCharOutput}}, used to read text data into the > {{columns}} array, has a member called {{recordStart}} which is supposed to > track the memory offset of the start of the record data in the {{columns}} > VarChar array. However, the logic for the member is very wrong. First, it is > initialized based on an uninitialized variable: > {{code}} > public void startBatch() { > this.recordStart = characterDataOriginal; > ... > loadVarCharDataAddress(); > } > private void loadVarCharDataAddress(){ > ... > this.characterDataOriginal = buf.memoryAddress(); > ... > } > {{code}} > Second, the class keeps track of actual memory addresses for the VarChar > data. When data would exceed the buffer, memory is reallocated using the > following method. But, this method *does not* adjust the {{recordStart}} > member, which now points to a bogus memory address: > {code} > private void expandVarCharData(){ > vector.getDataVector().reAlloc(); > long diff = characterData - characterDataOriginal; > loadVarCharDataAddress(); > characterData += diff; > } > {code} > Fortunately, like so many bugs in this class, these bugs are offset by the > fact that the variable is never used in a way that cause obvious problems > (else these issues would have been found sooner.) The only real use is: > {code} > @Override > public boolean rowHasData() { > return recordStart < characterData; > } > {code} > Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}: > {code} > } catch(StreamFinishedPseudoException e) { > // if we've written part of a field or all of a field, we should send > this row. > if (fieldsWritten == 0 && !output.rowHasData()) { > throw e; > } > } > {code} > Thus, the bug will cause a failure only if: > * A batch is resized, and > * The new address is lower than the original address, and > * It is the last batch in the file. > Because of the low probability of hitting the bug, the priority is set to > Minor. -- This message was sent by Atlassian JIRA (v6.3.15#6346)