[jira] [Commented] (DRILL-4609) Select true,true,true from ... does not always output true,true,true
[ https://issues.apache.org/jira/browse/DRILL-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002122#comment-16002122 ]

Kunal Khatua commented on DRILL-4609:
-------------------------------------

[~karthikm] Do you think there is an optimization bug in JDK 1.8 for this?

> Select true,true,true from ... does not always output true,true,true
> --------------------------------------------------------------------
>
>                 Key: DRILL-4609
>                 URL: https://issues.apache.org/jira/browse/DRILL-4609
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Client - CLI, Query Planning & Optimization, Storage - Writer
>    Affects Versions: 1.5.0, 1.6.0
>         Environment: Linux Redhat; tested in cluster (hdfs) and embedded mode
>            Reporter: F Méthot
>
> Doing a simple "select true, true, true from table" won't output
> true,true,true on all generated rows.
> Steps to reproduce:
> Generate a simple CSV file:
> {code:bash}
> for i in {1..100}; do echo "Allo"; done > /users/fmethot/test.csv
> {code}
> Open a fresh Drill CLI session. To help with validation, switch the output
> format to CSV:
> {code:sql}
> alter session set `store.format` = 'csv';
> {code}
> Generate a table like this:
> {code:sql}
> create table TEST_OUT as (select true, true, true, true from
> dfs.`/users/fmethot/test.csv`);
> {code}
> Check the content of the generated TEST_OUT table: you will find false values
> in there!
> If you generate another table the same way, in the same session, chances are
> the values will be fine (all true). We can only reproduce this on the first
> CTAS run.
> We came to test this select pattern after we realized that our custom boolean
> UDFs (as well as ones provided in Drill, such as "ilike") were not producing
> consistent, deterministic results (the same input was implausibly generating
> random boolean output). We hope that fixing this ticket will also fix our
> issue with boolean UDFs.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (DRILL-5300) SYSTEM ERROR: IllegalStateException: Memory was leaked by query while querying parquet files
[ https://issues.apache.org/jira/browse/DRILL-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002117#comment-16002117 ]

Kunal Khatua commented on DRILL-5300:
-------------------------------------

[~mgelbana] Did Drill 1.10 resolve this?

> SYSTEM ERROR: IllegalStateException: Memory was leaked by query while
> querying parquet files
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5300
>                 URL: https://issues.apache.org/jira/browse/DRILL-5300
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>         Environment: OS: Linux
>            Reporter: Muhammad Gelbana
>         Attachments: both_queries_logs.zip
>
> Running the following query against parquet files (I modified some values for
> privacy reasons):
> {code:title=Query causing the long logs|borderStyle=solid}
> SELECT AL4.NAME, AL5.SEGMENT2, SUM(AL1.AMOUNT), AL2.ATTRIBUTE4,
>        AL2.XXX__CODE, AL8.D_BU, AL8.F_PL, AL18.COUNTRY, AL13.COUNTRY,
>        AL11.NAME
> FROM dfs.`/disk2/XXX/XXX//data/../parquet/XXX_XX/RA__TRX_LINE_GL_DIST_ALL` AL1,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_XX/RA_OMER_TRX_ALL` AL2,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_FIN_COMMON/GL_XXX` AL3,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_HR_COMMON/HR_ALL_ORGANIZATION_UNITS` AL4,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_FIN_COMMON/GL_CODE_COMBINATIONS` AL5,
>      dfs.`/disk2/XXX/XXX//data/../parquet//XXAT_AR_MU_TAB` AL8,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_FIN_COMMON/GL_XXX` AL11,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_X_S` AL12,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_LOCATIONS` AL13,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___S_ALL` AL14,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___USES_ALL` AL15,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___S_ALL` AL16,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___USES_ALL` AL17,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_LOCATIONS` AL18,
>      dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_X_S` AL19
> WHERE (AL2.SHIP_TO__USE_ID = AL15._USE_ID
>   AND AL15.___ID = AL14.___ID
>   AND AL14.X__ID = AL12.X__ID
>   AND AL12.LOCATION_ID = AL13.LOCATION_ID
>   AND AL17.___ID = AL16.___ID
>   AND AL16.X__ID = AL19.X__ID
>   AND AL19.LOCATION_ID = AL18.LOCATION_ID
>   AND AL2.BILL_TO__USE_ID = AL17._USE_ID
>   AND AL2.SET_OF_X_ID = AL3.SET_OF_X_ID
>   AND AL1.CODE_COMBINATION_ID = AL5.CODE_COMBINATION_ID
>   AND AL5.SEGMENT4 = AL8.MU
>   AND AL1.SET_OF_X_ID = AL11.SET_OF_X_ID
>   AND AL2.ORG_ID = AL4.ORGANIZATION_ID
>   AND AL2.OMER_TRX_ID = AL1.OMER_TRX_ID)
>   AND ((AL5.SEGMENT2 = '41'
>   AND AL1.AMOUNT <> 0
>   AND AL4.NAME IN ('XXX-XX-', 'XXX-XX-', 'XXX-XX-', 'XXX-XX-',
>                    'XXX-XX-', 'XXX-XX-', 'XXX-XX-', 'XXX-XX-', 'XXX-XX-')
>   AND AL3.NAME like '%-PR-%'))
> GROUP BY AL4.NAME, AL5.SEGMENT2, AL2.ATTRIBUTE4, AL2.XXX__CODE,
>          AL8.D_BU, AL8.F_PL, AL18.COUNTRY, AL13.COUNTRY, AL11.NAME
> {code}
> {code:title=Query causing the short logs|borderStyle=solid}
> SELECT AL11.NAME
> FROM dfs.`/XXX/XXX/XXX/data/../parquet/XXX_XXX_COMMON/GL_XXX`
> LIMIT 10
> {code}
> This issue may be a duplicate of [this
> one|https://issues.apache.org/jira/browse/DRILL-4398], but I created a new one
> based on [this
> suggestion|https://issues.apache.org/jira/browse/DRILL-4398?focusedCommentId=15884846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15884846].
[jira] [Commented] (DRILL-5492) CSV reader does not validate header names, causes nonsense output
[ https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002080#comment-16002080 ]

Paul Rogers commented on DRILL-5492:
------------------------------------

Could not readily find a way to cause serious harm with this bug, so marked it
as Minor.

> CSV reader does not validate header names, causes nonsense output
> -----------------------------------------------------------------
>
>                 Key: DRILL-5492
>                 URL: https://issues.apache.org/jira/browse/DRILL-5492
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider the same test case as in DRILL-5491, but with a slightly different
> input file:
> {code}
> ___
> a,b,c
> d,e,f
> {code}
> The underscores represent three spaces: use spaces in the real test.
> In this case, the code discussed in DRILL-5491 finds some characters and
> happily returns the following array:
> {code}
> ["   "]
> {code}
> The field name of three blanks is returned to the client to produce the
> following bizarre output:
> {code}
> 2 row(s):
>
> a
> d
> {code}
> The blank line is normally the header, but the header here was considered to
> be three blanks. (In fact, the blanks are actually printed.)
> Since the blanks were considered to be a field, the file is assumed to have
> only one field, so only the first column was returned.
> The expected behavior is that spaces are trimmed from field names, so the
> field name list would be empty and a User Error thrown. (That is, it is
> confusing to the user that a blank line sometimes produces an NPE, sometimes
> produces the {{ExecutionSetupException}} shown in DRILL-5491, and sometimes
> produces blank headings. Behavior should be consistent.)
[jira] [Updated] (DRILL-5492) CSV reader does not validate header names, causes nonsense output
[ https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5492:
-------------------------------
    Summary: CSV reader does not validate header names, causes nonsense output
             (was: CSV with spaces for header uses spaces as field name)
[jira] [Commented] (DRILL-5492) CSV with spaces for header uses spaces as field name
[ https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002078#comment-16002078 ]

Paul Rogers commented on DRILL-5492:
------------------------------------

Even more fun. Input file:
{code}
EXPR$0,EXPR$1,EXPR$2
a,b,c
d,e,f
{code}
Output:
{code}
2 row(s):
EXPR$0,EXPR$1,EXPR$2
a,b,c
d,e,f
{code}
Then, use this SQL:
{code}
SELECT `EXPR$1` as A, 1 + 2, 3 + 4 FROM `dfs.data`.`csv/test11.csv`
{code}
To get this output:
{code}
A,EXPR$1,EXPR$2
b,3,7
e,3,7
{code}
And this:
{code}
SELECT `EXPR$1`, 1 + 2, 3 + 4 FROM `dfs.data`.`csv/test11.csv`
{code}
Produces this:
{code}
2 row(s):
EXPR$1,EXPR$10,EXPR$11
b,3,7
e,3,7
{code}
At least Drill was smart enough to remap the first expression to EXPR$10 to
avoid the name conflict. Still, it would be better to disallow Drill-internal
names as column headers.
[jira] [Commented] (DRILL-5492) CSV with spaces for header uses spaces as field name
[ https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002064#comment-16002064 ]

Paul Rogers commented on DRILL-5492:
------------------------------------

The fun never ends. Now try this file:
{code}
 *@! , )( , -_-
a,b,c
d,e,f
{code}
There are spaces before and after each field. The code returns the following
as column headers:
{code}
[" *@! "," )( "," -_- "]
{code}
With the following as the output:
{code}
2 row(s):
 *@! , )( , -_-
a,b,c
d,e,f
{code}
Note that the non-symbol characters are allowed as names, and that the
whitespace in the "names" is preserved. Expected whitespace to be removed, to
allow headings of the form:
{code}
first, second, third
{code}
Expected Drill to reject field names that are not valid Drill symbols, or to
convert the names to some benign form.
[jira] [Commented] (DRILL-5492) CSV with spaces for header uses spaces as field name
[ https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002062#comment-16002062 ]

Paul Rogers commented on DRILL-5492:
------------------------------------

A variation on this bug. Consider this input file:
{code}
,__,__
a,b,c
d,e,f
{code}
Where, again, the underscores represent spaces. The header parser returns:
{code}
[" "," "," "]
{code}
That is, three fields, each some number of spaces. And the query output is:
{code}
,
a,c
d,f
{code}
Strangely, we pass three field headings into the {{FieldVarCharOutput}}
constructor (all strings of blanks), but we only get the first and third
fields in the output. Expected the query to simply fail due to blank headers.
[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
[ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002056#comment-16002056 ]

Paul Rogers commented on DRILL-5490:
------------------------------------

More bizarre. In actual use, {{recordStart}} is, in fact, 0 during the record.
But, when extracting headers, the header is read, then {{finishRecord()}} is
called:
{code}
public void finishRecord() {
  recordStart = characterData;
{code}
At this point, {{recordStart}} is set to the record end. So, this line
actually works:
{code}
if (this.recordStart != characterData) {
  throw new ExecutionSetupException("record text was requested before finishing record");
}
{code}
But it works only in the sense that {{recordStart}} and {{characterData}} both
point to the *end* of the record. If it were not for the issue described in
DRILL-5491, the logic would be that {{recordStart}} is zero if we didn't
finish the record; but we always do, so the check is a complete no-op.

> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --------------------------------------------------------------
>
>                 Key: DRILL-5490
>                 URL: https://issues.apache.org/jira/browse/DRILL-5490
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the
> {{columns}} array, has a member called {{recordStart}} which is supposed to
> track the memory offset of the start of the record data in the {{columns}}
> VarChar array. However, the logic for the member is very wrong. First, it is
> initialized from a not-yet-initialized variable:
> {code}
> public void startBatch() {
>   this.recordStart = characterDataOriginal;
>   ...
>   loadVarCharDataAddress();
> }
> private void loadVarCharDataAddress() {
>   ...
>   this.characterDataOriginal = buf.memoryAddress();
>   ...
> }
> {code}
> Second, the class keeps track of actual memory addresses for the VarChar
> data. When data would exceed the buffer, memory is reallocated using the
> following method. But this method *does not* adjust the {{recordStart}}
> member, which now points to a bogus memory address:
> {code}
> private void expandVarCharData() {
>   vector.getDataVector().reAlloc();
>   long diff = characterData - characterDataOriginal;
>   loadVarCharDataAddress();
>   characterData += diff;
> }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the
> fact that the variable is never used in a way that causes obvious problems
> (else these issues would have been found sooner). The only real use is:
> {code}
> @Override
> public boolean rowHasData() {
>   return recordStart < characterData;
> }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch (StreamFinishedPseudoException e) {
>   // if we've written part of a field or all of a field, we should send this row.
>   if (fieldsWritten == 0 && !output.rowHasData()) {
>     throw e;
>   }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to
> Minor.
[jira] [Created] (DRILL-5492) CSV with spaces for header uses spaces as field name
Paul Rogers created DRILL-5492:
----------------------------------

             Summary: CSV with spaces for header uses spaces as field name
                 Key: DRILL-5492
                 URL: https://issues.apache.org/jira/browse/DRILL-5492
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Paul Rogers
            Priority: Minor
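The fix DRILL-5492 asks for (trim whitespace from parsed header names, and raise a user error instead of accepting blank names) can be sketched as a small stand-alone helper. The class name, method name, and exception type below are illustrative, not Drill's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the header validation the ticket requests:
// trim each parsed header name and reject names that are empty after
// trimming, rather than silently producing blank column names.
public class HeaderValidator {

  public static String[] validateHeaders(String[] rawHeaders) {
    List<String> cleaned = new ArrayList<>();
    for (String h : rawHeaders) {
      String trimmed = (h == null) ? "" : h.trim();
      if (trimmed.isEmpty()) {
        // A blank header should be a clear user error, not a blank column.
        throw new IllegalArgumentException("CSV header contains an empty field name");
      }
      cleaned.add(trimmed);
    }
    if (cleaned.isEmpty()) {
      throw new IllegalArgumentException("CSV header line is blank");
    }
    return cleaned.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Padded names are trimmed rather than kept verbatim.
    String[] ok = validateHeaders(new String[] {" first ", " second ", " third "});
    System.out.println(String.join(",", ok));

    // A header of only spaces (the bug's trigger) is rejected.
    try {
      validateHeaders(new String[] {"   "});
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With this shape, the " *@! " and three-blanks cases above would either come back trimmed or fail with a descriptive message, giving the consistent behavior the ticket asks for.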
[jira] [Created] (DRILL-5491) NPE when reading a CSV file, with headers, but blank header line
Paul Rogers created DRILL-5491:
----------------------------------

             Summary: NPE when reading a CSV file, with headers, but blank header line
                 Key: DRILL-5491
                 URL: https://issues.apache.org/jira/browse/DRILL-5491
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Paul Rogers

See DRILL-5490 for background. Try this unit test case:
{code}
FixtureBuilder builder = ClusterFixture.builder()
    .maxParallelization(1);
try (ClusterFixture cluster = builder.build();
     ClientFixture client = cluster.clientFixture()) {
  TextFormatConfig csvFormat = new TextFormatConfig();
  csvFormat.fieldDelimiter = ',';
  csvFormat.skipFirstLine = false;
  csvFormat.extractHeader = true;
  cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
  String sql = "SELECT * FROM `dfs.data`.`csv/test7.csv`";
  client.queryBuilder().sql(sql).printCsv();
}
{code}
The test can also be run as a query using your favorite client. Using this
input file:
{code}

a,b,c
d,e,f
{code}
(The first line is blank.) The following is the result:
{code}
Exception (no rows returned):
org.apache.drill.common.exceptions.UserRemoteException:
SYSTEM ERROR: NullPointerException
{code}
The {{RepeatedVarCharOutput}} class tries (but fails, for the reasons outlined
in DRILL-5490) to detect this case. The code crashes here, in
{{CompliantTextRecordReader.extractHeader()}}:
{code}
String[] fieldNames = ((RepeatedVarCharOutput) hOutput).getTextOutput();
{code}
because of bad code in {{RepeatedVarCharOutput.getTextOutput()}}:
{code}
public String[] getTextOutput() throws ExecutionSetupException {
  if (recordCount == 0 || fieldIndex == -1) {
    return null;
  }
  if (this.recordStart != characterData) {
    throw new ExecutionSetupException("record text was requested before finishing record");
  }
{code}
Since there is no text on the line, special code elsewhere (see DRILL-5490)
elects not to increment {{recordCount}}. (BTW: {{recordCount}} is the total
across-batch count; the in-batch count, {{batchIndex}}, was probably wanted
here.) Since the count is zero, we return null. The author probably thought
we'd get a zero-length record instead, in which case the if-statement would
throw an exception. But see DRILL-5490 for why that check does not actually
work. The result is one bug (not incrementing the record count) triggering
another (returning a null), which masks a third ({{recordStart}} is not set
correctly, so the exception would not be thrown). All that bad code is just
fun and games until we get an NPE, however.
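The NPE in DRILL-5491 comes from using the null returned by {{getTextOutput()}} without checking it. A minimal guard at the call site would turn the crash into a clear user error; the class, method, and exception below are an illustrative sketch, not Drill's actual code:

```java
// Hypothetical sketch: guard the result of header extraction so a blank
// header line produces a descriptive error instead of a downstream NPE.
public class HeaderGuard {

  public static String[] requireHeader(String[] extractedNames) {
    if (extractedNames == null || extractedNames.length == 0) {
      // This is the DRILL-5491 case: extraction yielded nothing.
      throw new IllegalStateException(
          "extractHeader is enabled but the file's header line is blank");
    }
    return extractedNames;
  }

  public static void main(String[] args) {
    // Normal case: a real header passes through unchanged.
    String[] names = requireHeader(new String[] {"a", "b", "c"});
    System.out.println(names.length + " columns");

    // Blank-header case: a null result is rejected with a clear message.
    try {
      requireHeader(null);
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```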
[jira] [Comment Edited] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
[ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016 ]

Paul Rogers edited comment on DRILL-5490 at 5/9/17 4:46 AM:
------------------------------------------------------------

The other place that {{recordStart}} is used is to get the first row of data
for DRILL-951, to extract the header. Consider
{{CompliantTextReader.extractHeader()}}:
{code}
HeaderOutputMutator hOutputMutator = new HeaderOutputMutator();
TextOutput hOutput = new RepeatedVarCharOutput(hOutputMutator, getColumns(), true);
this.allocate(hOutputMutator.fieldVectorMap);
...
// extract first row only
reader.parseNext();
// grab the field names from output
String[] fieldNames = ((RepeatedVarCharOutput) hOutput).getTextOutput();
{code}
The only way the above can work is that we never call
{{RepeatedVarCharOutput.startBatch()}}, so {{recordStart}} defaults to 0, and
the zero-record check accidentally works:
{code}
if (this.recordStart != characterData) {
  throw new ExecutionSetupException("record text was requested before finishing record");
}
{code}
It is likely, however, that an empty header line would not be caught, so the
above exception would never actually fire.

was (Author: paul-rogers):
The other place that {{recordStart}} is used is to get the first row of data
for DRILL-951, to extract the header. Here again, we are fortunate that one
bug masks another. The {{RepeatedVarCharOutput}} is never used with headers.
Consider {{CompliantTextReader.setup()}}:
{code}
if (settings.isHeaderExtractionEnabled()) {
  // extract header and use that to setup a set of VarCharVectors
  String[] fieldNames = extractHeader();
  output = new FieldVarCharOutput(outputMutator, fieldNames, getColumns(), isStarQuery());
} else {
  // simply use RepeatedVarCharVector
  output = new RepeatedVarCharOutput(outputMutator, getColumns(), isStarQuery());
}
{code}
[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
[ https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016 ]

Paul Rogers commented on DRILL-5490:
------------------------------------

The other place that {{recordStart}} is used is to get the first row of data
for DRILL-951, to extract the header. Here again, we are fortunate that one
bug masks another. The {{RepeatedVarCharOutput}} is never used with headers.
Consider {{CompliantTextReader.setup()}}:
{code}
if (settings.isHeaderExtractionEnabled()) {
  // extract header and use that to setup a set of VarCharVectors
  String[] fieldNames = extractHeader();
  output = new FieldVarCharOutput(outputMutator, fieldNames, getColumns(), isStarQuery());
} else {
  // simply use RepeatedVarCharVector
  output = new RepeatedVarCharOutput(outputMutator, getColumns(), isStarQuery());
}
{code}
[jira] [Created] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc
Paul Rogers created DRILL-5490: -- Summary: RepeatedVarCharOutput.recordStart becomes corrupted on realloc Key: DRILL-5490 URL: https://issues.apache.org/jira/browse/DRILL-5490 Project: Apache Drill Issue Type: Bug Affects Versions: 1.10.0 Reporter: Paul Rogers Priority: Minor The class {{RepeatedVarCharOutput}}, used to read text data into the {{columns}} array, has a member called {{recordStart}} which is supposed to track the memory offset of the start of the record data in the {{columns}} VarChar array. However, the logic for the member is very wrong. First, it is initialized based on an uninitialized variable: {{code}} public void startBatch() { this.recordStart = characterDataOriginal; ... loadVarCharDataAddress(); } private void loadVarCharDataAddress(){ ... this.characterDataOriginal = buf.memoryAddress(); ... } {{code}} Second, the class keeps track of actual memory addresses for the VarChar data. When data would exceed the buffer, memory is reallocated using the following method. But, this method *does not* adjust the {{recordStart}} member, which now points to a bogus memory address: {code} private void expandVarCharData(){ vector.getDataVector().reAlloc(); long diff = characterData - characterDataOriginal; loadVarCharDataAddress(); characterData += diff; } {code} Fortunately, like so many bugs in this class, these bugs are offset by the fact that the variable is never used in a way that cause obvious problems (else these issues would have been found sooner.) The only real use is: {code} @Override public boolean rowHasData() { return recordStart < characterData; } {code} Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}: {code} } catch(StreamFinishedPseudoException e) { // if we've written part of a field or all of a field, we should send this row. 
if (fieldsWritten == 0 && !output.rowHasData()) { throw e; } } {code} Thus, the bug will cause a failure only if: * A batch is resized, and * The new address is lower than the original address, and * It is the last batch in the file. Because of the low probability of hitting the bug, the priority is set to Minor. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
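The fix this report implies is that {{expandVarCharData()}} must rebase {{recordStart}} by the same delta as {{characterData}}. Below is a hedged, standalone sketch of that rebasing, not the actual Drill code: the field and method names merely echo the ones quoted above, and plain {{long}} values stand in for real buffer addresses.

```java
// Standalone sketch of the rebasing DRILL-5490 calls for: when the buffer
// moves on realloc, every absolute address saved from the old buffer,
// including recordStart, must shift by the same delta.
public class ReallocRebase {
    long bufferBase;    // models characterDataOriginal
    long characterData; // current write position (absolute address)
    long recordStart;   // start of the current record (absolute address)

    ReallocRebase(long base) {
        bufferBase = base;
        characterData = base;
        recordStart = base;
    }

    // Models a fixed expandVarCharData(): rebases BOTH saved addresses.
    void onRealloc(long newBase) {
        long diff = newBase - bufferBase;
        characterData += diff;
        recordStart += diff;   // the adjustment the real method omits
        bufferBase = newBase;
    }

    boolean rowHasData() { return recordStart < characterData; }

    public static void main(String[] args) {
        ReallocRebase out = new ReallocRebase(0x1000);
        out.characterData += 16;   // wrote 16 bytes of the current row
        out.onRealloc(0x0800);     // new buffer lands at a LOWER address
        System.out.println("rowHasData=" + out.rowHasData());
    }
}
```

Note the demo uses a new base *below* the old one, the exact case the JIRA flags: without the {{recordStart}} rebase, the stale (higher) address would make {{rowHasData()}} return false and drop the last partial row at EOF.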
[jira] [Commented] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor
[ https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001908#comment-16001908 ] Paul Rogers commented on DRILL-5489: Similarly, the {{RepeatedVarCharOutput}} class does not protect itself from a file with more than 64K fields on input: {code} @Override public void startField(int index) { fieldIndex = index; collect = collectedFields[index]; fieldOpen = true; } {code} Here, the parser counts the fields and calls {{startField}} for each. If the field is 65537, then the above method will check the {{collectedFields}} to determine if the field is wanted. But, that array is hard-coded at 65536 entries, so the code will fail with an array out-of-bounds exception. Instead, the code should simply discard extra fields, or throw a {{UserException}} to report too many fields. > Unprotected array access in RepeatedVarCharOutput ctor > -- > > Key: DRILL-5489 > URL: https://issues.apache.org/jira/browse/DRILL-5489 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > Suppose a user runs a query of form: > {code} > SELECT columns[7] FROM `dfs`.`mycsv.csv` > {code} > Internally, this will create a {{PathSegment}} to represent the selected > column. This is passed into the {{RepeatedVarCharOutput}} constructor where > it is used to set a flag in an array of 64K booleans. But, while the code is > very diligent about making sure that the column name is "columns" and that the > path segment is an array, it does not check the array value. Instead: > {code} > for(Integer i : columnIds){ > ... > fields[i] = true; > } > {code} > We need to add a bounds check to reject array indexes that are not valid: > negative or above 64K. It may be that the code further up the hierarchy does > the checks. But, if so, it should do the other checks as well. Leaving the > checks incomplete is confusing. 
> The result: > {code} > Exception (no rows returned): > org.apache.drill.common.exceptions.UserRemoteException: > SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
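The bounds check the report asks for can be sketched as follows. This is a hypothetical standalone demo, not the Drill constructor: {{MAX_FIELDS}} and {{markSelected}} are invented names modeling the hard-coded 64K flag array, and a plain {{IllegalArgumentException}} stands in for Drill's {{UserException}}.

```java
// Sketch of the missing bounds check from DRILL-5489: reject any columns[]
// index that is negative or would overflow the 64K flag array, instead of
// letting it crash with ArrayIndexOutOfBoundsException deep in the ctor.
public class ColumnIdCheck {
    static final int MAX_FIELDS = 65536;   // models the 64K boolean array

    static boolean[] markSelected(int[] columnIds) {
        boolean[] fields = new boolean[MAX_FIELDS];
        for (int i : columnIds) {
            // Validate BEFORE indexing into the flag array.
            if (i < 0 || i >= MAX_FIELDS) {
                throw new IllegalArgumentException(
                    "Invalid columns index: " + i);
            }
            fields[i] = true;
        }
        return fields;
    }

    public static void main(String[] args) {
        boolean[] fields = markSelected(new int[] {7});  // SELECT columns[7]
        System.out.println("columns[7] selected: " + fields[7]);
    }
}
```

With the check in place, a query like {{SELECT columns[70000]}} fails with a clear message at plan time rather than the raw {{ArrayIndexOutOfBoundsException: 7}}-style system error shown above.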
[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor
[ https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5489: --- Description: Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent of making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. The result: {code} Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7 {code} was: Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent of making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. 
While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} This will muck up the text reader. The text reader "shortcuts" the rest of the line if it need not read any more fields. In {{TextReader.parseRecord()}}: {code} earlyTerm = ! parseField(); ... if (earlyTerm) { if (ch != newLine) { input.skipLines(1); } break; } {code} And: {code} private final boolean parseField() throws IOException { ... return output.endField(); {code} And, in {{RepeatedVarCharOutput.endField()}}: {code} public boolean endField() { ... return fieldIndex < maxField; } {code} The only reason that this does not cause data loss is the insertion of an unnecessary sort earlier in the constructor: {code} Collections.sort(columnIds); ... for (Integer i : columnIds){ {code} We can remove the sort and compute the maxField correctly and get the same result with simpler code. > Unprotected array access in RepeatedVarCharOutput ctor > -- > > Key: DRILL-5489 > URL: https://issues.apache.org/jira/browse/DRILL-5489 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > Suppose a user runs a query of form: > {code} > SELECT columns[7] FROM `dfs`.`mycsv.csv` > {code} > Internally, this will create a {{PathSegment}} to represent the selected > column. This is passed into the {{RepeatedVarCharOutput}} constructor where > it is used to set a flag in an array of 64K booleans. But, while the code is > very diligent about making sure that the column name is "columns" and that the > path segment is an array, it does not check the array value. Instead: > {code} > for(Integer i : columnIds){ > ... 
> fields[i] = true; > } > {code} > We need to add a bounds check to reject array indexes that are not valid: > negative or above 64K. It may be that the code further up the hierarchy does > the checks. But, if so, it should do the other checks as well. Leaving the > checks incomplete is confusing. > The result: > {code} > Exception (no rows returned): > org.apache.drill.common.exceptions.UserRemoteException: > SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor
[ https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001904#comment-16001904 ] Paul Rogers commented on DRILL-5489: While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} This will muck up the text reader. The text reader "shortcuts" the rest of the line if it need not read any more fields. In {{TextReader.parseRecord()}}: {code} earlyTerm = ! parseField(); ... if (earlyTerm) { if (ch != newLine) { input.skipLines(1); } break; } {code} And: {code} private final boolean parseField() throws IOException { ... return output.endField(); {code} And, in {{RepeatedVarCharOutput.endField()}}: {code} public boolean endField() { ... return fieldIndex < maxField; } {code} The only reason that this does not cause data loss is the insertion of an unnecessary sort earlier in the constructor: {code} Collections.sort(columnIds); ... for (Integer i : columnIds){ {code} We can remove the sort and compute the maxField correctly and get the same result with simpler code. > Unprotected array access in RepeatedVarCharOutput ctor > -- > > Key: DRILL-5489 > URL: https://issues.apache.org/jira/browse/DRILL-5489 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > Suppose a user runs a query of form: > {code} > SELECT columns[7] FROM `dfs`.`mycsv.csv` > {code} > Internally, this will create a {{PathSegment}} to represent the selected > column. This is passed into the {{RepeatedVarCharOutput}} constructor where > it is used to set a flag in an array of 64K booleans. 
But, while the code is > very diligent of making sure that the column name is "columns" and that the > path segment is an array, it does not check the array value. Instead: > {code} > for(Integer i : columnIds){ > ... > fields[i] = true; > } > {code} > We need to add a bounds check to reject array indexes that are not valid: > negative or above 64K. It may be that the code further up the hierarchy does > the checks. But, if so, it should do the other checks as well. Leaving the > checks incomplete is confusing. > The result: > {code} > Exception (no rows returned): > org.apache.drill.common.exceptions.UserRemoteException: > SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
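The maxField bug described above is easy to demonstrate side by side. This is a hedged, standalone sketch with invented method names ({{buggyMaxField}}, {{fixedMaxField}}), not the Drill constructor itself: resetting {{maxField}} inside the loop keeps only the LAST column id, while hoisting the initialization out of the loop keeps the true maximum.

```java
// Demo of the DRILL-5489 maxField computation: buggy vs. fixed.
public class MaxFieldDemo {
    static int buggyMaxField(int[] columnIds) {
        int maxField = 0;
        for (int i : columnIds) {
            maxField = 0;                     // bug: reset every iteration,
            maxField = Math.max(maxField, i); // so only the last id survives
        }
        return maxField;
    }

    static int fixedMaxField(int[] columnIds) {
        int maxField = 0;                     // initialize once, outside
        for (int i : columnIds) {
            maxField = Math.max(maxField, i);
        }
        return maxField;
    }

    public static void main(String[] args) {
        int[] ids = {20, 1};  // SELECT columns[20], columns[1] FROM ...
        System.out.println("buggy=" + buggyMaxField(ids)
            + " fixed=" + fixedMaxField(ids));
    }
}
```

On a sorted id list the two versions agree, which is exactly why the {{Collections.sort()}} call in the constructor masks the bug; the unsorted {20, 1} case is where they diverge.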
[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor
[ https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5489: --- Description: Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent of making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} This will muck up the text reader. The text reader "shortcuts" the rest of the line if it need not read any more fields. In {{TextReader.parseRecord()}}: {code} earlyTerm = ! parseField(); ... if (earlyTerm) { if (ch != newLine) { input.skipLines(1); } break; } {code} And: {code} private final boolean parseField() throws IOException { ... return output.endField(); {code} And, in {{RepeatedVarCharOutput.endField()}}: {code} public boolean endField() { ... return fieldIndex < maxField; } {code} The only reason that this does not cause data loss is the insertion of an unnecessary sort earlier in the constructor: {code} Collections.sort(columnIds); ... 
for (Integer i : columnIds){ {code} We can remove the sort and compute the maxField correctly and get the same result with simpler code. was: Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent about making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} This will muck up the text reader. The text reader "shortcuts" the rest of the line if it need not read any more fields. In {{TextReader.parseRecord()}}: {code} earlyTerm = ! parseField(); ... if (earlyTerm) { if (ch != newLine) { input.skipLines(1); } break; } {code} And: {code} private final boolean parseField() throws IOException { ... return output.endField(); {code} And, in {{RepeatedVarCharOutput.endField()}}: {code} public boolean endField() { ... 
return fieldIndex < maxField; } {code} > Unprotected array access in RepeatedVarCharOutput ctor > -- > > Key: DRILL-5489 > URL: https://issues.apache.org/jira/browse/DRILL-5489 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > Suppose a user runs a query of form: > {code} > SELECT columns[7] FROM `dfs`.`mycsv.csv` > {code} > Internally, this will create a {{PathSegment}} to represent the selected > column. This is passed into the {{RepeatedVarCharOutput}} constructor where > it is used to set a flag in an array of 64K booleans. But, while the code is > very diligent of making sure that the column name is "c
[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor
[ https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5489: --- Description: Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent of making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} This will muck up the text reader. The text reader "shortcuts" the rest of the line if it need not read any more fields. In {{TextReader.parseRecord()}}: {code} earlyTerm = ! parseField(); ... if (earlyTerm) { if (ch != newLine) { input.skipLines(1); } break; } {code} And: {code} private final boolean parseField() throws IOException { ... return output.endField(); {code} And, in {{RepeatedVarCharOutput.endField()}}: {code} public boolean endField() { ... return fieldIndex < maxField; } {code} was: Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. 
This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent of making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. > Unprotected array access in RepeatedVarCharOutput ctor > -- > > Key: DRILL-5489 > URL: https://issues.apache.org/jira/browse/DRILL-5489 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > Suppose a user runs a query of form: > {code} > SELECT columns[7] FROM `dfs`.`mycsv.csv` > {code} > Internally, this will create a {{PathSegment}} to represent the selected > column. This is passed into the {{RepeatedVarCharOutput}} constructor where > it is used to set a flag in an array of 64K booleans. But, while the code is > very diligent of making sure that the column name is "columns" and that the > path segment is an array, it does not check the array value. Instead: > {code} > for(Integer i : columnIds){ > ... > fields[i] = true; > } > {code} > We need to add a bounds check to reject array indexes that are not valid: > negative or above 64K. It may be that the code further up the hierarchy does > the checks. But, if so, it should do the other checks as well. 
Leaving the > checks incomplete is confusing. > While we are at it, we might as well fix another bit of bogus code: > {code} > for(Integer i : columnIds){ > maxField = 0; > maxField = Math.max(maxField, i); > ... > } > {code} > The code to compute maxField simply uses the last value, not the maximum > value. This will be thrown off by a query of the form: > {code} > SELECT columns[20], columns[1] FROM ... > {code} > This will muck up the text
[jira] [Created] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor
Paul Rogers created DRILL-5489: -- Summary: Unprotected array access in RepeatedVarCharOutput ctor Key: DRILL-5489 URL: https://issues.apache.org/jira/browse/DRILL-5489 Project: Apache Drill Issue Type: Bug Affects Versions: 1.10.0 Reporter: Paul Rogers Priority: Minor Suppose a user runs a query of form: {code} SELECT columns[7] FROM `dfs`.`mycsv.csv` {code} Internally, this will create a {{PathSegment}} to represent the selected column. This is passed into the {{RepeatedVarCharOutput}} constructor where it is used to set a flag in an array of 64K booleans. But, while the code is very diligent of making sure that the column name is "columns" and that the path segment is an array, it does not check the array value. Instead: {code} for(Integer i : columnIds){ ... fields[i] = true; } {code} We need to add a bounds check to reject array indexes that are not valid: negative or above 64K. While we are at it, we might as well fix another bit of bogus code: {code} for(Integer i : columnIds){ maxField = 0; maxField = Math.max(maxField, i); ... } {code} The code to compute maxField simply uses the last value, not the maximum value. This will be thrown off by a query of the form: {code} SELECT columns[20], columns[1] FROM ... {code} It may be that the code further up the hierarchy does the checks. But, if so, it should do the other checks as well. Leaving the checks incomplete is confusing. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001784#comment-16001784 ] Paul Rogers commented on DRILL-5486: The net result seems to be that the per-batch row count tracking and setting in {{BaseRepeatedValueVector}} is buggy, but irrelevant. > Compliant text reader tries, but fails, to ignore empty rows > > > Key: DRILL-5486 > URL: https://issues.apache.org/jira/browse/DRILL-5486 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The "compliant" text reader (for CSV files, etc.) has two modes: reading with > headers (which creates a scalar vector per file column), or without headers > (which creates a single {{columns}} vector with an array of all columns). > When run in array mode, the code uses a class called > {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: > {code} > public void finishRecord() { > ... > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > ... > {code} > As it turns out, this works only on the *first* row. On the first row, the > {{fieldIndex}} has its initial value of -1. But, for subsequent records, > {{fieldIndex}} is never reset and retains its value from the last field set. > The result is that the code skips the first empty row, but not empty rows > elsewhere in the file. > Further, on the first row, the logic is flawed. The part shown above is only > for the row counts. Here is more of the logic: > {code} > @Override > public void finishRecord() { > recordStart = characterData; > ... > int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; > PlatformDependent.putInt(repeatedOffset, newOffset); > repeatedOffset += 4; > // if there were no defined fields, skip. 
> if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > } > {code} > Note that the underlying vector *is* bumped for the record, even though the > row count is not. Later, this will cause the container to have one less > record. > This would be a problem if the row count ({{batchIndex}}) was used for the > batch row count. But, it is not: the next level of code keeps a separate > count, so the "skip empty row" logic causes the counts to get out of sync > between the {{RepeatedVarCharOutput}} and the next level of reader. > In a sense, there are multiple self-cancelling bugs: > * The skip-empty-row logic is not synchronized with the upper layer. > * That bug is partly masked because the logic is not triggered except on the > first row. > * That bug is masked because we always set the row offset, even if we ignore > the row in the row count. > * That bug is masked because the incorrect row count is ignored when setting > the container row count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
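The reset this report implies is missing can be sketched in isolation. This is a hypothetical standalone demo, not the actual {{RepeatedVarCharOutput}}: clearing {{fieldIndex}} every time a record finishes makes the "no fields written" skip work for every row, not just the first.

```java
// Sketch of the DRILL-5486 fix: reset fieldIndex at the end of each record
// so an empty row ANYWHERE in the file is skipped, not just an empty first
// row. Vector bookkeeping is omitted; only the counting logic is modeled.
public class EmptyRowSkip {
    int fieldIndex = -1;   // -1 means no field started in the current record
    int recordCount = 0;

    void startField(int index) { fieldIndex = index; }

    void finishRecord() {
        if (fieldIndex > -1) {  // row had at least one field: count it
            recordCount++;
        }
        fieldIndex = -1;        // the missing reset
    }

    public static void main(String[] args) {
        EmptyRowSkip out = new EmptyRowSkip();
        out.finishRecord();     // empty first row: skipped
        out.startField(0);
        out.finishRecord();     // real row: counted
        out.finishRecord();     // empty later row: now also skipped
        System.out.println("recordCount=" + out.recordCount);
    }
}
```

Without the reset, the third call above would see the stale {{fieldIndex}} from the previous row and count the empty row, which is exactly the behavior the JIRA describes.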
[jira] [Comment Edited] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001777#comment-16001777 ] Paul Rogers edited comment on DRILL-5486 at 5/8/17 11:24 PM: - Another complication. Although {{BaseRepeatedValueVector}} sets the {{`columns`}} vector column count (incorrectly, as noted above), the caller, {{ScanBatch}}, keeps its own track of record count, and uses code to set that record count on each vector. So, even if {{setValueCount()}} per vector did anything (which it does not), the wrong value set by {{BaseRepeatedValueVector}} would be overwritten by {{ScanBatch}}. was (Author: paul-rogers): Another complications. Although {{BaseRepeatedValueVector}} sets the {{`columns`}} vector column count (incorrectly, as noted above), the caller, {{ScanBatch}}, keeps its own track of record count, and uses code to set that record count on each vector. So, even if {{setValueCount()}} per vector did anything (which it does not), the wrong value set by {{BaseRepeatedValueVector}} would be overwritten by {{ScanBatch}}. > Compliant text reader tries, but fails, to ignore empty rows > > > Key: DRILL-5486 > URL: https://issues.apache.org/jira/browse/DRILL-5486 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The "compliant" text reader (for CSV files, etc.) has two modes: reading with > headers (which creates a scalar vector per file column), or without headers > (that creates a single {{columns}} vector with an array of all columns. > When run in array mode, the code uses a class called > {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: > {code} > public void finishRecord() { > ... > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > ... > {code} > As it turns out, this works only on the *first* row. On the first row, the > {{fieldIndex}} has its initial value of -1. 
But, for subsequent records, > {{fieldIndex}} is never reset and retains its value from the last field set. > The result is that the code skips the first empty row, but not empty rows > elsewhere in the file. > Further, on the first row, the logic is flawed. The part shown above is only > for the row counts. Here is more of the logic: > {code} > @Override > public void finishRecord() { > recordStart = characterData; > ... > int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; > PlatformDependent.putInt(repeatedOffset, newOffset); > repeatedOffset += 4; > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > } > {code} > Note that the underlying vector *is* bumped for the record, even though the > row count is not. Later, this will cause the container to have one less > record. > This would be a problem if the row count ({{batchIndex}}) was used for the > batch row count. But, it is not: the next level of code keeps a separate > count, so the "skip empty row" logic causes the counts to get out of sync > between the {{RepeatedVarCharOutput}} and the next level of reader. > In a sense, there are multiple self-cancelling bugs: > * The skip-empty-row logic is not synchronized with the upper layer. > * That bug is partly masked because the logic is not triggered except on the > first row. > * That bug is masked because we always set the row offset, even if we ignore > the row in the row count. > * That bug is masked because the incorrect row count is ignored when setting > the container row count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001777#comment-16001777 ] Paul Rogers commented on DRILL-5486: Another complication. Although {{BaseRepeatedValueVector}} sets the {{`columns`}} vector column count (incorrectly, as noted above), the caller, {{ScanBatch}}, keeps its own track of record count, and uses code to set that record count on each vector. So, even if {{setValueCount()}} per vector did anything (which it does not), the wrong value set by {{BaseRepeatedValueVector}} would be overwritten by {{ScanBatch}}. > Compliant text reader tries, but fails, to ignore empty rows > > > Key: DRILL-5486 > URL: https://issues.apache.org/jira/browse/DRILL-5486 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The "compliant" text reader (for CSV files, etc.) has two modes: reading with > headers (which creates a scalar vector per file column), or without headers > (which creates a single {{columns}} vector with an array of all columns). > When run in array mode, the code uses a class called > {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: > {code} > public void finishRecord() { > ... > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > ... > {code} > As it turns out, this works only on the *first* row. On the first row, the > {{fieldIndex}} has its initial value of -1. But, for subsequent records, > {{fieldIndex}} is never reset and retains its value from the last field set. > The result is that the code skips the first empty row, but not empty rows > elsewhere in the file. > Further, on the first row, the logic is flawed. The part shown above is only > for the row counts. Here is more of the logic: > {code} > @Override > public void finishRecord() { > recordStart = characterData; > ... 
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; > PlatformDependent.putInt(repeatedOffset, newOffset); > repeatedOffset += 4; > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > } > {code} > Note that the underlying vector *is* bumped for the record, even though the > row count is not. Later, this will cause the container to have one less > record. > This would be a problem if the row count ({{batchIndex}}) was used for the > batch row count. But, it is not: the next level of code keeps a separate > count, so the "skip empty row" logic causes the counts to get out of sync > between the {{RepeatedVarCharOutput}} and the next level of reader. > In a sense, there are multiple self-cancelling bugs: > * The skip-empty-row logic is not synchronized with the upper layer. > * That bug is partly masked because the logic is not triggered except on the > first row. > * That bug is masked because we always set the row offset, even if we ignore > the row in the row count. > * That bug is masked because the incorrect row count is ignored when setting > the container row count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5327) Hash aggregate can return empty batch which can cause schema change exception
[ https://issues.apache.org/jira/browse/DRILL-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001776#comment-16001776 ] Chun Chang commented on DRILL-5327: --- [~jni] I ran against apache master branch with assertion enabled but only hit the schema change exception. > Hash aggregate can return empty batch which can cause schema change exception > - > > Key: DRILL-5327 > URL: https://issues.apache.org/jira/browse/DRILL-5327 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.10.0 >Reporter: Chun Chang >Assignee: Jinfeng Ni > Attachments: 271dbe08-c725-8e82-b6ea-53e9dab64e8b.sys.drill > > > Hash aggregate can return empty batches which cause drill to throw schema > change exception (not handling this type of schema change). This is not a new > bug. But a recent hash function change (a theoretically correct change) may > have increased the chance of hitting this issue. I don't have scientific data > to support my claim (in fact I don't believe it's the case), but a regular > regression run used to pass fails now due to this bug. My concern is that > existing drill users out there may have queries that used to work but fail > now. It will be difficult to explain why the new release is better for them. > I put this bug as blocker so we can discuss it before releasing 1.10. 
> {noformat} > /root/drillAutomation/framework-master/framework/resources/Advanced/tpcds/tpcds_sf1/original/text/query66.sql > Query: > -- start query 66 in stream 0 using template query66.tpl > SELECT w_warehouse_name, >w_warehouse_sq_ft, >w_city, >w_county, >w_state, >w_country, >ship_carriers, >year1, >Sum(jan_sales) AS jan_sales, >Sum(feb_sales) AS feb_sales, >Sum(mar_sales) AS mar_sales, >Sum(apr_sales) AS apr_sales, >Sum(may_sales) AS may_sales, >Sum(jun_sales) AS jun_sales, >Sum(jul_sales) AS jul_sales, >Sum(aug_sales) AS aug_sales, >Sum(sep_sales) AS sep_sales, >Sum(oct_sales) AS oct_sales, >Sum(nov_sales) AS nov_sales, >Sum(dec_sales) AS dec_sales, >Sum(jan_sales / w_warehouse_sq_ft) AS jan_sales_per_sq_foot, >Sum(feb_sales / w_warehouse_sq_ft) AS feb_sales_per_sq_foot, >Sum(mar_sales / w_warehouse_sq_ft) AS mar_sales_per_sq_foot, >Sum(apr_sales / w_warehouse_sq_ft) AS apr_sales_per_sq_foot, >Sum(may_sales / w_warehouse_sq_ft) AS may_sales_per_sq_foot, >Sum(jun_sales / w_warehouse_sq_ft) AS jun_sales_per_sq_foot, >Sum(jul_sales / w_warehouse_sq_ft) AS jul_sales_per_sq_foot, >Sum(aug_sales / w_warehouse_sq_ft) AS aug_sales_per_sq_foot, >Sum(sep_sales / w_warehouse_sq_ft) AS sep_sales_per_sq_foot, >Sum(oct_sales / w_warehouse_sq_ft) AS oct_sales_per_sq_foot, >Sum(nov_sales / w_warehouse_sq_ft) AS nov_sales_per_sq_foot, >Sum(dec_sales / w_warehouse_sq_ft) AS dec_sales_per_sq_foot, >Sum(jan_net) AS jan_net, >Sum(feb_net) AS feb_net, >Sum(mar_net) AS mar_net, >Sum(apr_net) AS apr_net, >Sum(may_net) AS may_net, >Sum(jun_net) AS jun_net, >Sum(jul_net) AS jul_net, >Sum(aug_net) AS aug_net, >Sum(sep_net) AS sep_net, >Sum(oct_net) AS oct_net, >Sum(nov_net) AS nov_net, >Sum(dec_net) AS dec_net > FROM (SELECT w_warehouse_name, >w_warehouse_sq_ft, >w_city, >w_county, >w_state, >w_country, >'ZOUROS' >|| ',' >|| 'ZHOU' AS ship_carriers, >d_yearAS year1, >Sum(CASE > WHEN d_moy = 1 THEN ws_ext_sales_price * ws_quantity >
[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001748#comment-16001748 ] Paul Rogers commented on DRILL-5486: As it turns out, the offset vector {{setValueCount}} does not actually do anything, so the fact that it is setting the wrong value is moot. See DRILL-5488. > Compliant text reader tries, but fails, to ignore empty rows > > > Key: DRILL-5486 > URL: https://issues.apache.org/jira/browse/DRILL-5486 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The "compliant" text reader (for CSV files, etc.) has two modes: reading with > headers (which creates a scalar vector per file column), or without headers > (which creates a single {{columns}} vector holding an array of all columns). > When run in array mode, the code uses a class called > {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: > {code} > public void finishRecord() { > ... > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > ... > {code} > As it turns out, this works only on the *first* row. On the first row, the > {{fieldIndex}} has its initial value of -1. But, for subsequent records, > {{fieldIndex}} is never reset and retains its value from the last field set. > The result is that the code skips the first empty row, but not empty rows > elsewhere in the file. > Further, on the first row, the logic is flawed. The part shown above is only > for the row counts. Here is more of the logic: > {code} > @Override > public void finishRecord() { > recordStart = characterData; > ... > int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; > PlatformDependent.putInt(repeatedOffset, newOffset); > repeatedOffset += 4; > // if there were no defined fields, skip. 
> if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > } > {code} > Note that the underlying vector *is* bumped for the record, even though the > row count is not. Later, this will cause the container to have one less > record. > This would be a problem if the row count ({{batchIndex}}) were used for the > batch row count. But it is not: the next level of code keeps a separate > count, so the "skip empty row" logic causes the counts to get out of sync > between the {{RepeatedVarCharOutput}} and the next level of reader. > In a sense, there are multiple self-cancelling bugs: > * The skip-empty-row logic is not synchronized with the upper layer. > * That bug is partly masked because the logic is not triggered except on the > first row. > * That bug is masked because we always set the row offset, even if we ignore > the row in the row count. > * That bug is masked because the incorrect row count is ignored when setting > the container row count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
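The missing-reset bug described above can be modeled in a few lines. This is a hypothetical sketch, not Drill's actual classes: the field names mirror the report, but the harness is invented. It shows that resetting {{fieldIndex}} after every record makes the empty-row check fire on every empty row, not just the first.

```java
// Hypothetical model of RepeatedVarCharOutput's empty-row skip logic.
public class EmptyRowCheck {
    int fieldIndex = -1;   // -1 means "no field written for this record"
    int recordCount = 0;
    final boolean resetPerRecord;

    EmptyRowCheck(boolean resetPerRecord) { this.resetPerRecord = resetPerRecord; }

    void startField() { fieldIndex++; }

    void finishRecord() {
        // "if there were no defined fields, skip" -- as in the report
        if (fieldIndex > -1) { recordCount++; }
        // The reset the real code lacks: without it, fieldIndex keeps its
        // value from the previous record, so later empty rows are counted.
        if (resetPerRecord) { fieldIndex = -1; }
    }

    // Simulate three rows: non-empty, empty, non-empty.
    static int count(boolean fixed) {
        EmptyRowCheck out = new EmptyRowCheck(fixed);
        out.startField(); out.finishRecord(); // row 1: one field
        out.finishRecord();                   // row 2: empty
        out.startField(); out.finishRecord(); // row 3: one field
        return out.recordCount;
    }

    public static void main(String[] args) {
        System.out.println(count(false)); // buggy: empty row counted -> 3
        System.out.println(count(true));  // fixed: empty row skipped -> 2
    }
}
```

Only the first variant reproduces the reported behavior of skipping solely the first empty row; the per-record reset is the one-line change the analysis implies.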
[jira] [Commented] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()
[ https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001746#comment-16001746 ] Paul Rogers commented on DRILL-5488: A similar analysis can be done for the variable-length vectors, such as {{VarCharVector}}. > Useless code in VectorTrimmer, fixed-width vector setValueCount() > - > > Key: DRILL-5488 > URL: https://issues.apache.org/jira/browse/DRILL-5488 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Trivial > > Consider this code in a generated fixed-width vector, such as UInt4Vector: > {code} > @Override > public void setValueCount(int valueCount) { > ... > final int idx = (VALUE_WIDTH * valueCount); > ... > VectorTrimmer.trim(data, idx); > data.writerIndex(valueCount * VALUE_WIDTH); > } > {code} > Consider the {{trim()}} method: > {code} > public class VectorTrimmer { > ... > public static void trim(ByteBuf data, int idx) { > data.writerIndex(idx); > if (data instanceof DrillBuf) { > // data.capacity(idx); > data.writerIndex(idx); > } > } > } > {code} > This method is called {{trim}}, but it actually sets the writer index in the > buffer (though we never use that index.) Since all buffers we use are > {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index > twice. > But, notice that the {{setValueCount()}} method itself calls the same > {{writerIndex()}} method, so it is actually being called three times. > It seems this code can simply be discarded: it is called from only two > places; neither of which end up using the writer index. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
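To make the redundancy concrete, here is a self-contained sketch (invented class and field names, not Drill's actual code) that models the writer index as a plain field: since {{trim()}} only sets the writer index, and {{setValueCount()}} sets it again anyway, dropping the {{trim()}} call leaves the final index unchanged, which is the simplification the report proposes.

```java
// Sketch: the trim() call is a no-op relative to the assignment that follows.
public class WriterIndexDemo {
    static int writerIndex;
    static final int VALUE_WIDTH = 4;   // e.g. UInt4Vector's element width

    static void trim(int idx) { writerIndex = idx; }   // only sets the index

    // Mirrors the reported shape: trim() sets writerIndex, then the
    // method sets the very same value again.
    static void setValueCountWithTrim(int valueCount) {
        int idx = VALUE_WIDTH * valueCount;
        trim(idx);                               // redundant write
        writerIndex = VALUE_WIDTH * valueCount;  // same value, written again
    }

    // The proposed simplification: one write, same result.
    static void setValueCountSimplified(int valueCount) {
        writerIndex = VALUE_WIDTH * valueCount;
    }

    public static void main(String[] args) {
        setValueCountWithTrim(10);
        int withTrim = writerIndex;
        setValueCountSimplified(10);
        System.out.println(withTrim == writerIndex); // true: identical result
    }
}
```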
[jira] [Comment Edited] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()
[ https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706 ] Paul Rogers edited comment on DRILL-5488 at 5/8/17 11:04 PM: - Consider the {{setValueCount()}} method again: {code} public void setValueCount(int valueCount) { final int currentValueCapacity = getValueCapacity(); ... while(valueCount > getValueCapacity()) { reAlloc(); } if (valueCount > 0 && currentValueCapacity > valueCount * 2) { ... {code} The {{if}} statement is very probably wrong. We get the current capacity. If too small, we realloc. Then, we check the value, after the possible reallocation, without updating the capacity. It is not clear whether we mean to check the capacity from before or from after the reallocation. Then: {code} if (valueCount > 0 && currentValueCapacity > valueCount * 2) { incrementAllocationMonitor(); // is ++allocationMonitor } else if (allocationMonitor > 0) { allocationMonitor = 0; } {code} It seems we are trying to guess whether we did a realloc rather than simply incrementing the monitor when we actually do the reAlloc. It is also not clear that the if-statement mimics the if that does the reAlloc. In fact, a close inspection of the {{setValueCount()}} method shows that it does not even work correctly. It is allocating enough space to hold the value count, which will implicitly set the unused slots to 0. But, this is an offset vector; we should be copying forward the last set offset. Consider an example: \[0, 3, 9] as set values. Set row count as 5. Then, the offset count is (row count + 1) = 6. What we create is: \[0, 3, 9, 0, 0, 0] What we should have set is: \[0, 3, 9, 9, 9, 9] It seems to be the responsibility of the caller to set the last value, so what we actually set today, in this case, is: \[0, 3, 9, 0, 0, 9] The caveat is that the above works only if the caller diligently fills in the offsets row by row. But, if it did, there would be no reason to extend the vector in {{setValueCount()}}. 
It has to be one way or the other, not the ambiguous middle that we have in the code now. was (Author: paul-rogers): Consider the {{setValueCount()}} method again: {code} public void setValueCount(int valueCount) { final int currentValueCapacity = getValueCapacity(); ... while(valueCount > getValueCapacity()) { reAlloc(); } if (valueCount > 0 && currentValueCapacity > valueCount * 2) { ... {code} The {{if}} statement is very probably wrong. We get the current capacity. If too small, we realloc. Then, we check the value, after the possible reallocation, without updating the capacity. It is not clear whether we mean to check the capacity from before or from after the reallocation. Then: {code} if (valueCount > 0 && currentValueCapacity > valueCount * 2) { incrementAllocationMonitor(); // is ++allocationMonitor } else if (allocationMonitor > 0) { allocationMonitor = 0; } {code} It seems we are trying to guess whether we did a realloc rather than simply incrementing the monitor when we actually do the reAlloc. It is also not clear that the if-statement mimics the if that does the reAlloc. In fact, a close inspection of the {{setValueCount()}} method shows that it does not even work correctly. It is allocating enough space to hold the value count, which will implicitly set the unused slots to 0. But, this is an offset vector; we should be copying forward the last set offset. Consider an example: \[0, 3, 9] as set values. Set row count as 5. What we create is: \[0, 3, 9, 0, 0, 0] What we should have set is: \[0, 3, 9, 9, 9, 9] It seems to be the responsibility of the caller to set the last value, so what we actually set today, in this case, is: \[0, 3, 9, 0, 0, 9] The caveat is that the above works only if the caller diligently fills in the offsets row by row. But, if it did, there would be no reason to extend the vector in {{setValueCount()}}. 
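As a sketch of the back-fill behavior the comment argues for (a hypothetical helper, not Drill's implementation): when {{setValueCount()}} extends an offset vector past the last written entry, the trailing slots should carry the last offset forward rather than being zero-filled, so every unfilled row reads as a zero-length array.

```java
import java.util.Arrays;

// Hypothetical sketch of a correct offset-vector setValueCount:
// back-fill unset trailing slots with the last written offset.
public class OffsetBackfill {

    // 'written' holds the offsets filled in so far; the result has
    // rowCount + 1 entries, with the last offset carried forward so
    // offset[i+1] - offset[i] == 0 for every unfilled row.
    static int[] setValueCount(int[] written, int rowCount) {
        int[] offsets = Arrays.copyOf(written, rowCount + 1);
        int last = written.length > 0 ? written[written.length - 1] : 0;
        for (int i = written.length; i <= rowCount; i++) {
            offsets[i] = last;   // carry the last offset forward
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The comment's example: offsets [0, 3, 9] written, row count set to 5.
        int[] result = setValueCount(new int[] {0, 3, 9}, 5);
        System.out.println(Arrays.toString(result)); // [0, 3, 9, 9, 9, 9]
    }
}
```

With zero-fill instead, the example produces the \[0, 3, 9, 0, 0, 0] the comment flags as wrong: rows 3 and 4 would appear to have negative lengths.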
[jira] [Commented] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()
[ https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706 ] Paul Rogers commented on DRILL-5488: Consider the {{setValueCount()}} method again: {code} public void setValueCount(int valueCount) { final int currentValueCapacity = getValueCapacity(); ... while(valueCount > getValueCapacity()) { reAlloc(); } if (valueCount > 0 && currentValueCapacity > valueCount * 2) { ... {code} The {{if}} statement is very probably wrong. We get the current capacity. If too small, we realloc. Then, we check the value, after the possible reallocation, without updating the capacity. It is not clear whether we mean to check the capacity from before or from after the reallocation. Then: {code} if (valueCount > 0 && currentValueCapacity > valueCount * 2) { incrementAllocationMonitor(); // is ++allocationMonitor } else if (allocationMonitor > 0) { allocationMonitor = 0; } {code} It seems we are trying to guess whether we did a realloc rather than simply incrementing the monitor when we actually do the reAlloc. It is also not clear that the if-statement mimics the if that does the reAlloc. > Useless code in VectorTrimmer, fixed-width vector setValueCount() > - > > Key: DRILL-5488 > URL: https://issues.apache.org/jira/browse/DRILL-5488 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Trivial > > Consider this code in a generated fixed-width vector, such as UInt4Vector: > {code} > @Override > public void setValueCount(int valueCount) { > ... > final int idx = (VALUE_WIDTH * valueCount); > ... > VectorTrimmer.trim(data, idx); > data.writerIndex(valueCount * VALUE_WIDTH); > } > {code} > Consider the {{trim()}} method: > {code} > public class VectorTrimmer { > ... 
> public static void trim(ByteBuf data, int idx) { > data.writerIndex(idx); > if (data instanceof DrillBuf) { > // data.capacity(idx); > data.writerIndex(idx); > } > } > } > {code} > This method is called {{trim}}, but it actually sets the writer index in the > buffer (though we never use that index.) Since all buffers we use are > {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index > twice. > But, notice that the {{setValueCount()}} method itself calls the same > {{writerIndex()}} method, so it is actually being called three times. > It seems this code can simply be discarded: it is called from only two > places; neither of which end up using the writer index. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
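One way to read the comment's suggestion is to count reallocations where they happen, instead of inferring them afterwards from a stale capacity snapshot. A minimal sketch with invented names (not Drill code):

```java
// Sketch: increment the allocation monitor inside reAlloc() itself, so
// the count cannot disagree with what actually happened.
public class AllocMonitor {
    static int capacity = 4;          // initial capacity (arbitrary)
    static int allocationMonitor = 0; // reallocations actually performed

    static void reAlloc() {
        capacity *= 2;
        allocationMonitor++;   // record the realloc at the moment it occurs
    }

    static void setValueCount(int valueCount) {
        // Same growth loop shape as the report; no after-the-fact guessing.
        while (valueCount > capacity) {
            reAlloc();
        }
    }

    public static void main(String[] args) {
        setValueCount(10);  // capacity grows 4 -> 8 -> 16: two reallocations
        System.out.println(capacity + " " + allocationMonitor); // 16 2
    }
}
```

This removes the need for the questionable {{currentValueCapacity > valueCount * 2}} test entirely; whether Drill should keep that decay-to-zero branch is a separate design question the comment leaves open.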
[jira] [Updated] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()
[ https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5488: --- Summary: Useless code in VectorTrimmer, fixed-width vector setValueCount() (was: Useless code in VectorTrimmer) > Useless code in VectorTrimmer, fixed-width vector setValueCount() > - > > Key: DRILL-5488 > URL: https://issues.apache.org/jira/browse/DRILL-5488 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Trivial > > Consider this code in a generated fixed-width vector, such as UInt4Vector: > {code} > @Override > public void setValueCount(int valueCount) { > ... > final int idx = (VALUE_WIDTH * valueCount); > ... > VectorTrimmer.trim(data, idx); > data.writerIndex(valueCount * VALUE_WIDTH); > } > {code} > Consider the {{trim()}} method: > {code} > public class VectorTrimmer { > ... > public static void trim(ByteBuf data, int idx) { > data.writerIndex(idx); > if (data instanceof DrillBuf) { > // data.capacity(idx); > data.writerIndex(idx); > } > } > } > {code} > This method is called {{trim}}, but it actually sets the writer index in the > buffer (though we never use that index.) Since all buffers we use are > {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index > twice. > But, notice that the {{setValueCount()}} method itself calls the same > {{writerIndex()}} method, so it is actually being called three times. > It seems this code can simply be discarded: it is called from only two > places; neither of which end up using the writer index. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (DRILL-5488) Useless code in VectorTrimmer
[ https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5488: --- Description: Consider this code in a generated fixed-width vector, such as UInt4Vector: {code} @Override public void setValueCount(int valueCount) { ... final int idx = (VALUE_WIDTH * valueCount); ... VectorTrimmer.trim(data, idx); data.writerIndex(valueCount * VALUE_WIDTH); } {code} Consider the {{trim()}} method: {code} public class VectorTrimmer { ... public static void trim(ByteBuf data, int idx) { data.writerIndex(idx); if (data instanceof DrillBuf) { // data.capacity(idx); data.writerIndex(idx); } } } {code} This method is called {{trim}}, but it actually sets the writer index in the buffer (though we never use that index.) Since all buffers we use are {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index twice. But, notice that the {{setValueCount()}} method itself calls the same {{writerIndex()}} method, so it is actually being called three times. It seems this code can simply be discarded: it is called from only two places; neither of which end up using the writer index. was: Consider this code in a generated fixed-width vector, such as UInt4Vector: {code} @Override public void setValueCount(int valueCount) { ... final int idx = (VALUE_WIDTH * valueCount); ... VectorTrimmer.trim(data, idx); data.writerIndex(valueCount * VALUE_WIDTH); } {code} Consider the {{trim()}} method: {code} public class VectorTrimmer { ... public static void trim(ByteBuf data, int idx) { data.writerIndex(idx); if (data instanceof DrillBuf) { // data.capacity(idx); data.writerIndex(idx); } } } {code} This method is called {{trim}}, but it actually sets the writer index in the buffer (though we never use that index.) Since all buffers we use are {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index twice. 
It seems this code can simply be discarded: it is called from only two places; neither of which end up using the writer index. > Useless code in VectorTrimmer > - > > Key: DRILL-5488 > URL: https://issues.apache.org/jira/browse/DRILL-5488 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Trivial > > Consider this code in a generated fixed-width vector, such as UInt4Vector: > {code} > @Override > public void setValueCount(int valueCount) { > ... > final int idx = (VALUE_WIDTH * valueCount); > ... > VectorTrimmer.trim(data, idx); > data.writerIndex(valueCount * VALUE_WIDTH); > } > {code} > Consider the {{trim()}} method: > {code} > public class VectorTrimmer { > ... > public static void trim(ByteBuf data, int idx) { > data.writerIndex(idx); > if (data instanceof DrillBuf) { > // data.capacity(idx); > data.writerIndex(idx); > } > } > } > {code} > This method is called {{trim}}, but it actually sets the writer index in the > buffer (though we never use that index.) Since all buffers we use are > {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index > twice. > But, notice that the {{setValueCount()}} method itself calls the same > {{writerIndex()}} method, so it is actually being called three times. > It seems this code can simply be discarded: it is called from only two > places; neither of which end up using the writer index. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (DRILL-5488) Useless code in VectorTrimmer
Paul Rogers created DRILL-5488: -- Summary: Useless code in VectorTrimmer Key: DRILL-5488 URL: https://issues.apache.org/jira/browse/DRILL-5488 Project: Apache Drill Issue Type: Bug Affects Versions: 1.10.0 Reporter: Paul Rogers Priority: Trivial Consider this code in a generated fixed-width vector, such as UInt4Vector: {code} @Override public void setValueCount(int valueCount) { ... final int idx = (VALUE_WIDTH * valueCount); ... VectorTrimmer.trim(data, idx); data.writerIndex(valueCount * VALUE_WIDTH); } {code} Consider the {{trim()}} method: {code} public class VectorTrimmer { ... public static void trim(ByteBuf data, int idx) { data.writerIndex(idx); if (data instanceof DrillBuf) { // data.capacity(idx); data.writerIndex(idx); } } } {code} This method is called {{trim}}, but it actually sets the writer index in the buffer (though we never use that index.) Since all buffers we use are {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index twice. It seems this code can simply be discarded: it is called from only two places; neither of which end up using the writer index. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (DRILL-4335) Apache Drill should support network encryption
[ https://issues.apache.org/jira/browse/DRILL-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudheesh Katkam updated DRILL-4335: --- Labels: ready-to-commit security (was: security) > Apache Drill should support network encryption > -- > > Key: DRILL-4335 > URL: https://issues.apache.org/jira/browse/DRILL-4335 > Project: Apache Drill > Issue Type: New Feature >Reporter: Keys Botzum >Assignee: Sorabh Hamirwasia > Labels: ready-to-commit, security > Attachments: ApacheDrillEncryptionUsingSASLDesign.pdf > > > This is clearly related to Drill-291 but wanted to make explicit that this > needs to include network level encryption and not just authentication. This > is particularly important for the client connection to Drill which will often > be sending passwords in the clear until there is encryption. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001672#comment-16001672 ] Paul Rogers commented on DRILL-5486: More cascading errors. In {{RepeatedVarCharOutput.finishBatch()}}: {code} public void finishBatch() { mutator.setValueCount(batchIndex); } {code} Here, we are using {{batchIndex}} as the row count. We must consider the two cases: it is right, and it is off-by-one. For the correct case, let's look at {{BaseRepeatedValueVector.setValueCount()}}: {code} public void setValueCount(int valueCount) { // TODO: populate offset end points offsets.getMutator().setValueCount(valueCount == 0 ? 0 : valueCount+1); final int childValueCount = valueCount == 0 ? 0 : offsets.getAccessor().get(valueCount); vector.getMutator().setValueCount(childValueCount); } {code} Consider that the row count is one. The offset vector must contain two entries: \[0, x], where x is the length of the first (and only) array. That is, the offset count is always one more than the row count (except when we have zero rows). Some background, assuming we have just one row: * The first entry must always be zero. * To get the length of the ith row, we compute: offset\[i+1] - offset\[i]. So, we need the result \[0, final-offset]. Then, we need to tell the array the number of entries it contains. This is the sum of all array entries in the batch. That value is at the (row count)th position of the offset vector, as mentioned above. This works because {{RepeatedVarCharOutput}} updates the (i+1)th position for each row. Now, consider the case with the bug mentioned earlier: we update the offset vector, but not the row count. Consider this input: {code} a,b,c {code} The first line is blank, so the {{batchIndex}} is *not* incremented, but the offset vector *is* written. 
At the end: {code} batchIndex = 1 offset vector = [0, 0, 3] {code} Then, the above method is given a row count of 1, and so: {code} count of row offsets: 2 offset values: [0, 0, 3] (but the value of 3 is ignored because the count is set to 2) childValueCount = row offsets[1] = 0 `columns` array item count: 0 (though there are actually three children) {code} Then, the next layer sets the row count to 2 (not knowing that the first row was skipped). So, in the end we have: * A row count of 2 * A `columns` offset vector of length 2: \[0, 0], meaning 1 row of zero length. * A `columns` item offset vector of length 0, meaning an empty array (in the only row) This is an unstable situation: 2 rows at the container level, but only one row set up in the `columns` vector. This is very confusing because the code is wrong, so one not only has to understand what should happen, but also the incorrect behavior that actually happens. > Compliant text reader tries, but fails, to ignore empty rows > > > Key: DRILL-5486 > URL: https://issues.apache.org/jira/browse/DRILL-5486 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Priority: Minor > > The "compliant" text reader (for CSV files, etc.) has two modes: reading with > headers (which creates a scalar vector per file column), or without headers > (that creates a single {{columns}} vector with an array of all columns). > When run in array mode, the code uses a class called > {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: > {code} > public void finishRecord() { > ... > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > ... > {code} > As it turns out, this works only on the *first* row. On the first row, the > {{fieldIndex}} has its initial value of -1. But, for subsequent records, > {{fieldIndex}} is never reset and retains its value from the last field set. 
> The result is that the code skips the first empty row, but not empty rows > elsewhere in the file. > Further, on the first row, the logic is flawed. The part shown above is only > for the row counts. Here is more of the logic: > {code} > @Override > public void finishRecord() { > recordStart = characterData; > ... > int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; > PlatformDependent.putInt(repeatedOffset, newOffset); > repeatedOffset += 4; > // if there were no defined fields, skip. > if (fieldIndex > -1) { > batchIndex++; > recordCount++; > } > } > {code} > Note that the underlying vector *is* bumped for the record, even though the > row count is not. Later, this will cause the container to have one less > record. > This would be a problem if the row count ({{batchIndex}}) were used for the > batch row count. But it is not: the next level of code keeps a separate count, > so the "skip empty row" logic causes the counts to get out of sync between the > {{RepeatedVarCharOutput}} and the next level of reader.
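The offset arithmetic walked through in the comment above can be sketched with plain arrays (the class and method names below are illustrative, not Drill code): for N rows the offset vector holds N+1 entries, row lengths come from adjacent offsets, and the child count is read from position N.

```java
// Sketch (NOT Drill code) of the offset-vector arithmetic for repeated
// vectors: offsets has rowCount+1 entries; offsets[0] is always 0.
public class OffsetVectorMath {
    // Length of row i, per the comment: offset[i+1] - offset[i].
    static int rowLength(int[] offsets, int i) {
        return offsets[i + 1] - offsets[i];
    }

    // Total child entries in the batch: the last populated offset,
    // mirroring BaseRepeatedValueVector.setValueCount()'s lookup.
    static int childValueCount(int[] offsets, int rowCount) {
        return rowCount == 0 ? 0 : offsets[rowCount];
    }

    public static void main(String[] args) {
        // One row holding 5 entries: offsets are [0, 5].
        int[] oneRow = {0, 5};
        System.out.println(rowLength(oneRow, 0));       // length of row 0
        System.out.println(childValueCount(oneRow, 1)); // child count

        // The buggy case from the ticket: batchIndex = 1 but the offset
        // vector is [0, 0, 3]. Taking rowCount as 1 reads offsets[1] = 0,
        // so the three characters recorded at offsets[2] are dropped.
        int[] buggy = {0, 0, 3};
        System.out.println(childValueCount(buggy, 1));
    }
}
```

With the correct row count of 2, {{childValueCount(buggy, 2)}} would return 3, which is why the off-by-one row count silently empties the `columns` vector.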
[jira] [Commented] (DRILL-5470) CSV reader data corruption on truncated lines
[ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001556#comment-16001556 ] Paul Rogers commented on DRILL-5470: Turns out the user is not actually using the columns header feature of CSV. Created DRILL-5487 to describe the bug found above in the CSV-with headers case. Instead, the user uses the {{columns}} form of CSV files. This ticket will focus on that issue. > CSV reader data corruption on truncated lines > - > > Key: DRILL-5470 > URL: https://issues.apache.org/jira/browse/DRILL-5470 > Project: Apache Drill > Issue Type: Bug > Components: Server >Affects Versions: 1.10.0 > Environment: - ubuntu 14.04 > - r3.8xl (32 CPU/240GB Mem) > - openjdk version "1.8.0_111" > - drill 1.10.0 with 8656c83b00f8ab09fb6817e4e9943b2211772541 cherry-picked >Reporter: Nathan Butler >Assignee: Paul Rogers >Priority: Critical > > Per the mailing list discussion and Rahul's and Paul's suggestion I'm filing > this Jira issue. Drill seems to be running out of memory when doing an > External Sort. Per Zelaine's suggestion I enabled > sort.external.disable_managed in drill-override.conf and in the sqlline > session. This caused the query to run for longer but it still would fail with > the same message. > Per Paul's suggestion, I enabled debug logging for the > org.apache.drill.exec.physical.impl.xsort.managed package and re-ran the > query. > Here's the initial DEBUG line for ExternalSortBatch for our query: > bq. 2017-05-03 12:02:56,095 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:15] > DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Config: memory limit = 10737418240, > spill file size = 268435456, spill batch size = 8388608, merge limit = > 2147483647, merge batch size = 16777216 > And here's the last DEBUG line before the stack trace: > bq. 
2017-05-03 12:37:44,249 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:4] > DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Available memory: 10737418240, > buffer memory = 10719535268, merge memory = 10707140978 > And the stacktrace: > {quote} > 2017-05-03 12:38:02,927 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] INFO > o.a.d.e.p.i.x.m.ExternalSortBatch - User Error Occurred: External Sort > encountered an error while spilling to disk (Un > able to allocate buffer of size 268435456 due to memory limit. Current > allocation: 10579849472) > org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External > Sort encountered an error while spilling to disk > [Error Id: 5d53c677-0cd9-4c01-a664-c02089670a1c ] > at > org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) > ~[drill-common-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1447) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:1376) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory(ExternalSortBatch.java:1339) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch(ExternalSortBatch.java:831) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:618) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:660) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:559) > [drill-java-exec-1.10.0.jar:1.10.0] > at > 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:137) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) > [drill-java-exec-1.10.0.jar:1.10.0] > at > org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec
[jira] [Created] (DRILL-5487) Vector corruption in CSV with headers and truncated last row
Paul Rogers created DRILL-5487: -- Summary: Vector corruption in CSV with headers and truncated last row Key: DRILL-5487 URL: https://issues.apache.org/jira/browse/DRILL-5487 Project: Apache Drill Issue Type: Bug Affects Versions: 1.10.0 Reporter: Paul Rogers The CSV format plugin allows two ways of reading data: * As named columns * As a single array, called {{columns}}, that holds all columns for a row The named columns feature will corrupt the offset vectors if the last row of the file is truncated: it leaves off one or more columns. To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form: {code} h,u abc,def ghi {code} Note that the file is truncated: the comma and second field are missing on the last line. Then, I created a simple test using the "cluster fixture" framework: {code} @Test public void readerTest() throws Exception { FixtureBuilder builder = ClusterFixture.builder() .maxParallelization(1); try (ClusterFixture cluster = builder.build(); ClientFixture client = cluster.clientFixture()) { TextFormatConfig csvFormat = new TextFormatConfig(); csvFormat.fieldDelimiter = ','; csvFormat.skipFirstLine = false; csvFormat.extractHeader = true; cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat); String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10"; client.queryBuilder().sql(sql).printCsv(); } } {code} The results show we've got a problem: {code} Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IllegalArgumentException: length: -3 (expected: >= 0) {code} If the last line were: {code} efg, {code} Then the offset vector should look like this: {code} [0, 3, 3] {code} Very likely we have an offset vector that looks like this instead: {code} [0, 3, 0] {code} When we compute the second column of the second row, we should compute: {code} length = offset[2] - offset[1] = 3 - 3 = 0 {code} Instead we get: {code} length = offset[2] - offset[1] = 0 - 3 = -3 
{code} The summary is that a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vectors to fill in the last row. Instead, they are left at 0, causing havoc downstream in the query. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
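The negative-length failure above is reproducible with plain arrays. This is a sketch of the arithmetic only (the names are hypothetical, not Drill's vector classes): when the missing trailing column is never back-filled with an empty value, the final offset stays 0 and the computed length goes negative.

```java
// Sketch (NOT Drill code) of the offset corruption from a truncated last row.
public class TruncatedRowOffsets {
    // Field length for a given row: offset[row + 1] - offset[row].
    static int length(int[] offsets, int row) {
        return offsets[row + 1] - offsets[row];
    }

    public static void main(String[] args) {
        // Corrupt vector from the ticket: the last offset was never bumped.
        int[] corrupt = {0, 3, 0};
        // What it should be: an empty value "bumps" the offset for the
        // missing column, carrying the previous offset forward.
        int[] repaired = {0, 3, 3};

        System.out.println(length(corrupt, 1));  // negative, as in the error
        System.out.println(length(repaired, 1)); // zero-length (empty) field
    }
}
```

The fix the ticket implies is the {{repaired}} shape: on EOF, write an empty value for each missing column so every offset slot for the last row is populated.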
[jira] [Updated] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
[ https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-5486: --- Description: The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns. When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: {code} public void finishRecord() { ... // if there were no defined fields, skip. if (fieldIndex > -1) { batchIndex++; recordCount++; } ... {code} As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set. The result is that the code skips the first empty row, but not empty rows elsewhere in the file. Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic: {code} @Override public void finishRecord() { recordStart = characterData; ... int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; PlatformDependent.putInt(repeatedOffset, newOffset); repeatedOffset += 4; // if there were no defined fields, skip. if (fieldIndex > -1) { batchIndex++; recordCount++; } } {code} Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record. This would be a problem if the row count ({{batchIndex}}) was used for the batch row count. But, it is not, the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader. 
In a sense, there are multiple self-cancelling bugs: * The skip-empty-row logic is not synchronized with the upper layer. * That bug is partly masked because the logic is not triggered except on the first row. * That bug is masked because we always set the row offset, even if we ignore the row in the row count. * That bug is masked because the incorrect row count is ignored when setting the container row count. was: The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns). When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: {code} public void finishRecord() { ... // if there were no defined fields, skip. if (fieldIndex > -1) { batchIndex++; recordCount++; } ... {code} As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set. The result is that the code skips the first empty row, but not empty rows elsewhere in the file. Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic: {code} @Override public void finishRecord() { recordStart = characterData; ... int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; PlatformDependent.putInt(repeatedOffset, newOffset); repeatedOffset += 4; // if there were no defined fields, skip. if (fieldIndex > -1) { batchIndex++; recordCount++; } } {code} Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record. Suppose we have this data: [] [abc, def] The above bug leaves the data in this form: [] Why? 
We counted only one record (the second), but we created an entry for the first (empty) record. When we set the row count, we set it to 1, which covers only the first. The result is the loss of the second row. Or, that would be true if the row count ({{batchIndex}}) were used for the batch row count. But it is not: the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader. In a sense, there are multiple self-cancelling bugs: * The skip-empty-row logic is not synchronized with the upper layer. * That bug is partly masked because the logic is not triggered except on the first row. * That bug is masked because we always set the row offset, even if we ignore the row in the row count. * That bug is masked because the incorrect row count is ignored when setting the container row count.
[jira] [Created] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows
Paul Rogers created DRILL-5486: -- Summary: Compliant text reader tries, but fails, to ignore empty rows Key: DRILL-5486 URL: https://issues.apache.org/jira/browse/DRILL-5486 Project: Apache Drill Issue Type: Bug Affects Versions: 1.10.0 Reporter: Paul Rogers Priority: Minor The "compliant" text reader (for CSV files, etc.) has two modes: reading with headers (which creates a scalar vector per file column), or without headers (that creates a single {{columns}} vector with an array of all columns. When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. This class attempts to ignore empty records: {code} public void finishRecord() { ... // if there were no defined fields, skip. if (fieldIndex > -1) { batchIndex++; recordCount++; } ... {code} As it turns out, this works only on the *first* row. On the first row, the {{fieldIndex}} has its initial value of -1. But, for subsequent records, {{fieldIndex}} is never reset and retains its value from the last field set. The result is that the code skips the first empty row, but not empty rows elsewhere in the file. Further, on the first row, the logic is flawed. The part shown above is only for the row counts. Here is more of the logic: {code} @Override public void finishRecord() { recordStart = characterData; ... int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4; PlatformDependent.putInt(repeatedOffset, newOffset); repeatedOffset += 4; // if there were no defined fields, skip. if (fieldIndex > -1) { batchIndex++; recordCount++; } } {code} Note that the underlying vector *is* bumped for the record, even though the row count is not. Later, this will cause the container to have one less record. Suppose we have this data: [] [abc, def] The above bug leaves the data in this form: [] Why? We counted only one record (the second), but we created an entry for the first (empty) record. When we set row count, we set it to 1, which is only the first. 
The result is the loss of the second row. Or, that would be true if the row count ({{batchIndex}}) were used for the batch row count. But it is not: the next level of code keeps a separate count, so the "skip empty row" logic causes the counts to get out of sync between the {{RepeatedVarCharOutput}} and the next level of reader. In a sense, there are multiple self-cancelling bugs: * The skip-empty-row logic is not synchronized with the upper layer. * That bug is partly masked because the logic is not triggered except on the first row. * That bug is masked because we always set the row offset, even if we ignore the row in the row count. * That bug is masked because the incorrect row count is ignored when setting the container row count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
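The skip-empty-row behavior described in this ticket can be modeled with a toy loop (this is an illustrative simulation, not Drill's actual {{RepeatedVarCharOutput}}): because {{fieldIndex}} is never reset after a record, the "no defined fields" check fires only for an empty row at the very start of the file.

```java
// Toy model (NOT Drill code) of the fieldIndex reset bug in finishRecord().
public class SkipEmptyRecord {
    // Counts non-empty records. resetAfterRecord=false reproduces the bug:
    // fieldIndex keeps its value from the previous record's last field.
    static int countRecords(int[][] rows, boolean resetAfterRecord) {
        int fieldIndex = -1;   // initial value, as in the ticket
        int recordCount = 0;
        for (int[] row : rows) {
            for (int ignored : row) fieldIndex++;   // each field bumps the index
            if (fieldIndex > -1) recordCount++;     // "no defined fields, skip"
            if (resetAfterRecord) fieldIndex = -1;  // the missing reset
        }
        return recordCount;
    }

    public static void main(String[] args) {
        int[][] data = { {}, {1, 2}, {}, {3} };     // rows 0 and 2 are empty
        System.out.println(countRecords(data, true));  // both empties skipped
        System.out.println(countRecords(data, false)); // only the first skipped
    }
}
```

With the reset in place, both empty rows are skipped (count 2); without it, the second empty row is counted anyway (count 3), matching the ticket's "works only on the first row" observation.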
[jira] [Closed] (DRILL-4982) Hive Queries degrade when queries switch between different formats
[ https://issues.apache.org/jira/browse/DRILL-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dechang Gu closed DRILL-4982. - Verified and added testcase to perf test framework. > Hive Queries degrade when queries switch between different formats > -- > > Key: DRILL-4982 > URL: https://issues.apache.org/jira/browse/DRILL-4982 > Project: Apache Drill > Issue Type: Bug >Reporter: Chunhui Shi >Assignee: Karthikeyan Manivannan >Priority: Critical > Fix For: 1.10.0 > > > We have seen degraded performance by doing these steps: > 1) generate the repro data: > python script repro.py as below: > import string > import random > > for i in range(3000): > x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ > in range(random.randrange(19, 27))) > x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ > in range(random.randrange(19, 27))) > x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ > in range(random.randrange(19, 27))) > x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ > in range(random.randrange(19, 27))) > x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ > in range(random.randrange(19, 27))) > x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ > in range(random.randrange(19, 27))) > print > "{0}".format(x1),"{0}".format(x2),"{0}".format(x3),"{0}".format(x4),"{0}".format(x5),"{0}".format(x6) > python repro.py > repro.csv > 2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. 
Under > hive prompt, use the following sql command to create an external table: > CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, > `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION > '/tmp/hiveworkspace/plain' > 3) create Hive tables in ORC|PARQUET format: > CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT > id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`; > CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT > id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`; > 4) Switch queries between these two tables; the query time on the same > table then lengthens significantly. On my setup, for ORC, it was 15sec -> 26secs. > Queries on tables of other formats, after injecting a query against another format, > all show a significant slowdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (DRILL-4824) JSON with complex nested data produces incorrect output with missing fields
[ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-4824: Issue Type: Improvement (was: Bug) > JSON with complex nested data produces incorrect output with missing fields > --- > > Key: DRILL-4824 > URL: https://issues.apache.org/jira/browse/DRILL-4824 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - JSON >Affects Versions: 1.0.0 >Reporter: Roman >Assignee: Volodymyr Vysotskyi > > There is incorrect output in the case of a JSON file with complex nested data. > _JSON:_ > {code:none|title=example.json|borderStyle=solid} > { > "Field1" : { > } > } > { > "Field1" : { > "InnerField1": {"key1":"value1"}, > "InnerField2": {"key2":"value2"} > } > } > { > "Field1" : { > "InnerField3" : ["value3", "value4"], > "InnerField4" : ["value5", "value6"] > } > } > {code} > _Query:_ > {code:sql} > select Field1 from dfs.`/tmp/example.json` > {code} > _Incorrect result:_ > {code:none} > +---+ > | Field1 | > +---+ > {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{"key1":"value1"},"InnerField2" > {"key2":"value2"},"InnerField3":[],"InnerField4":[]} > {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} > There is no need to output missing fields. In the case of a deeply nested > structure, we will get an unreadable result for the user. > _Correct result:_ > {code:none} > +--+ > | Field1 | > +--+ > |{} > {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}} > {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} > +--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-4917) Unable to alias columns from subquery
[ https://issues.apache.org/jira/browse/DRILL-4917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000383#comment-16000383 ] sven0726 commented on DRILL-4917: - The same thing happened to me 0: jdbc:drill:zk=local> SELECT `respCode`,SUM(`num`) AS `num` FROM GROUP BY `respCode` ; +---++ | respCode | $f1 | +---++ > Unable to alias columns from subquery > - > > Key: DRILL-4917 > URL: https://issues.apache.org/jira/browse/DRILL-4917 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.8.0 >Reporter: Dan Wild > > Column aliasing works in a simple query (without subqueries): > {code} > select 1 as myVal from (values(1)) > {code} > My result set gives me one column called myVal as expected > |myVal|| > |1| > However, when I run the query > {code} > select myVal as myValAlias FROM(select 1 as myVal from (values(1))) > {code} > the alias myValAlias is not applied, and the resulting column is still called > myVal > |myVal|| > |1| > This is problematic because if my query instead looked like > {code} > select myVal, SUM(myVal) as mySum FROM(select 1 as myVal from (values(1))) > GROUP BY myVal > {code} > I would get a result set back that looks like this, with no way to alias the > second column: > |myVal|$f1| > |1|1| -- This message was sent by Atlassian JIRA (v6.3.15#6346)