[jira] [Commented] (DRILL-4609) Select true,true,true from ... does not always output true,true,true

2017-05-08 Thread Kunal Khatua (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002122#comment-16002122
 ] 

Kunal Khatua commented on DRILL-4609:
-

[~karthikm] Do you think there is an optimization bug in JDK 1.8 for this?

> Select true,true,true from ... does not always output true,true,true
> 
>
> Key: DRILL-4609
> URL: https://issues.apache.org/jira/browse/DRILL-4609
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - CLI, Query Planning & Optimization, Storage - 
> Writer
>Affects Versions: 1.5.0, 1.6.0
> Environment: Linux Redhat
> tested in cluster (hdfs) and embedded mode
>Reporter: F Méthot
>
> Doing a simple "select true, true, true from table" won't output 
> true,true,true on all generated rows.
> Steps to reproduce:
> Generate a simple CSV file:
> {code:bash}
>   for i in {1..100}; do echo "Allo"; done > /users/fmethot/test.csv
> {code}
> Open a fresh Drill CLI.
> Just to help with validation, switch the output format to CSV: 
> {code:sql}
>   alter session set `store.format`='csv' 
> {code}
> Generate a table like this:
> {code:sql}
> create table TEST_OUT as (select true,true,true,true from 
> dfs.`/users/fmethot/test.csv`)
> {code}
> Check the content of the generated TEST_OUT table:
> you will find false values in there!
> If you generate another table, in the same session, the same way, chances are 
> the values will be fine (all true). We can only reproduce this on the first 
> CTAS run. 
> We came to test this select pattern after we realized our custom boolean UDFs 
> (as well as the ones provided in Drill, like "ilike") were not outputting 
> consistent, deterministic results (the same input was inexplicably generating 
> random boolean output). We hope that fixing this ticket will also fix our 
> issue with boolean UDFs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5300) SYSTEM ERROR: IllegalStateException: Memory was leaked by query while querying parquet files

2017-05-08 Thread Kunal Khatua (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002117#comment-16002117
 ] 

Kunal Khatua commented on DRILL-5300:
-

[~mgelbana] Did Drill 1.10 resolve this?

> SYSTEM ERROR: IllegalStateException: Memory was leaked by query while 
> querying parquet files
> 
>
> Key: DRILL-5300
> URL: https://issues.apache.org/jira/browse/DRILL-5300
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.9.0
> Environment: OS: Linux
>Reporter: Muhammad Gelbana
> Attachments: both_queries_logs.zip
>
>
> Running the following query against parquet files triggers the reported 
> memory leak (I modified some values for privacy reasons):
> {code:title=Query causing the long logs|borderStyle=solid}
> SELECT AL4.NAME, AL5.SEGMENT2, SUM(AL1.AMOUNT), AL2.ATTRIBUTE4, 
> AL2.XXX__CODE, AL8.D_BU, AL8.F_PL, AL18.COUNTRY, AL13.COUNTRY, 
> AL11.NAME FROM 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_XX/RA__TRX_LINE_GL_DIST_ALL`
>  AL1, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_XX/RA_OMER_TRX_ALL`
>  AL2, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_FIN_COMMON/GL_XXX`
>  AL3, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_HR_COMMON/HR_ALL_ORGANIZATION_UNITS`
>  AL4, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_FIN_COMMON/GL_CODE_COMBINATIONS`
>  AL5, 
> dfs.`/disk2/XXX/XXX//data/../parquet//XXAT_AR_MU_TAB` 
> AL8, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_FIN_COMMON/GL_XXX`
>  AL11, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_X_S`
>  AL12, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_LOCATIONS`
>  AL13, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___S_ALL`
>  AL14, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___USES_ALL`
>  AL15, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___S_ALL`
>  AL16, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX___USES_ALL`
>  AL17, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_LOCATIONS`
>  AL18, 
> dfs.`/disk2/XXX/XXX//data/../parquet/XXX_X_COMMON/XX_X_S`
>  AL19 WHERE (AL2.SHIP_TO__USE_ID = AL15._USE_ID AND 
> AL15.___ID = AL14.___ID AND AL14.X__ID = 
> AL12.X__ID AND AL12.LOCATION_ID = AL13.LOCATION_ID AND 
> AL17.___ID = AL16.___ID AND AL16.X__ID = 
> AL19.X__ID AND AL19.LOCATION_ID = AL18.LOCATION_ID AND 
> AL2.BILL_TO__USE_ID = AL17._USE_ID AND AL2.SET_OF_X_ID = 
> AL3.SET_OF_X_ID AND AL1.CODE_COMBINATION_ID = AL5.CODE_COMBINATION_ID AND 
> AL5.SEGMENT4 = AL8.MU AND AL1.SET_OF_X_ID = AL11.SET_OF_X_ID AND 
> AL2.ORG_ID = AL4.ORGANIZATION_ID AND AL2.OMER_TRX_ID = 
> AL1.OMER_TRX_ID) AND ((AL5.SEGMENT2 = '41' AND AL1.AMOUNT <> 0 AND 
> AL4.NAME IN ('XXX-XX-', 'XXX-XX-', 'XXX-XX-', 'XXX-XX-', 
> 'XXX-XX-', 'XXX-XX-', 'XXX-XX-', 'XXX-XX-', 'XXX-XX-') 
> AND AL3.NAME like '%-PR-%')) GROUP BY AL4.NAME, AL5.SEGMENT2, AL2.ATTRIBUTE4, 
> AL2.XXX__CODE, AL8.D_BU, AL8.F_PL, AL18.COUNTRY, AL13.COUNTRY, 
> AL11.NAME
> {code}
> {code:title=Query causing the short logs|borderStyle=solid}
> SELECT AL11.NAME
> FROM
> dfs.`/XXX/XXX/XXX/data/../parquet/XXX_XXX_COMMON/GL_XXX` 
> LIMIT 10
> {code}
> This issue may be a duplicate of [this 
> one|https://issues.apache.org/jira/browse/DRILL-4398] but I created a new one 
> based on [this 
> suggestion|https://issues.apache.org/jira/browse/DRILL-4398?focusedCommentId=15884846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15884846].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5492) CSV reader does not validate header names, causes nonsense output

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002080#comment-16002080
 ] 

Paul Rogers commented on DRILL-5492:


Could not readily find a way to cause serious harm with this bug, so marked it 
as minor.

> CSV reader does not validate header names, causes nonsense output
> -
>
> Key: DRILL-5492
> URL: https://issues.apache.org/jira/browse/DRILL-5492
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the same test case as in DRILL-5491, but with a slightly different 
> input file:
> {code}
> ___
> a,b,c
> d,e,f
> {code}
> The underscores represent three spaces: use spaces in the real test.
> In this case, the code discussed in DRILL-5491 finds some characters and 
> happily returns the following array:
> {code}
> ["   "]
> {code}
> The field name of three blanks is returned to the client to produce the 
> following bizarre output:
> {code}
> 2 row(s):
> 
> a
> d
> {code}
> The blank line is normally the header, but the header here was considered to 
> be three blanks. (In fact, the blanks are actually printed.)
> Since the blanks were considered to be a field, the file is assumed to have 
> only one field, so only the first column was returned.
> The expected behavior is that spaces are trimmed from field names, so the 
> field name list would be empty and a User Error thrown. (As it stands, it is 
> confusing to the user that some blank header lines produce an NPE, some 
> produce the {{ExecutionSetupException}} shown in DRILL-5491, and some produce 
> blank headings. Behavior should be consistent.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5492) CSV reader does not validate header names, causes nonsense output

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5492:
---
Summary: CSV reader does not validate header names, causes nonsense output  
(was: CSV with spaces for header uses spaces as field name)

> CSV reader does not validate header names, causes nonsense output
> -
>
> Key: DRILL-5492
> URL: https://issues.apache.org/jira/browse/DRILL-5492
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the same test case as in DRILL-5491, but with a slightly different 
> input file:
> {code}
> ___
> a,b,c
> d,e,f
> {code}
> The underscores represent three spaces: use spaces in the real test.
> In this case, the code discussed in DRILL-5491 finds some characters and 
> happily returns the following array:
> {code}
> ["   "]
> {code}
> The field name of three blanks is returned to the client to produce the 
> following bizarre output:
> {code}
> 2 row(s):
> 
> a
> d
> {code}
> The blank line is normally the header, but the header here was considered to 
> be three blanks. (In fact, the blanks are actually printed.)
> Since the blanks were considered to be a field, the file is assumed to have 
> only one field, so only the first column was returned.
> The expected behavior is that spaces are trimmed from field names, so the 
> field name list would be empty and a User Error thrown. (As it stands, it is 
> confusing to the user that some blank header lines produce an NPE, some 
> produce the {{ExecutionSetupException}} shown in DRILL-5491, and some produce 
> blank headings. Behavior should be consistent.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5492) CSV with spaces for header uses spaces as field name

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002078#comment-16002078
 ] 

Paul Rogers commented on DRILL-5492:


Even more fun. Input file:

{code}
EXPR$0,EXPR$1,EXPR$2
a,b,c
d,e,f
{code}

Output:

{code}
2 row(s):
EXPR$0,EXPR$1,EXPR$2
a,b,c
d,e,f
{code}

Then, use this SQL:

{code}
SELECT `EXPR$1` as A, 1 + 2, 3 + 4 FROM `dfs.data`.`csv/test11.csv`
{code}

To get this output:

{code}
A,EXPR$1,EXPR$2
b,3,7
e,3,7
{code}

And this:

{code}
SELECT `EXPR$1`, 1 + 2, 3 + 4 FROM `dfs.data`.`csv/test11.csv`
{code}

Produces this:

{code}
2 row(s):
EXPR$1,EXPR$10,EXPR$11
b,3,7
e,3,7
{code}

At least Drill was smart enough to remap the first expression to EXPR$10 to 
avoid the name conflict...

Still, it would be better to disallow Drill internal names as column headers.
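
A minimal sketch of the kind of guard that could run during header extraction. The class, its name, and its call site are all hypothetical; only the EXPR$n pattern itself comes from the output above:

{code}
import java.util.regex.Pattern;

public class HeaderNameGuard {
  // Drill generates EXPR$0, EXPR$1, ... for unnamed expressions, as seen
  // in the query output above.
  private static final Pattern INTERNAL_NAME = Pattern.compile("EXPR\\$\\d+");

  /** Returns true if a CSV header collides with a Drill-generated name. */
  public static boolean isDrillInternalName(String header) {
    return INTERNAL_NAME.matcher(header.trim()).matches();
  }

  public static void main(String[] args) {
    System.out.println(isDrillInternalName("EXPR$1")); // true
    System.out.println(isDrillInternalName("name"));   // false
  }
}
{code}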

> CSV with spaces for header uses spaces as field name
> 
>
> Key: DRILL-5492
> URL: https://issues.apache.org/jira/browse/DRILL-5492
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the same test case as in DRILL-5491, but with a slightly different 
> input file:
> {code}
> ___
> a,b,c
> d,e,f
> {code}
> The underscores represent three spaces: use spaces in the real test.
> In this case, the code discussed in DRILL-5491 finds some characters and 
> happily returns the following array:
> {code}
> ["   "]
> {code}
> The field name of three blanks is returned to the client to produce the 
> following bizarre output:
> {code}
> 2 row(s):
> 
> a
> d
> {code}
> The blank line is normally the header, but the header here was considered to 
> be three blanks. (In fact, the blanks are actually printed.)
> Since the blanks were considered to be a field, the file is assumed to have 
> only one field, so only the first column was returned.
> The expected behavior is that spaces are trimmed from field names, so the 
> field name list would be empty and a User Error thrown. (As it stands, it is 
> confusing to the user that some blank header lines produce an NPE, some 
> produce the {{ExecutionSetupException}} shown in DRILL-5491, and some produce 
> blank headings. Behavior should be consistent.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5492) CSV with spaces for header uses spaces as field name

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002064#comment-16002064
 ] 

Paul Rogers commented on DRILL-5492:


The fun never ends. Now try this file:

{code}
 *@! , )( , -_-   
a,b,c
d,e,f
{code}

There are spaces before and after each field.

The code returns the following as column headers:

{code}
[" *@! ","  )( ","  -_-  " ]
{code}

With the following as the output:

{code}
2 row(s):
 *@! , )( , -_-   
a,b,c
d,e,f
{code}

Note that the non-symbol characters are allowed as names, and that the 
whitespace in the "names" is preserved.

Expected whitespace to be removed, to allow headings of the form:

{code}
first, second, third
{code}

Expected Drill to reject field names that are not valid Drill symbols, or 
convert the names to some benign form.
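
For the "convert to some benign form" option, a hedged sketch; the {{sanitize}} helper is invented for illustration, and Drill's real identifier rules may be broader than the conservative alphabet used here:

{code}
public class HeaderSanitizer {
  /**
   * Trims a raw CSV header and replaces characters outside a conservative
   * identifier alphabet with underscores. Illustrative only.
   */
  public static String sanitize(String rawHeader) {
    return rawHeader.trim().replaceAll("[^A-Za-z0-9_]", "_");
  }

  public static void main(String[] args) {
    // The three "names" from the input file above.
    System.out.println(sanitize(" *@! ")); // ___
    System.out.println(sanitize(" )( "));  // __
    System.out.println(sanitize(" -_- ")); // ___
  }
}
{code}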

> CSV with spaces for header uses spaces as field name
> 
>
> Key: DRILL-5492
> URL: https://issues.apache.org/jira/browse/DRILL-5492
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the same test case as in DRILL-5491, but with a slightly different 
> input file:
> {code}
> ___
> a,b,c
> d,e,f
> {code}
> The underscores represent three spaces: use spaces in the real test.
> In this case, the code discussed in DRILL-5491 finds some characters and 
> happily returns the following array:
> {code}
> ["   "]
> {code}
> The field name of three blanks is returned to the client to produce the 
> following bizarre output:
> {code}
> 2 row(s):
> 
> a
> d
> {code}
> The blank line is normally the header, but the header here was considered to 
> be three blanks. (In fact, the blanks are actually printed.)
> Since the blanks were considered to be a field, the file is assumed to have 
> only one field, so only the first column was returned.
> The expected behavior is that spaces are trimmed from field names, so the 
> field name list would be empty and a User Error thrown. (As it stands, it is 
> confusing to the user that some blank header lines produce an NPE, some 
> produce the {{ExecutionSetupException}} shown in DRILL-5491, and some produce 
> blank headings. Behavior should be consistent.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5492) CSV with spaces for header uses spaces as field name

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002062#comment-16002062
 ] 

Paul Rogers commented on DRILL-5492:


A variation on this bug. Consider this input file:

{code}
,__,__
a,b,c
d,e,f
{code}

Where, again, the underscores represent spaces. The header parser returns:

{code}
["   ","  ","  "]
{code}

That is, three fields, each some number of spaces. And, query output is:

{code}
,  
a,c
d,f
{code}

Strangely, we pass three field headings into the {{FieldVarCharOutput}} 
constructor (all strings with blanks), but we only get the first and third 
fields in the output.

Expected the query to just fail due to blank headers.
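
A sketch of that expected trim-and-fail behavior. In Drill proper this would presumably be a {{UserException}}; a plain runtime exception stands in here, and the class is hypothetical:

{code}
import java.util.ArrayList;
import java.util.List;

public class HeaderValidation {
  /**
   * Trims each parsed header and fails if nothing survives, rather than
   * silently producing blank field names.
   */
  public static List<String> validate(String[] rawHeaders) {
    List<String> names = new ArrayList<>();
    for (String h : rawHeaders) {
      String trimmed = h.trim();
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    if (names.isEmpty()) {
      throw new IllegalStateException(
          "CSV header line contains no usable field names");
    }
    return names;
  }

  public static void main(String[] args) {
    // The three all-blank headers from the example above.
    validate(new String[] {"   ", "  ", "  "}); // throws
  }
}
{code}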

> CSV with spaces for header uses spaces as field name
> 
>
> Key: DRILL-5492
> URL: https://issues.apache.org/jira/browse/DRILL-5492
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the same test case as in DRILL-5491, but with a slightly different 
> input file:
> {code}
> ___
> a,b,c
> d,e,f
> {code}
> The underscores represent three spaces: use spaces in the real test.
> In this case, the code discussed in DRILL-5491 finds some characters and 
> happily returns the following array:
> {code}
> ["   "]
> {code}
> The field name of three blanks is returned to the client to produce the 
> following bizarre output:
> {code}
> 2 row(s):
> 
> a
> d
> {code}
> The blank line is normally the header, but the header here was considered to 
> be three blanks. (In fact, the blanks are actually printed.)
> Since the blanks were considered to be a field, the file is assumed to have 
> only one field, so only the first column was returned.
> The expected behavior is that spaces are trimmed from field names, so the 
> field name list would be empty and a User Error thrown. (As it stands, it is 
> confusing to the user that some blank header lines produce an NPE, some 
> produce the {{ExecutionSetupException}} shown in DRILL-5491, and some produce 
> blank headings. Behavior should be consistent.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002056#comment-16002056
 ] 

Paul Rogers commented on DRILL-5490:


More bizarre. In actual use, {{recordStart}} is, in fact, 0 during the record. 
But, when extracting headers, the header is read, then {{finishRecord()}} is 
called:

{code}
  public void finishRecord() {
recordStart = characterData;
{code}

At this point, {{recordStart}} is set to "record end"...

So, this line actually works:

{code}
if (this.recordStart != characterData) {
  throw new ExecutionSetupException("record text was requested before 
finishing record");
}
{code}

But, it works in the sense that {{recordStart}} and {{characterData}} both 
point to the *end* of the record. If it were not for the issue described in 
DRILL-5491, the logic would be that {{recordStart}} would be zero if we didn't 
finish the record, but we always do, so the check is a complete no-op.
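
One way to make the check meaningful would be an explicit completion flag rather than comparing two offsets that end up equal either way. A hypothetical sketch, not Drill's actual API:

{code}
class RecordCompletionTracker {
  private boolean recordFinished = false;

  void startRecord()  { recordFinished = false; }
  void finishRecord() { recordFinished = true; }

  String[] getTextOutput(String[] fields) {
    // The flag states the intent directly; no address arithmetic involved.
    if (!recordFinished) {
      throw new IllegalStateException(
          "record text was requested before finishing record");
    }
    return fields;
  }
}
{code}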


> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --
>
> Key: DRILL-5490
> URL: https://issues.apache.org/jira/browse/DRILL-5490
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the 
> {{columns}} array, has a member called {{recordStart}} which is supposed to 
> track the memory offset of the start of the record data in the {{columns}} 
> VarChar array. However, the logic for the member is very wrong. First, it is 
> initialized based on an uninitialized variable:
> {code}
>   public void startBatch() {
> this.recordStart = characterDataOriginal;
> ...
> loadVarCharDataAddress();
>   }
>   private void loadVarCharDataAddress(){
> ...
> this.characterDataOriginal = buf.memoryAddress();
> ...
>   }
> {code}
> Second, the class keeps track of actual memory addresses for the VarChar 
> data. When data would exceed the buffer, memory is reallocated using the 
> following method. But, this method *does not* adjust the {{recordStart}} 
> member, which now points to a bogus memory address:
> {code}
>   private void expandVarCharData(){
> vector.getDataVector().reAlloc();
> long diff = characterData - characterDataOriginal;
> loadVarCharDataAddress();
> characterData += diff;
>   }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the 
> fact that the variable is never used in a way that causes obvious problems 
> (else these issues would have been found sooner). The only real use is:
> {code}
>   @Override
>   public boolean rowHasData() {
> return recordStart < characterData;
>   }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch(StreamFinishedPseudoException e) {
>   // if we've written part of a field or all of a field, we should send 
> this row.
>   if (fieldsWritten == 0 && !output.rowHasData()) {
> throw e;
>   }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to 
> Minor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (DRILL-5492) CSV with spaces for header uses spaces as field name

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5492:
--

 Summary: CSV with spaces for header uses spaces as field name
 Key: DRILL-5492
 URL: https://issues.apache.org/jira/browse/DRILL-5492
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Priority: Minor


Consider the same test case as in DRILL-5491, but with a slightly different 
input file:

{code}
___
a,b,c
d,e,f
{code}

The underscores represent three spaces: use spaces in the real test.

In this case, the code discussed in DRILL-5491 finds some characters and 
happily returns the following array:

{code}
["   "]
{code}

The field name of three blanks is returned to the client to produce the 
following bizarre output:

{code}
2 row(s):

a
d
{code}

The blank line is normally the header, but the header here was considered to be 
three blanks. (In fact, the blanks are actually printed.)

Since the blanks were considered to be a field, the file is assumed to have 
only one field, so only the first column was returned.

The expected behavior is that spaces are trimmed from field names, so the field 
name list would be empty and a User Error thrown. (As it stands, it is confusing 
to the user that some blank header lines produce an NPE, some produce the 
{{ExecutionSetupException}} shown in DRILL-5491, and some produce blank 
headings. Behavior should be consistent.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (DRILL-5491) NPE when reading a CSV file, with headers, but blank header line

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5491:
--

 Summary: NPE when reading a CSV file, with headers, but blank 
header line
 Key: DRILL-5491
 URL: https://issues.apache.org/jira/browse/DRILL-5491
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers


See DRILL-5490 for background.

Try this unit test case:

{code}
FixtureBuilder builder = ClusterFixture.builder()
.maxParallelization(1);

try (ClusterFixture cluster = builder.build();
 ClientFixture client = cluster.clientFixture()) {
  TextFormatConfig csvFormat = new TextFormatConfig();
  csvFormat.fieldDelimiter = ',';
  csvFormat.skipFirstLine = false;
  csvFormat.extractHeader = true;
  cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
  String sql = "SELECT * FROM `dfs.data`.`csv/test7.csv`";
  client.queryBuilder().sql(sql).printCsv();
}
{code}

The test can also be run as a query using your favorite client.

Using this input file:

{code}

a,b,c
d,e,f
{code}

(The first line is blank.)

The following is the result:

{code}
Exception (no rows returned): 
org.apache.drill.common.exceptions.UserRemoteException: 
SYSTEM ERROR: NullPointerException
{code}

The {{RepeatedVarCharOutput}} class tries (but fails for the reasons outlined 
in DRILL-5490) to detect this case.

The code crashes here in {{CompliantTextRecordReader.extractHeader()}}:

{code}
String [] fieldNames = ((RepeatedVarCharOutput)hOutput).getTextOutput();
{code}

Because of bad code in {{RepeatedVarCharOutput.getTextOutput()}}:

{code}
  public String [] getTextOutput () throws ExecutionSetupException {
if (recordCount == 0 || fieldIndex == -1) {
  return null;
}

if (this.recordStart != characterData) {
  throw new ExecutionSetupException("record text was requested before 
finishing record");
}
{code}

Since there is no text on the line, special code elsewhere (see DRILL-5490) 
elects not to increment the {{recordCount}}. (BTW: {{recordCount}} is the 
total across-batch count; the in-batch count, {{batchIndex}}, was probably 
wanted here.) Since the count is zero, we return null.

The author probably thought we'd get a zero-length record, in which case the 
if-statement would throw an exception. But, see DRILL-5490 for why this code 
does not actually work.

The result is one bug (not incrementing the record count), triggering another 
(returning a null), which masks a third ({{recordStart}} is not set correctly, 
so the exception would not be thrown).

All that bad code is just fun and games until we get an NPE, however.
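
A sketch of the shape a fixed {{getTextOutput()}} might take, following the suggestion above to check the in-batch count; the class and its field wiring are simplified stand-ins, not the real Drill code:

{code}
class HeaderRowOutput {
  private final int batchIndex;        // rows completed in this batch
  private final String[] parsedNames;  // whatever the parser collected

  HeaderRowOutput(int batchIndex, String[] parsedNames) {
    this.batchIndex = batchIndex;
    this.parsedNames = parsedNames;
  }

  String[] getTextOutput() {
    // Use the in-batch count and fail cleanly instead of returning null,
    // which is what currently surfaces as the NPE.
    if (batchIndex == 0) {
      throw new IllegalStateException("CSV header line is blank or missing");
    }
    return parsedNames;
  }
}
{code}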



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016
 ] 

Paul Rogers edited comment on DRILL-5490 at 5/9/17 4:46 AM:


The other place that {{recordStart}} is used is to get the first row of data 
for DRILL-951, to extract the header. Consider 
{{CompliantTextRecordReader.extractHeader()}}:

{code}
HeaderOutputMutator hOutputMutator = new HeaderOutputMutator();
TextOutput hOutput = new RepeatedVarCharOutput(hOutputMutator, 
getColumns(), true);
this.allocate(hOutputMutator.fieldVectorMap);
...
// extract first row only
reader.parseNext();

// grab the field names from output
String [] fieldNames = ((RepeatedVarCharOutput)hOutput).getTextOutput();
{code}

The only way the above can work is that we never call 
{{RepeatedVarCharOutput.startBatch()}}, so {{recordStart}} defaults to 0, and so 
the zero-record check accidentally works:

{code}
if (this.recordStart != characterData) {
  throw new ExecutionSetupException("record text was requested before 
finishing record");
}
{code}

It is likely, however, that an empty header line would not be caught and so the 
above exception would never actually fire.


was (Author: paul-rogers):
The other place that {{recordStart}} is used is to get the first row of data 
for DRILL-951, to extract the header. Here again, we are fortunate that one bug 
masks another. The {{RepeatedVarCharOutput}} is never used with headers. 
Consider {{CompliantTextRecordReader.setup()}}:

{code}
  if (settings.isHeaderExtractionEnabled()){
//extract header and use that to setup a set of VarCharVectors
String [] fieldNames = extractHeader();
output = new FieldVarCharOutput(outputMutator, fieldNames, 
getColumns(), isStarQuery());
  } else {
//simply use RepeatedVarCharVector
output = new RepeatedVarCharOutput(outputMutator, getColumns(), 
isStarQuery());
  }
{code}

> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --
>
> Key: DRILL-5490
> URL: https://issues.apache.org/jira/browse/DRILL-5490
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the 
> {{columns}} array, has a member called {{recordStart}} which is supposed to 
> track the memory offset of the start of the record data in the {{columns}} 
> VarChar array. However, the logic for the member is very wrong. First, it is 
> initialized based on an uninitialized variable:
> {code}
>   public void startBatch() {
> this.recordStart = characterDataOriginal;
> ...
> loadVarCharDataAddress();
>   }
>   private void loadVarCharDataAddress(){
> ...
> this.characterDataOriginal = buf.memoryAddress();
> ...
>   }
> {code}
> Second, the class keeps track of actual memory addresses for the VarChar 
> data. When data would exceed the buffer, memory is reallocated using the 
> following method. But, this method *does not* adjust the {{recordStart}} 
> member, which now points to a bogus memory address:
> {code}
>   private void expandVarCharData(){
> vector.getDataVector().reAlloc();
> long diff = characterData - characterDataOriginal;
> loadVarCharDataAddress();
> characterData += diff;
>   }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the 
> fact that the variable is never used in a way that causes obvious problems 
> (else these issues would have been found sooner). The only real use is:
> {code}
>   @Override
>   public boolean rowHasData() {
> return recordStart < characterData;
>   }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch(StreamFinishedPseudoException e) {
>   // if we've written part of a field or all of a field, we should send 
> this row.
>   if (fieldsWritten == 0 && !output.rowHasData()) {
> throw e;
>   }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to 
> Minor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002016#comment-16002016
 ] 

Paul Rogers commented on DRILL-5490:


The other place that {{recordStart}} is used is to get the first row of data 
for DRILL-951, to extract the header. Here again, we are fortunate that one bug 
masks another. The {{RepeatedVarCharOutput}} is never used with headers. 
Consider {{CompliantTextRecordReader.setup()}}:

{code}
  if (settings.isHeaderExtractionEnabled()){
//extract header and use that to setup a set of VarCharVectors
String [] fieldNames = extractHeader();
output = new FieldVarCharOutput(outputMutator, fieldNames, 
getColumns(), isStarQuery());
  } else {
//simply use RepeatedVarCharVector
output = new RepeatedVarCharOutput(outputMutator, getColumns(), 
isStarQuery());
  }
{code}

> RepeatedVarCharOutput.recordStart becomes corrupted on realloc
> --
>
> Key: DRILL-5490
> URL: https://issues.apache.org/jira/browse/DRILL-5490
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The class {{RepeatedVarCharOutput}}, used to read text data into the 
> {{columns}} array, has a member called {{recordStart}} which is supposed to 
> track the memory offset of the start of the record data in the {{columns}} 
> VarChar array. However, the logic for the member is very wrong. First, it is 
> initialized based on an uninitialized variable:
> {code}
>   public void startBatch() {
> this.recordStart = characterDataOriginal;
> ...
> loadVarCharDataAddress();
>   }
>   private void loadVarCharDataAddress(){
> ...
> this.characterDataOriginal = buf.memoryAddress();
> ...
>   }
> {code}
> Second, the class keeps track of actual memory addresses for the VarChar 
> data. When data would exceed the buffer, memory is reallocated using the 
> following method. But, this method *does not* adjust the {{recordStart}} 
> member, which now points to a bogus memory address:
> {code}
>   private void expandVarCharData(){
> vector.getDataVector().reAlloc();
> long diff = characterData - characterDataOriginal;
> loadVarCharDataAddress();
> characterData += diff;
>   }
> {code}
> Fortunately, like so many bugs in this class, these bugs are offset by the 
> fact that the variable is never used in a way that causes obvious problems 
> (else these issues would have been found sooner). The only real use is:
> {code}
>   @Override
>   public boolean rowHasData() {
> return recordStart < characterData;
>   }
> {code}
> Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:
> {code}
> } catch(StreamFinishedPseudoException e) {
>   // if we've written part of a field or all of a field, we should send 
> this row.
>   if (fieldsWritten == 0 && !output.rowHasData()) {
> throw e;
>   }
> }
> {code}
> Thus, the bug will cause a failure only if:
> * A batch is resized, and
> * The new address is lower than the original address, and
> * It is the last batch in the file.
> Because of the low probability of hitting the bug, the priority is set to 
> Minor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (DRILL-5490) RepeatedVarCharOutput.recordStart becomes corrupted on realloc

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5490:
--

 Summary: RepeatedVarCharOutput.recordStart becomes corrupted on 
realloc
 Key: DRILL-5490
 URL: https://issues.apache.org/jira/browse/DRILL-5490
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Priority: Minor


The class {{RepeatedVarCharOutput}}, used to read text data into the 
{{columns}} array, has a member called {{recordStart}} which is supposed to 
track the memory offset of the start of the record data in the {{columns}} 
VarChar array. However, the logic for the member is very wrong. First, it is 
initialized based on an uninitialized variable:

{code}
  public void startBatch() {
this.recordStart = characterDataOriginal;
...
loadVarCharDataAddress();
  }

  private void loadVarCharDataAddress(){
...
this.characterDataOriginal = buf.memoryAddress();
...
  }
{code}

Second, the class keeps track of actual memory addresses for the VarChar data. 
When data would exceed the buffer, memory is reallocated using the following 
method. But, this method *does not* adjust the {{recordStart}} member, which 
now points to a bogus memory address:

{code}
  private void expandVarCharData(){
vector.getDataVector().reAlloc();
long diff = characterData - characterDataOriginal;
loadVarCharDataAddress();
characterData += diff;
  }
{code}

Fortunately, like so many bugs in this class, these bugs are offset by the fact 
that the variable is never used in a way that causes obvious problems (else 
these issues would have been found sooner). The only real use is:

{code}
  @Override
  public boolean rowHasData() {
return recordStart < characterData;
  }
{code}

Which, it turns out, is used only at EOF in {{TextReader.parseRecord()}}:

{code}
} catch(StreamFinishedPseudoException e) {
  // if we've written part of a field or all of a field, we should send 
this row.
  if (fieldsWritten == 0 && !output.rowHasData()) {
throw e;
  }
}
{code}

Thus, the bug will cause a failure only if:

* A batch is resized, and
* The new address is lower than the original address, and
* It is the last batch in the file.

Because of the low probability of hitting the bug, the priority is set to Minor.
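
A sketch of the fix the description implies: rebase {{recordStart}} across the realloc the same way {{characterData}} is rebased. Field types are simplified and {{reallocAndGetNewBase()}} is a hypothetical stand-in for {{reAlloc()}} plus {{buf.memoryAddress()}}:

{code}
class VarCharAddressBookkeeping {
  long characterDataOriginal; // base address of the backing buffer
  long characterData;         // current write position
  long recordStart;           // start of the in-progress record

  /** Stand-in for vector.getDataVector().reAlloc() + buf.memoryAddress(). */
  long reallocAndGetNewBase() {
    return 0L; // placeholder
  }

  void expandVarCharData() {
    // Capture both positions as offsets from the old base...
    long writeOffset  = characterData - characterDataOriginal;
    long recordOffset = recordStart   - characterDataOriginal;
    characterDataOriginal = reallocAndGetNewBase();
    // ...then rebase both against the new base. Rebasing recordStart is
    // the step the current code omits.
    characterData = characterDataOriginal + writeOffset;
    recordStart   = characterDataOriginal + recordOffset;
  }
}
{code}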



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001908#comment-16001908
 ] 

Paul Rogers commented on DRILL-5489:


Similarly, the {{RepeatedVarCharOutput}} class does not protect itself from a 
file with more than 64K fields on input:

{code}
  @Override
  public void startField(int index) {
fieldIndex = index;
collect = collectedFields[index];
fieldOpen = true;
  }
{code}

Here, the parser counts the fields and calls {{startField}} for each. If the 
file has a 65537th field (index 65536), the above method will index into 
{{collectedFields}} to determine whether the field is wanted. But that array is 
hard-coded at 65536 entries, so the code will fail with an array 
index-out-of-bounds exception.

Instead, the code should simply discard extra fields, or throw a 
{{UserException}} to report too many fields.
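
A hedged sketch of the "discard extra fields" option; the field names mirror the snippet above, but the surrounding class is hypothetical:

{code}
class BoundedFieldOutput {
  private static final int MAX_FIELDS = 65536;
  private final boolean[] collectedFields = new boolean[MAX_FIELDS];
  private boolean collect;
  private boolean fieldOpen;
  private int fieldIndex;

  public void startField(int index) {
    fieldIndex = index;
    // Silently discard fields past the supported maximum instead of
    // indexing off the end of the array; throwing a user-facing error
    // would be the other option proposed above.
    collect = index < MAX_FIELDS && collectedFields[index];
    fieldOpen = true;
  }
}
{code}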

> Unprotected array access in RepeatedVarCharOutput ctor
> --
>
> Key: DRILL-5489
> URL: https://issues.apache.org/jira/browse/DRILL-5489
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Suppose a user runs a query of form:
> {code}
> SELECT columns[7] FROM `dfs`.`mycsv.csv`
> {code}
> Internally, this will create a {{PathSegment}} to represent the selected 
> column. This is passed into the {{RepeatedVarCharOutput}} constructor where 
> it is used to set a flag in an array of 64K booleans. But, while the code is 
> very diligent about making sure that the column name is "columns" and that the 
> path segment is an array, it does not check the array value. Instead:
> {code}
> for(Integer i : columnIds){
>   ...
>   fields[i] = true;
> }
> {code}
> We need to add a bounds check to reject array indexes that are not valid: 
> negative or above 64K. It may be that the code further up the hierarchy does 
> the checks. But, if so, it should do the other checks as well. Leaving the 
> checks incomplete is confusing.
> The result:
> {code}
> Exception (no rows returned): 
> org.apache.drill.common.exceptions.UserRemoteException: 
> SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5489:
---
Description: 
Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

The result:

{code}
Exception (no rows returned): 
org.apache.drill.common.exceptions.UserRemoteException: 
SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7
{code}


  was:
Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
earlyTerm = ! parseField();
...
if (earlyTerm) {
  if (ch != newLine) {
input.skipLines(1);
  }
  break;
}
{code}

And:

{code}
  private final boolean parseField() throws IOException {
  ...
  return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
...
return fieldIndex < maxField;
  }
{code}

The only reason that this does not cause data loss is the insertion of an 
unnecessary sort earlier in the constructor:

{code}
Collections.sort(columnIds);
  ...
for (Integer i : columnIds){
{code}

We can remove the sort and compute the maxField correctly and get the same result 
with simpler code.


> Unprotected array access in RepeatedVarCharOutput ctor
> --
>
> Key: DRILL-5489
> URL: https://issues.apache.org/jira/browse/DRILL-5489
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Suppose a user runs a query of form:
> {code}
> SELECT columns[7] FROM `dfs`.`mycsv.csv`
> {code}
> Internally, this will create a {{PathSegment}} to represent the selected 
> column. This is passed into the {{RepeatedVarCharOutput}} constructor where 
> it is used to set a flag in an array of 64K booleans. But, while the code is 
> very diligent about making sure that the column name is "columns" and that the 
> path segment is an array, it does not check the array value. Instead:
> {code}
> for(Integer i : columnIds){
>   ...
>   fields[i] = true;
> }
> {code}
> We need to add a bounds check to reject array indexes that are not valid: 
> negative or above 64K. It may be that the code further up the hierarchy does 
> the checks. But, if so, it should do the other checks as well. Leaving the 
> checks incomplete is confusing.
> The result:
> {code}
> Exception (no rows returned): 
> org.apache.drill.common.exceptions.UserRemoteException: 
> SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001904#comment-16001904
 ] 

Paul Rogers commented on DRILL-5489:


While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
earlyTerm = ! parseField();
...
if (earlyTerm) {
  if (ch != newLine) {
input.skipLines(1);
  }
  break;
}
{code}

And:

{code}
  private final boolean parseField() throws IOException {
  ...
  return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
...
return fieldIndex < maxField;
  }
{code}

The only reason that this does not cause data loss is the insertion of an 
unnecessary sort earlier in the constructor:

{code}
Collections.sort(columnIds);
  ...
for (Integer i : columnIds){
{code}

We can remove the sort and compute the maxField correctly and get the same result 
with simpler code.
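
A sketch of that simplification: compute the true maximum directly, with no sort required. This is a hypothetical standalone form of the constructor logic, not the actual Drill code:

{code}
import java.util.Arrays;
import java.util.List;

public class MaxFieldDemo {
  /** The max is correct regardless of the order of the column ids. */
  static int maxField(List<Integer> columnIds) {
    int maxField = -1;
    for (int i : columnIds) {
      maxField = Math.max(maxField, i);
    }
    return maxField;
  }

  public static void main(String[] args) {
    // SELECT columns[20], columns[1] FROM ... yields ids out of order.
    System.out.println(maxField(Arrays.asList(20, 1))); // 20
  }
}
{code}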

> Unprotected array access in RepeatedVarCharOutput ctor
> --
>
> Key: DRILL-5489
> URL: https://issues.apache.org/jira/browse/DRILL-5489
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Suppose a user runs a query of form:
> {code}
> SELECT columns[7] FROM `dfs`.`mycsv.csv`
> {code}
> Internally, this will create a {{PathSegment}} to represent the selected 
> column. This is passed into the {{RepeatedVarCharOutput}} constructor where 
> it is used to set a flag in an array of 64K booleans. But, while the code is 
> very diligent about making sure that the column name is "columns" and that the 
> path segment is an array, it does not check the array value. Instead:
> {code}
> for(Integer i : columnIds){
>   ...
>   fields[i] = true;
> }
> {code}
> We need to add a bounds check to reject array indexes that are not valid: 
> negative or above 64K. It may be that the code further up the hierarchy does 
> the checks. But, if so, it should do the other checks as well. Leaving the 
> checks incomplete is confusing.
> The result:
> {code}
> Exception (no rows returned): 
> org.apache.drill.common.exceptions.UserRemoteException: 
> SYSTEM ERROR: ArrayIndexOutOfBoundsException: 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5489:
---
Description: 
Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
earlyTerm = ! parseField();
...
if (earlyTerm) {
  if (ch != newLine) {
input.skipLines(1);
  }
  break;
}
{code}

And:

{code}
  private final boolean parseField() throws IOException {
  ...
  return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
...
return fieldIndex < maxField;
  }
{code}

The only reason that this does not cause data loss is the insertion of an 
unnecessary sort earlier in the constructor:

{code}
Collections.sort(columnIds);
  ...
for (Integer i : columnIds){
{code}

We can remove the sort and compute the maxField correctly and get the same result 
with simpler code.

  was:
Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
earlyTerm = ! parseField();
...
if (earlyTerm) {
  if (ch != newLine) {
input.skipLines(1);
  }
  break;
}
{code}

And:

{code}
  private final boolean parseField() throws IOException {
  ...
  return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
...
return fieldIndex < maxField;
  }
{code}



> Unprotected array access in RepeatedVarCharOutput ctor
> --
>
> Key: DRILL-5489
> URL: https://issues.apache.org/jira/browse/DRILL-5489
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Suppose a user runs a query of form:
> {code}
> SELECT columns[7] FROM `dfs`.`mycsv.csv`
> {code}
> Internally, this will create a {{PathSegment}} to represent the selected 
> column. This is passed into the {{RepeatedVarCharOutput}} constructor where 
> it is used to set a flag in an array of 64K booleans. But, while the code is 
> very diligent about making sure that the column name is "c

[jira] [Updated] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5489:
---
Description: 
Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K. It may be that the code further up the hierarchy does 
the checks. But, if so, it should do the other checks as well. Leaving the 
checks incomplete is confusing.

While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

This will muck up the text reader. The text reader "shortcuts" the rest of the 
line if it need not read any more fields. In {{TextReader.parseRecord()}}:

{code}
earlyTerm = ! parseField();
...
if (earlyTerm) {
  if (ch != newLine) {
input.skipLines(1);
  }
  break;
}
{code}

And:

{code}
  private final boolean parseField() throws IOException {
  ...
  return output.endField();
{code}

And, in {{RepeatedVarCharOutput.endField()}}:

{code}
  public boolean endField() {
...
return fieldIndex < maxField;
  }
{code}


  was:
Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K.

While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

It may be that the code further up the hierarchy does the checks. But, if so, 
it should do the other checks as well. Leaving the checks incomplete is 
confusing.


> Unprotected array access in RepeatedVarCharOutput ctor
> --
>
> Key: DRILL-5489
> URL: https://issues.apache.org/jira/browse/DRILL-5489
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Suppose a user runs a query of form:
> {code}
> SELECT columns[7] FROM `dfs`.`mycsv.csv`
> {code}
> Internally, this will create a {{PathSegment}} to represent the selected 
> column. This is passed into the {{RepeatedVarCharOutput}} constructor where 
> it is used to set a flag in an array of 64K booleans. But, while the code is 
> very diligent about making sure that the column name is "columns" and that the 
> path segment is an array, it does not check the array value. Instead:
> {code}
> for(Integer i : columnIds){
>   ...
>   fields[i] = true;
> }
> {code}
> We need to add a bounds check to reject array indexes that are not valid: 
> negative or above 64K. It may be that the code further up the hierarchy does 
> the checks. But, if so, it should do the other checks as well. Leaving the 
> checks incomplete is confusing.
> While we are at it, we might as well fix another bit of bogus code:
> {code}
> for(Integer i : columnIds){
>   maxField = 0;
>   maxField = Math.max(maxField, i);
>   ...
> }
> {code}
> The code to compute maxField simply uses the last value, not the maximum 
> value. This will be thrown off by a query of the form:
> {code}
> SELECT columns[20], columns[1] FROM ...
> {code}
> This will muck up the text 

[jira] [Created] (DRILL-5489) Unprotected array access in RepeatedVarCharOutput ctor

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5489:
--

 Summary: Unprotected array access in RepeatedVarCharOutput ctor
 Key: DRILL-5489
 URL: https://issues.apache.org/jira/browse/DRILL-5489
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Priority: Minor


Suppose a user runs a query of form:
{code}
SELECT columns[7] FROM `dfs`.`mycsv.csv`
{code}

Internally, this will create a {{PathSegment}} to represent the selected 
column. This is passed into the {{RepeatedVarCharOutput}} constructor where it 
is used to set a flag in an array of 64K booleans. But, while the code is very 
diligent about making sure that the column name is "columns" and that the path 
segment is an array, it does not check the array value. Instead:

{code}
for(Integer i : columnIds){
  ...
  fields[i] = true;
}
{code}

We need to add a bounds check to reject array indexes that are not valid: 
negative or above 64K.

While we are at it, we might as well fix another bit of bogus code:

{code}
for(Integer i : columnIds){
  maxField = 0;
  maxField = Math.max(maxField, i);
  ...
}
{code}

The code to compute maxField simply uses the last value, not the maximum value. 
This will be thrown off by a query of the form:

{code}
SELECT columns[20], columns[1] FROM ...
{code}

It may be that the code further up the hierarchy does the checks. But, if so, 
it should do the other checks as well. Leaving the checks incomplete is 
confusing.
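
A sketch of the missing bounds check described above, assuming {{MAX_FIELDS}} is the 64K array size; the class and method names are illustrative, not Drill's:

{code}
import java.util.List;

class ColumnSelectionValidation {
  private static final int MAX_FIELDS = 65536;
  private final boolean[] fields = new boolean[MAX_FIELDS];

  void markSelected(List<Integer> columnIds) {
    for (int i : columnIds) {
      // Reject indexes the 64K array cannot represent instead of letting
      // fields[i] throw ArrayIndexOutOfBoundsException later.
      if (i < 0 || i >= MAX_FIELDS) {
        throw new IllegalArgumentException(
            "columns[" + i + "] is out of range (0.." + (MAX_FIELDS - 1) + ")");
      }
      fields[i] = true;
    }
  }
}
{code}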



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001784#comment-16001784
 ] 

Paul Rogers commented on DRILL-5486:


The net result seems to be that the per-batch row-count setting in 
{{BaseRepeatedValueVector}} is buggy, but irrelevant.
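
The skip-empty-row check quoted below relies on {{fieldIndex}}, which is never reset; a hypothetical sketch of the reset that would make the check meaningful for every record, not just the first:

{code}
class EmptyRowTracking {
  private int fieldIndex = -1;
  private int batchIndex;
  private int recordCount;

  void startField(int index) { fieldIndex = index; }

  void finishRecord() {
    // Count only records that defined at least one field...
    if (fieldIndex > -1) {
      batchIndex++;
      recordCount++;
    }
    // ...and reset, so the test also works for subsequent records.
    fieldIndex = -1;
  }
}
{code}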


> Compliant text reader tries, but fails, to ignore empty rows
> 
>
> Key: DRILL-5486
> URL: https://issues.apache.org/jira/browse/DRILL-5486
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
> ...
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
> ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
> recordStart = characterData;
> ...
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
> PlatformDependent.putInt(repeatedOffset, newOffset);
> repeatedOffset += 4;
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) were used for the 
> batch row count. But it is not: the next level of code keeps a separate 
> count, so the "skip empty row" logic causes the counts to get out of sync 
> between the {{RepeatedVarCharOutput}} and the next level of reader.
> In a sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the 
> first row.
> * That bug is masked because we always set the row offset, even if we ignore 
> the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting 
> the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001777#comment-16001777
 ] 

Paul Rogers edited comment on DRILL-5486 at 5/8/17 11:24 PM:
-

Another complication. Although {{BaseRepeatedValueVector}} sets the 
{{`columns`}} vector's value count (incorrectly, as noted above), the caller, 
{{ScanBatch}}, keeps its own record count and sets that count on each vector. 
So, even if {{setValueCount()}} per vector did anything (which it does not), 
the wrong value set by {{BaseRepeatedValueVector}} would be overwritten by 
{{ScanBatch}}.
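
An illustrative sketch of that overwrite (a simplification, not the actual 
{{ScanBatch}} source):

{code}
// The scan operator pushes its own row count onto every vector in the
// container, clobbering whatever count the reader-level code set earlier.
for (VectorWrapper<?> w : container) {
  w.getValueVector().getMutator().setValueCount(recordCount);
}
{code}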


was (Author: paul-rogers):
Another complications. Although {{BaseRepeatedValueVector}} sets the 
{{`columns`}} vector column count (incorrectly, as noted above), the caller, 
{{ScanBatch}}, keeps its own track of record count, and uses code to set that 
record count on each vector. So, even if {{setValueCount()}} per vector did 
anything (which it does not), the wrong value set by 
{{BaseRepeatedValueVector}} would be overwritten by {{ScanBatch}}.

> Compliant text reader tries, but fails, to ignore empty rows
> 
>
> Key: DRILL-5486
> URL: https://issues.apache.org/jira/browse/DRILL-5486
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
> ...
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
> ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
> recordStart = characterData;
> ...
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
> PlatformDependent.putInt(repeatedOffset, newOffset);
> repeatedOffset += 4;
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) were used for the 
> batch row count. But it is not: the next level of code keeps a separate 
> count, so the "skip empty row" logic causes the counts to get out of sync 
> between the {{RepeatedVarCharOutput}} and the next level of reader.
> In a sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the 
> first row.
> * That bug is masked because we always set the row offset, even if we ignore 
> the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting 
> the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001777#comment-16001777
 ] 

Paul Rogers commented on DRILL-5486:


Another complication. Although {{BaseRepeatedValueVector}} sets the 
{{`columns`}} vector's value count (incorrectly, as noted above), the caller, 
{{ScanBatch}}, keeps its own record count and sets that count on each vector. 
So, even if {{setValueCount()}} per vector did anything (which it does not), 
the wrong value set by {{BaseRepeatedValueVector}} would be overwritten by 
{{ScanBatch}}.

> Compliant text reader tries, but fails, to ignore empty rows
> 
>
> Key: DRILL-5486
> URL: https://issues.apache.org/jira/browse/DRILL-5486
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
> ...
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
> ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
> recordStart = characterData;
> ...
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
> PlatformDependent.putInt(repeatedOffset, newOffset);
> repeatedOffset += 4;
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) were used for the 
> batch row count. But it is not: the next level of code keeps a separate 
> count, so the "skip empty row" logic causes the counts to get out of sync 
> between the {{RepeatedVarCharOutput}} and the next level of reader.
> In a sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the 
> first row.
> * That bug is masked because we always set the row offset, even if we ignore 
> the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting 
> the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5327) Hash aggregate can return empty batch which can cause schema change exception

2017-05-08 Thread Chun Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001776#comment-16001776
 ] 

Chun Chang commented on DRILL-5327:
---

[~jni] I ran against the Apache master branch with assertions enabled but only 
hit the schema change exception.

> Hash aggregate can return empty batch which can cause schema change exception
> -
>
> Key: DRILL-5327
> URL: https://issues.apache.org/jira/browse/DRILL-5327
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.10.0
>Reporter: Chun Chang
>Assignee: Jinfeng Ni
> Attachments: 271dbe08-c725-8e82-b6ea-53e9dab64e8b.sys.drill
>
>
> Hash aggregate can return empty batches, which cause Drill to throw a schema 
> change exception (Drill does not handle this type of schema change). This is 
> not a new bug. But a recent hash function change (a theoretically correct 
> change) may have increased the chance of hitting this issue. I don't have 
> scientific data to support my claim (in fact I don't believe it's the case), 
> but a regression run that used to pass now fails due to this bug. My concern 
> is that existing Drill users out there may have queries that used to work 
> but fail now. It will be difficult to explain why the new release is better 
> for them. I put this bug in as a blocker so we can discuss it before 
> releasing 1.10.
> {noformat}
> /root/drillAutomation/framework-master/framework/resources/Advanced/tpcds/tpcds_sf1/original/text/query66.sql
> Query: 
> -- start query 66 in stream 0 using template query66.tpl 
> SELECT w_warehouse_name, 
>w_warehouse_sq_ft, 
>w_city, 
>w_county, 
>w_state, 
>w_country, 
>ship_carriers, 
>year1,
>Sum(jan_sales) AS jan_sales, 
>Sum(feb_sales) AS feb_sales, 
>Sum(mar_sales) AS mar_sales, 
>Sum(apr_sales) AS apr_sales, 
>Sum(may_sales) AS may_sales, 
>Sum(jun_sales) AS jun_sales, 
>Sum(jul_sales) AS jul_sales, 
>Sum(aug_sales) AS aug_sales, 
>Sum(sep_sales) AS sep_sales, 
>Sum(oct_sales) AS oct_sales, 
>Sum(nov_sales) AS nov_sales, 
>Sum(dec_sales) AS dec_sales, 
>Sum(jan_sales / w_warehouse_sq_ft) AS jan_sales_per_sq_foot, 
>Sum(feb_sales / w_warehouse_sq_ft) AS feb_sales_per_sq_foot, 
>Sum(mar_sales / w_warehouse_sq_ft) AS mar_sales_per_sq_foot, 
>Sum(apr_sales / w_warehouse_sq_ft) AS apr_sales_per_sq_foot, 
>Sum(may_sales / w_warehouse_sq_ft) AS may_sales_per_sq_foot, 
>Sum(jun_sales / w_warehouse_sq_ft) AS jun_sales_per_sq_foot, 
>Sum(jul_sales / w_warehouse_sq_ft) AS jul_sales_per_sq_foot, 
>Sum(aug_sales / w_warehouse_sq_ft) AS aug_sales_per_sq_foot, 
>Sum(sep_sales / w_warehouse_sq_ft) AS sep_sales_per_sq_foot, 
>Sum(oct_sales / w_warehouse_sq_ft) AS oct_sales_per_sq_foot, 
>Sum(nov_sales / w_warehouse_sq_ft) AS nov_sales_per_sq_foot, 
>Sum(dec_sales / w_warehouse_sq_ft) AS dec_sales_per_sq_foot, 
>Sum(jan_net)   AS jan_net, 
>Sum(feb_net)   AS feb_net, 
>Sum(mar_net)   AS mar_net, 
>Sum(apr_net)   AS apr_net, 
>Sum(may_net)   AS may_net, 
>Sum(jun_net)   AS jun_net, 
>Sum(jul_net)   AS jul_net, 
>Sum(aug_net)   AS aug_net, 
>Sum(sep_net)   AS sep_net, 
>Sum(oct_net)   AS oct_net, 
>Sum(nov_net)   AS nov_net, 
>Sum(dec_net)   AS dec_net 
> FROM   (SELECT w_warehouse_name, 
>w_warehouse_sq_ft, 
>w_city, 
>w_county, 
>w_state, 
>w_country, 
>'ZOUROS' 
>|| ',' 
>|| 'ZHOU' AS ship_carriers, 
>d_yearAS year1, 
>Sum(CASE 
>  WHEN d_moy = 1 THEN ws_ext_sales_price * ws_quantity 
> 

[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001748#comment-16001748
 ] 

Paul Rogers commented on DRILL-5486:


As it turns out, the offset vector {{setValueCount}} does not actually do 
anything, so the fact that it is setting the wrong value is moot. See 
DRILL-5488.

> Compliant text reader tries, but fails, to ignore empty rows
> 
>
> Key: DRILL-5486
> URL: https://issues.apache.org/jira/browse/DRILL-5486
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
> ...
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
> ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
> recordStart = characterData;
> ...
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
> PlatformDependent.putInt(repeatedOffset, newOffset);
> repeatedOffset += 4;
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) were used for the 
> batch row count. But it is not: the next level of code keeps a separate 
> count, so the "skip empty row" logic causes the counts to get out of sync 
> between the {{RepeatedVarCharOutput}} and the next level of reader.
> In a sense, there are multiple self-cancelling bugs:
> * The skip-empty-row logic is not synchronized with the upper layer.
> * That bug is partly masked because the logic is not triggered except on the 
> first row.
> * That bug is masked because we always set the row offset, even if we ignore 
> the row in the row count.
> * That bug is masked because the incorrect row count is ignored when setting 
> the container row count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001746#comment-16001746
 ] 

Paul Rogers commented on DRILL-5488:


A similar analysis can be done for the variable-length vectors, such as 
{{VarCharVector}}.

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buffer (though we never use that index.) Since all buffers we use are 
> {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index 
> twice.
> But, notice that the {{setValueCount()}} method itself calls the same 
> {{writerIndex()}} method, so it is actually being called three times.
> It seems this code can simply be discarded: it is called from only two 
> places; neither of which end up using the writer index.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706
 ] 

Paul Rogers edited comment on DRILL-5488 at 5/8/17 11:04 PM:
-

Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity; if it 
is too small, we realloc. Then we test the capacity value saved before the 
loop, even though a reallocation may have changed the actual capacity. It is 
not clear whether the intent is to check the capacity before or after the 
reallocation.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess whether we did a realloc, rather than simply 
updating the monitor when we actually call {{reAlloc()}}. It is also not clear 
that this if-statement's condition matches the condition that triggers the 
realloc.

In fact, a close inspection of the {{setValueCount()}} method shows that it 
does not even work correctly. It allocates enough space to hold the value 
count, which implicitly sets the unused slots to 0. But this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. Then, the offset count is (row count + 1) = 6. What we 
create is:

\[0, 3, 9, 0, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9, 9]

It seems to be the responsibility of the caller to set the last value, so what 
we actually set today, in this case, is:

\[0, 3, 9, 0, 0, 9]

The caveat is that the above works only if the caller diligently fills in the 
offsets row by row. But if it did, there would be no reason to extend the 
vector in {{setValueCount()}}. It has to be one way or the other, not the 
ambiguous middle that we have in the code now.
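
A sketch of the copy-forward behavior argued for above (a hypothetical helper, 
not Drill source):

{code}
// Extend an offset array to (rowCount + 1) entries by carrying the last
// written offset forward instead of leaving zero-filled slots behind.
static int[] fillOffsets(int[] written, int rowCount) {
  int[] offsets = new int[rowCount + 1];
  System.arraycopy(written, 0, offsets, 0, written.length);
  for (int i = written.length; i <= rowCount; i++) {
    offsets[i] = written[written.length - 1];
  }
  return offsets;  // [0, 3, 9] with rowCount 5 becomes [0, 3, 9, 9, 9, 9]
}
{code}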


was (Author: paul-rogers):
Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity. If 
too small, we realloc. Then, we check the value, after the possible 
reallocation, without updating the capacity. It is not clear if we are trying 
to do check the before or after.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess if we did a realloc rather than simply 
incrementing the monitor when we actually do the reAlloc. It is also not clear 
that the if-statement mimics the if that does the reAlloc.

In fact, a close inspection of the {{setValueCount()}} method, shows that it 
does not even work correctly. It is allocating enough space to hold the value 
count, which will implicitly set the unused slots to 0. But, this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. What we create is:

\[0, 3, 9, 0, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9, 9]

It seems to be the responsibility of the caller to set the last value, so what 
we actually set today, in this case, is:

\[0, 3, 9, 0, 0, 9]

The caveat is that the above works if the caller diligently fills in the 
offsets row by row. But, if it die, there would be no reason to extend the 
vector in {{setValueCount()}}. It has to be one way or the other, not the 
ambiguous middle that we have in the code now.

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public cla

[jira] [Comment Edited] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706
 ] 

Paul Rogers edited comment on DRILL-5488 at 5/8/17 11:00 PM:
-

Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity; if it 
is too small, we realloc. Then we test the capacity value saved before the 
loop, even though a reallocation may have changed the actual capacity. It is 
not clear whether the intent is to check the capacity before or after the 
reallocation.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess whether we did a realloc, rather than simply 
updating the monitor when we actually call {{reAlloc()}}. It is also not clear 
that this if-statement's condition matches the condition that triggers the 
realloc.

In fact, a close inspection of the {{setValueCount()}} method shows that it 
does not even work correctly. It allocates enough space to hold the value 
count, which implicitly sets the unused slots to 0. But this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. What we create is:

\[0, 3, 9, 0, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9, 9]

It seems to be the responsibility of the caller to set the last value, so what 
we actually set today, in this case, is:

\[0, 3, 9, 0, 0, 9]

The caveat is that the above works only if the caller diligently fills in the 
offsets row by row. But if it did, there would be no reason to extend the 
vector in {{setValueCount()}}. It has to be one way or the other, not the 
ambiguous middle that we have in the code now.


was (Author: paul-rogers):
Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity. If 
too small, we realloc. Then, we check the value, after the possible 
reallocation, without updating the capacity. It is not clear if we are trying 
to do check the before or after.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess if we did a realloc rather than simply 
incrementing the monitor when we actually do the reAlloc. It is also not clear 
that the if-statement mimics the if that does the reAlloc.

In fact, a close inspection of the {{setValueCount()}} method, shows that it 
does not even work correctly. It is allocating enough space to hold the value 
count, which will implicitly set the unused slots to 0. But, this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. What we create is:

\[0, 3, 9, 0, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9, 9]

It seems to be the responsibility of the caller to set the last value, so what 
we actually set today, in this case, is:

\[0, 3, 9, 0, 0, 9]

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buf

[jira] [Comment Edited] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706
 ] 

Paul Rogers edited comment on DRILL-5488 at 5/8/17 10:57 PM:
-

Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity; if it 
is too small, we realloc. Then we test the capacity value saved before the 
loop, even though a reallocation may have changed the actual capacity. It is 
not clear whether the intent is to check the capacity before or after the 
reallocation.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess whether we did a realloc, rather than simply 
updating the monitor when we actually call {{reAlloc()}}. It is also not clear 
that this if-statement's condition matches the condition that triggers the 
realloc.

In fact, a close inspection of the {{setValueCount()}} method shows that it 
does not even work correctly. It allocates enough space to hold the value 
count, which implicitly sets the unused slots to 0. But this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. What we create is:

\[0, 3, 9, 0, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9, 9]

It seems to be the responsibility of the caller to set the last value, so what 
we actually set today, in this case, is:

\[0, 3, 9, 0, 0, 9]


was (Author: paul-rogers):
Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity. If 
too small, we realloc. Then, we check the value, after the possible 
reallocation, without updating the capacity. It is not clear if we are trying 
to do check the before or after.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess if we did a realloc rather than simply 
incrementing the monitor when we actually do the reAlloc. It is also not clear 
that the if-statement mimics the if that does the reAlloc.

In fact, a close inspection of the {{setValueCount()}} method, shows that it 
does not even work correctly. It is allocating enough space to hold the value 
count, which will implicitly set the unused slots to 0. But, this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. What we create is:

\[0, 3, 9, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9]

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buffer (though we never use that index.) Since all buffers we use are 
> {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index 
> twice.
> But, notice that the {{setValueCount()}} method itself calls the same 
> {{writerIndex()}} method, so it is actually being called three times.
> It seems this code can simply be discarded: it is called from only two 
> places; neither of which end up using the write

[jira] [Comment Edited] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706
 ] 

Paul Rogers edited comment on DRILL-5488 at 5/8/17 10:55 PM:
-

Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity; if it 
is too small, we realloc. Then we test the capacity value saved before the 
loop, even though a reallocation may have changed the actual capacity. It is 
not clear whether the intent is to check the capacity before or after the 
reallocation.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess whether we did a realloc, rather than simply 
updating the monitor when we actually call {{reAlloc()}}. It is also not clear 
that this if-statement's condition matches the condition that triggers the 
realloc.

In fact, a close inspection of the {{setValueCount()}} method shows that it 
does not even work correctly. It allocates enough space to hold the value 
count, which implicitly sets the unused slots to 0. But this is an offset 
vector; we should be copying forward the last set offset. Consider an example:

\[0, 3, 9] as set values.

Set row count as 5. What we create is:

\[0, 3, 9, 0, 0]

What we should have set is:

\[0, 3, 9, 9, 9]


was (Author: paul-rogers):
Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity. If 
too small, we realloc. Then, we check the value, after the possible 
reallocation, without updating the capacity. It is not clear if we are trying 
to do check the before or after.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess if we did a realloc rather than simply 
incrementing the monitor when we actually do the reAlloc. It is also not clear 
that the if-statement mimics the if that does the reAlloc.

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buffer (though we never use that index.) Since all buffers we use are 
> {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index 
> twice.
> But, notice that the {{setValueCount()}} method itself calls the same 
> {{writerIndex()}} method, so it is actually being called three times.
> It seems this code can simply be discarded: it is called from only two 
> places; neither of which end up using the writer index.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001706#comment-16001706
 ] 

Paul Rogers commented on DRILL-5488:


Consider the {{setValueCount()}} method again:

{code}
public void setValueCount(int valueCount) {
  final int currentValueCapacity = getValueCapacity();
  ...
  while(valueCount > getValueCapacity()) {
reAlloc();
  }
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
...
{code}

The {{if}} statement is very probably wrong. We get the current capacity; if it 
is too small, we realloc. Then we test the capacity value saved before the 
loop, even though a reallocation may have changed the actual capacity. It is 
not clear whether the intent is to check the capacity before or after the 
reallocation.

Then:

{code}
  if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
incrementAllocationMonitor(); // is ++allocationMonitor
  } else if (allocationMonitor > 0) {
allocationMonitor = 0;
  }
{code}

It seems we are trying to guess whether we did a realloc, rather than simply 
updating the monitor when we actually call {{reAlloc()}}. It is also not clear 
that this if-statement's condition matches the condition that triggers the 
realloc.
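
A hedged sketch of the more direct bookkeeping this suggests (the placement is 
an assumption, not a tested fix):

{code}
// Update the monitor at the moment the reallocation happens, instead of
// inferring it afterward from a stale capacity value.
while (valueCount > getValueCapacity()) {
  reAlloc();
  incrementAllocationMonitor();  // hypothetical placement, per the comment above
}
{code}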

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buffer (though we never use that index.) Since all buffers we use are 
> {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index 
> twice.
> But, notice that the {{setValueCount()}} method itself calls the same 
> {{writerIndex()}} method, so it is actually being called three times.
> It seems this code can simply be discarded: it is called from only two 
> places; neither of which end up using the writer index.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5488) Useless code in VectorTrimmer, fixed-width vector setValueCount()

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5488:
---
Summary: Useless code in VectorTrimmer, fixed-width vector setValueCount()  
(was: Useless code in VectorTrimmer)

> Useless code in VectorTrimmer, fixed-width vector setValueCount()
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buffer (though we never use that index.) Since all buffers we use are 
> {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index 
> twice.
> But, notice that the {{setValueCount()}} method itself calls the same 
> {{writerIndex()}} method, so it is actually being called three times.
> It seems this code can simply be discarded: it is called from only two 
> places; neither of which end up using the writer index.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5488) Useless code in VectorTrimmer

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5488:
---
Description: 
Consider this code in a generated fixed-width vector, such as UInt4Vector:

{code}
@Override
public void setValueCount(int valueCount) {
  ...
  final int idx = (VALUE_WIDTH * valueCount);
  ...
  VectorTrimmer.trim(data, idx);
  data.writerIndex(valueCount * VALUE_WIDTH);
}
{code}

Consider the {{trim()}} method:

{code}
public class VectorTrimmer {
  ...
  public static void trim(ByteBuf data, int idx) {
data.writerIndex(idx);
if (data instanceof DrillBuf) {
  // data.capacity(idx);
  data.writerIndex(idx);
}
  }
}
{code}

This method is called {{trim}}, but it actually sets the writer index in the 
buffer (though we never use that index.) Since all buffers we use are 
{{DrillBuf}}, the if-statement is a no-op: we simply set the writer index twice.

But, notice that the {{setValueCount()}} method itself calls the same 
{{writerIndex()}} method, so it is actually being called three times.

It seems this code can simply be discarded: it is called from only two places; 
neither of which end up using the writer index.


  was:
Consider this code in a generated fixed-width vector, such as UInt4Vector:

{code}
@Override
public void setValueCount(int valueCount) {
  ...
  final int idx = (VALUE_WIDTH * valueCount);
  ...
  VectorTrimmer.trim(data, idx);
  data.writerIndex(valueCount * VALUE_WIDTH);
}
{code}

Consider the {{trim()}} method:

{code}
public class VectorTrimmer {
  ...
  public static void trim(ByteBuf data, int idx) {
data.writerIndex(idx);
if (data instanceof DrillBuf) {
  // data.capacity(idx);
  data.writerIndex(idx);
}
  }
}
{code}

This method is called {{trim}}, but it actually sets the writer index in the 
buffer (though we never use that index.) Since all buffers we use are 
{{DrillBuf}}, the if-statement is a no-op: we simply set the writer index twice.

It seems this code can simply be discarded: it is called from only two places; 
neither of which end up using the writer index.



> Useless code in VectorTrimmer
> -
>
> Key: DRILL-5488
> URL: https://issues.apache.org/jira/browse/DRILL-5488
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Consider this code in a generated fixed-width vector, such as UInt4Vector:
> {code}
> @Override
> public void setValueCount(int valueCount) {
>   ...
>   final int idx = (VALUE_WIDTH * valueCount);
>   ...
>   VectorTrimmer.trim(data, idx);
>   data.writerIndex(valueCount * VALUE_WIDTH);
> }
> {code}
> Consider the {{trim()}} method:
> {code}
> public class VectorTrimmer {
>   ...
>   public static void trim(ByteBuf data, int idx) {
> data.writerIndex(idx);
> if (data instanceof DrillBuf) {
>   // data.capacity(idx);
>   data.writerIndex(idx);
> }
>   }
> }
> {code}
> This method is called {{trim}}, but it actually sets the writer index in the 
> buffer (though we never use that index.) Since all buffers we use are 
> {{DrillBuf}}, the if-statement is a no-op: we simply set the writer index 
> twice.
> But, notice that the {{setValueCount()}} method itself calls the same 
> {{writerIndex()}} method, so it is actually being called three times.
> It seems this code can simply be discarded: it is called from only two 
> places; neither of which end up using the writer index.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (DRILL-5488) Useless code in VectorTrimmer

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5488:
--

 Summary: Useless code in VectorTrimmer
 Key: DRILL-5488
 URL: https://issues.apache.org/jira/browse/DRILL-5488
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Priority: Trivial


Consider this code in a generated fixed-width vector, such as UInt4Vector:

{code}
@Override
public void setValueCount(int valueCount) {
  ...
  final int idx = (VALUE_WIDTH * valueCount);
  ...
  VectorTrimmer.trim(data, idx);
  data.writerIndex(valueCount * VALUE_WIDTH);
}
{code}

Consider the {{trim()}} method:

{code}
public class VectorTrimmer {
  ...
  public static void trim(ByteBuf data, int idx) {
data.writerIndex(idx);
if (data instanceof DrillBuf) {
  // data.capacity(idx);
  data.writerIndex(idx);
}
  }
}
{code}

This method is called {{trim}}, but it actually sets the writer index in the 
buffer (though we never use that index.) Since all buffers we use are 
{{DrillBuf}}, the if-statement is a no-op: we simply set the writer index twice.

It seems this code can simply be discarded: it is called from only two places; 
neither of which end up using the writer index.
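
If the writer index really is unused by both callers, the method reduces to a 
sketch like this (hedged; the allocation-monitor bookkeeping is elided, and 
{{VALUE_WIDTH}} and {{data}} are as in the snippet above):

{code}
@Override
public void setValueCount(int valueCount) {
  while (valueCount > getValueCapacity()) {
    reAlloc();                                   // grow as before
  }
  data.writerIndex(valueCount * VALUE_WIDTH);    // set the writer index once
}
{code}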




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-4335) Apache Drill should support network encryption

2017-05-08 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam updated DRILL-4335:
---
Labels: ready-to-commit security  (was: security)

> Apache Drill should support network encryption
> --
>
> Key: DRILL-4335
> URL: https://issues.apache.org/jira/browse/DRILL-4335
> Project: Apache Drill
>  Issue Type: New Feature
>Reporter: Keys Botzum
>Assignee: Sorabh Hamirwasia
>  Labels: ready-to-commit, security
> Attachments: ApacheDrillEncryptionUsingSASLDesign.pdf
>
>
> This is clearly related to Drill-291 but wanted to make explicit that this 
> needs to include network level encryption and not just authentication. This 
> is particularly important for the client connection to Drill which will often 
> be sending passwords in the clear until there is encryption.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001672#comment-16001672
 ] 

Paul Rogers commented on DRILL-5486:


More cascading errors. In {{RepeatedVarCharOutput.finishBatch()}}:

{code}
  public void finishBatch() {
mutator.setValueCount(batchIndex);
  }
{code}

Here, we are using {{batchIndex}} as the row count. We must consider two cases: 
either it is right, or it is off by one.

For the correct case, let's look at {{BaseRepeatedValueVector.setValueCount()}}:

{code}
public void setValueCount(int valueCount) {
  // TODO: populate offset end points
  offsets.getMutator().setValueCount(valueCount == 0 ? 0 : valueCount+1);
  final int childValueCount = valueCount == 0 ? 0 : 
offsets.getAccessor().get(valueCount);
  vector.getMutator().setValueCount(childValueCount);
}
{code}

Consider that the row count is one. The offset vector must contain two entries: 
\[0, x], where x is the length of the first (and only) array. That is, the 
offset count is always one more than the row count (except when we have zero 
rows).

Some background if we have just one row:

* The first entry must always be zero.
* To get the length of the ith row, we do: offset\[i+1] - offset\[i];

So, we need the result \[0, final-offset].

Then, we need to tell the array the number of entries it contains. This is the 
sum of the lengths of all arrays in the batch. That value sits at index (row 
count) in the offset vector, as mentioned above.

This works because {{RepeatedVarCharOutput}} updates the (i+1)st position for 
each row i.
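
To make the arithmetic concrete, a tiny sketch of the invariants just described 
(plain Java, not Drill source):

{code}
// offsets[0] == 0; length of row i == offsets[i + 1] - offsets[i];
// total child count == offsets[rowCount].
int[] offsets = {0, 3, 9};                    // two rows, of lengths 3 and 6
int rowCount = offsets.length - 1;            // 2
int lastRowLength = offsets[2] - offsets[1];  // 9 - 3 = 6
int childValueCount = offsets[rowCount];      // 9
{code}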

Now, consider the case with the bug mentioned earlier: we update the offset 
vector, but not the row count. Consider this input:

{code}

a,b,c
{code}

The first line is blank, so the {{batchIndex}} is *not* incremented, but the 
offset vector *is* written. At the end:

{code}
batchIndex = 1
offset vector = [0, 0, 3]
{code}

Then, the above method is given a row count of 1, and so:

{code}
count of row offsets: 2
offset values: [0, 0, 3] (but, value of 3 is ignored because count is set to 2)
childValueCount = row offsets[1] = 0
`columns` array item count: 0 (though there are actually three children)
{code}

Then, the next layer sets the row count to 2 (not knowing that the first row 
was skipped).

So, in the end we have:

* A row count of 2
* A `columns` offset vector of length 2: \[0, 0], meaning 1 row of zero length.
* A `columns` item offset vector of length 0, meaning an empty array (in the 
only row)

This is an unstable situation: 2 rows at the container level, but only one row 
set up in the `columns` vector.

This is very confusing because the code is wrong: one has to understand not 
only what should happen, but also the incorrect behavior that actually happens.
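
A sketch of the narrow fix this trace implies (hedged; {{writeRowOffset()}} is 
a hypothetical stand-in for the offset-vector write):

{code}
public void finishRecord() {
  boolean emptyRow = (fieldIndex == -1);
  if (!emptyRow) {
    writeRowOffset();  // hypothetical: bump the offset vector only for real rows
    batchIndex++;
    recordCount++;
  }
  fieldIndex = -1;     // reset so the *next* empty row is detected too
}
{code}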


> Compliant text reader tries, but fails, to ignore empty rows
> 
>
> Key: DRILL-5486
> URL: https://issues.apache.org/jira/browse/DRILL-5486
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
> headers (which creates a scalar vector per file column), or without headers 
> (that creates a single {{columns}} vector with an array of all columns.
> When run in array mode, the code uses a class called 
> {{RepeatedVarCharOutput}}. This class attempts to ignore empty records:
> {code}
>   public void finishRecord() {
> ...
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
> ...
> {code}
> As it turns out, this works only on the *first* row. On the first row, the 
> {{fieldIndex}} has its initial value of -1. But, for subsequent records, 
> {{fieldIndex}} is never reset and retains its value from the last field set.
> The result is that the code skips the first empty row, but not empty rows 
> elsewhere in the file.
> Further, on the first row, the logic is flawed. The part shown above is only 
> for the row counts. Here is more of the logic:
> {code}
>   @Override
>   public void finishRecord() {
> recordStart = characterData;
> ...
> int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
> PlatformDependent.putInt(repeatedOffset, newOffset);
> repeatedOffset += 4;
> // if there were no defined fields, skip.
> if (fieldIndex > -1) {
>   batchIndex++;
>   recordCount++;
> }
>   }
> {code}
> Note that the underlying vector *is* bumped for the record, even though the 
> row count is not. Later, this will cause the container to have one less 
> record.
> This would be a problem if the row count ({{batchIndex}}) was used for the 
> batch row count. But, it is not, the n

[jira] [Commented] (DRILL-5470) CSV reader data corruption on truncated lines

2017-05-08 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001556#comment-16001556
 ] 

Paul Rogers commented on DRILL-5470:


Turns out the user is not actually using the column-headers feature of CSV. 
Created DRILL-5487 to describe the bug found above in the CSV-with-headers case.

Instead, the user uses the {{columns}} form of CSV files. This ticket will 
focus on that issue.

> CSV reader data corruption on truncated lines
> -
>
> Key: DRILL-5470
> URL: https://issues.apache.org/jira/browse/DRILL-5470
> Project: Apache Drill
>  Issue Type: Bug
>  Components:  Server
>Affects Versions: 1.10.0
> Environment: - ubuntu 14.04
> - r3.8xl (32 CPU/240GB Mem)
> - openjdk version "1.8.0_111"
> - drill 1.10.0 with 8656c83b00f8ab09fb6817e4e9943b2211772541 cherry-picked
>Reporter: Nathan Butler
>Assignee: Paul Rogers
>Priority: Critical
>
> Per the mailing list discussion and Rahul's and Paul's suggestion I'm filing 
> this Jira issue. Drill seems to be running out of memory when doing an 
> External Sort. Per Zelaine's suggestion I enabled 
> sort.external.disable_managed in drill-override.conf and in the sqlline 
> session. This caused the query to run for longer but it still would fail with 
> the same message.
> Per Paul's suggestion, I enabled debug logging for the 
> org.apache.drill.exec.physical.impl.xsort.managed package and re-ran the 
> query.
> Here's the initial DEBUG line for ExternalSortBatch for our query:
> bq. 2017-05-03 12:02:56,095 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:15] 
> DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Config: memory limit = 10737418240, 
> spill file size = 268435456, spill batch size = 8388608, merge limit = 
> 2147483647, merge batch size = 16777216
> And here's the last DEBUG line before the stack trace:
> bq. 2017-05-03 12:37:44,249 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:4] 
> DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Available memory: 10737418240, 
> buffer memory = 10719535268, merge memory = 10707140978
> And the stacktrace:
> {quote}
> 2017-05-03 12:38:02,927 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] INFO  
> o.a.d.e.p.i.x.m.ExternalSortBatch - User Error Occurred: External Sort 
> encountered an error while spilling to disk (Un
> able to allocate buffer of size 268435456 due to memory limit. Current 
> allocation: 10579849472)
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External 
> Sort encountered an error while spilling to disk
> [Error Id: 5d53c677-0cd9-4c01-a664-c02089670a1c ]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544)
>  ~[drill-common-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1447)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:1376)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory(ExternalSortBatch.java:1339)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch(ExternalSortBatch.java:831)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:618)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:660)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:559)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:137)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) 
> [drill-java-exec-1.10.0.jar:1.10.0]
> at 
> org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec

[jira] [Created] (DRILL-5487) Vector corruption in CSV with headers and truncated last row

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5487:
--

 Summary: Vector corruption in CSV with headers and truncated last 
row
 Key: DRILL-5487
 URL: https://issues.apache.org/jira/browse/DRILL-5487
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers


The CSV format plugin allows two ways of reading data:

* As named columns
* As a single array, called {{columns}}, that holds all columns for a row

The named-columns feature will corrupt the offset vectors if the last row of 
the file is truncated, that is, leaves off one or more columns.

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the 
following form:

{code}
h,u
abc,def
ghi
{code}

Note that the file is truncated: the comma and second field are missing on the 
last line.

Then, I created a simple test using the "cluster fixture" framework:

{code}
  @Test
  public void readerTest() throws Exception {
FixtureBuilder builder = ClusterFixture.builder()
.maxParallelization(1);

try (ClusterFixture cluster = builder.build();
 ClientFixture client = cluster.clientFixture()) {
  TextFormatConfig csvFormat = new TextFormatConfig();
  csvFormat.fieldDelimiter = ',';
  csvFormat.skipFirstLine = false;
  csvFormat.extractHeader = true;
  cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
  String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
  client.queryBuilder().sql(sql).printCsv();
}
  }
{code}

The results show we've got a problem:

{code}
Exception (no rows returned): 
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
IllegalArgumentException: length: -3 (expected: >= 0)
{code}

If the last line were:

{code}
ghi,
{code}

Then the offset vector should look like this:

{code}
[0, 3, 3]
{code}

Very likely we have an offset vector that looks like this instead:

{code}
[0, 3, 0]
{code}

When we compute the length of the second column of the second row, we should get:

{code}
length = offset[2] - offset[1] = 3 - 3 = 0
{code}

Instead we get:

{code}
length = offset[2] - offset[1] = 0 - 3 = -3
{code}

In summary, a premature EOF appears to cause the "missing" columns to be 
skipped: they are not filled with a blank value to "bump" the offset vectors 
for the last row. Instead, the offsets are left at 0, causing havoc downstream 
in the query.
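To make the arithmetic concrete, here is a minimal, self-contained sketch 
(plain Java, not Drill's actual vector classes; the class and method names are 
illustrative only) of how a writer that fails to copy the previous offset 
forward produces the negative length shown above:

{code}
// Hypothetical illustration of the offset-vector corruption described above.
public class OffsetVectorDemo {

  // Length of entry i in a variable-width vector: offset[i + 1] - offset[i].
  static int length(int[] offsets, int i) {
    return offsets[i + 1] - offsets[i];
  }

  public static void main(String[] args) {
    // Column "u" after writing "def" for row 1. A correct writer fills the
    // missing value in row 2 with a blank by copying the previous offset
    // forward; the buggy path leaves the final slot at its initial 0.
    int[] correct = {0, 3, 3};
    int[] corrupt = {0, 3, 0};

    System.out.println(length(correct, 1));  // 0: an empty second field
    System.out.println(length(corrupt, 1));  // -3: the reported error
  }
}
{code}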



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5486:
---
Description: 
The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
headers (which creates a scalar vector per file column), or without headers 
(that creates a single {{columns}} vector with an array of all columns.

When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. 
This class attempts to ignore empty records:

{code}
  public void finishRecord() {
...
// if there were no defined fields, skip.
if (fieldIndex > -1) {
  batchIndex++;
  recordCount++;
}
...
{code}

As it turns out, this works only on the *first* row. On the first row, the 
{{fieldIndex}} has its initial value of -1. But, for subsequent records, 
{{fieldIndex}} is never reset and retains its value from the last field set.

The result is that the code skips the first empty row, but not empty rows 
elsewhere in the file.
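For clarity, here is a minimal, hypothetical sketch (invented, simplified 
names, not the real {{RepeatedVarCharOutput}} code) of the skip-empty check 
together with the per-record reset that is missing:

{code}
// Hypothetical illustration: the skip-empty test only works if fieldIndex
// is reset after every record, not just initialized to -1 once.
class RecordOutputSketch {
  int fieldIndex = -1;  // index of the last field written in this record
  int recordCount = 0;

  void startField(int index) { fieldIndex = index; }

  void finishRecord() {
    if (fieldIndex > -1) {  // at least one field was written: count the row
      recordCount++;
    }
    fieldIndex = -1;        // the reset the production code never performs
  }

  public static void main(String[] args) {
    RecordOutputSketch out = new RecordOutputSketch();
    out.finishRecord();                      // empty row: not counted
    out.startField(0); out.finishRecord();   // populated row: counted
    out.finishRecord();                      // empty row: again not counted
    System.out.println(out.recordCount);     // 1: only the populated row
  }
}
{code}

Without the final reset, {{fieldIndex}} retains the last field index of the 
previous non-empty row, so every later empty row still passes the 
{{fieldIndex > -1}} test and is counted.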

Further, on the first row, the logic is flawed. The part shown above is only 
for the row counts. Here is more of the logic:

{code}
  @Override
  public void finishRecord() {
recordStart = characterData;
...
int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
PlatformDependent.putInt(repeatedOffset, newOffset);
repeatedOffset += 4;

// if there were no defined fields, skip.
if (fieldIndex > -1) {
  batchIndex++;
  recordCount++;
}
  }
{code}

Note that the underlying vector *is* bumped for the record, even though the row 
count is not. Later, this will cause the container to have one less record.

This would be a problem if the row count ({{batchIndex}}) were used as the 
batch row count. But it is not: the next level of code keeps a separate count, 
so the "skip empty row" logic causes the counts to get out of sync between 
{{RepeatedVarCharOutput}} and the next level of reader.
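The desync is easy to simulate (a hypothetical, self-contained sketch; all 
names are invented, and only the {{fieldIndex > -1}} test mirrors the code 
above):

{code}
// Hypothetical illustration of the writer and reader counts drifting apart.
public class CountDesyncDemo {
  public static void main(String[] args) {
    int fieldIndex = -1;  // initialized once, never reset (the bug)
    int writerCount = 0;  // stands in for RepeatedVarCharOutput's recordCount
    int readerCount = 0;  // the separate count kept by the next level up

    int[] fieldsPerRow = {0, 2, 0};  // empty row, two-field row, empty row
    for (int fields : fieldsPerRow) {
      if (fields > 0) {
        fieldIndex = fields - 1;  // last field index written for this row
      }
      if (fieldIndex > -1) {
        writerCount++;  // skips only the leading empty row
      }
      readerCount++;    // the upper layer counts every row
    }
    // Prints "2 vs 3": the counts disagree, but only the reader's count is
    // used to set the container row count, so the mismatch stays hidden.
    System.out.println(writerCount + " vs " + readerCount);
  }
}
{code}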

In a sense, there are multiple self-cancelling bugs:

* The skip-empty-row logic is not synchronized with the upper layer.
* That bug is partly masked because the logic is not triggered except on the 
first row.
* That bug is masked because we always set the row offset, even if we ignore 
the row in the row count.
* That bug is masked because the incorrect row count is ignored when setting 
the container row count.


  was:
The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
headers (which creates a scalar vector per file column), or without headers 
(that creates a single {{columns}} vector with an array of all columns.

When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. 
This class attempts to ignore empty records:

{code}
  public void finishRecord() {
...
// if there were no defined fields, skip.
if (fieldIndex > -1) {
  batchIndex++;
  recordCount++;
}
...
{code}

As it turns out, this works only on the *first* row. On the first row, the 
{{fieldIndex}} has its initial value of -1. But, for subsequent records, 
{{fieldIndex}} is never reset and retains its value from the last field set.

The result is that the code skips the first empty row, but not empty rows 
elsewhere in the file.

Further, on the first row, the logic is flawed. The part shown above is only 
for the row counts. Here is more of the logic:

{code}
  @Override
  public void finishRecord() {
recordStart = characterData;
...
int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
PlatformDependent.putInt(repeatedOffset, newOffset);
repeatedOffset += 4;

// if there were no defined fields, skip.
if (fieldIndex > -1) {
  batchIndex++;
  recordCount++;
}
  }
{code}

Note that the underlying vector *is* bumped for the record, even though the row 
count is not. Later, this will cause the container to have one less record.

Suppose we have this data:
[]
[abc, def]

The above bug leaves the data in this form:

[]

Why? We counted only one record (the second), but we created an entry for the 
first (empty) record. When we set the row count, we set it to 1, which covers 
only the first record. The result is the loss of the second row.

Or rather, that would be true if the row count ({{batchIndex}}) were used as 
the batch row count. But it is not: the next level of code keeps a separate 
count, so the "skip empty row" logic causes the counts to get out of sync 
between {{RepeatedVarCharOutput}} and the next level of reader.

In a sense, there are multiple self-cancelling bugs:

* The skip-empty-row logic is not synchronized with the upper layer.
* That bug is partly masked because the logic is not triggered except on the 
first row.
* That bug is masked because we always set the row offset, even if we ignore 
the row in the row count.
* That bug is masked because the incorrect row count is ignored when setting 
the container row count.

[jira] [Created] (DRILL-5486) Compliant text reader tries, but fails, to ignore empty rows

2017-05-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5486:
--

 Summary: Compliant text reader tries, but fails, to ignore empty 
rows
 Key: DRILL-5486
 URL: https://issues.apache.org/jira/browse/DRILL-5486
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Paul Rogers
Priority: Minor


The "compliant" text reader (for CSV files, etc.) has two modes: reading with 
headers (which creates a scalar vector per file column), or without headers 
(that creates a single {{columns}} vector with an array of all columns.

When run in array mode, the code uses a class called {{RepeatedVarCharOutput}}. 
This class attempts to ignore empty records:

{code}
  public void finishRecord() {
...
// if there were no defined fields, skip.
if (fieldIndex > -1) {
  batchIndex++;
  recordCount++;
}
...
{code}

As it turns out, this works only on the *first* row. On the first row, the 
{{fieldIndex}} has its initial value of -1. But, for subsequent records, 
{{fieldIndex}} is never reset and retains its value from the last field set.

The result is that the code skips the first empty row, but not empty rows 
elsewhere in the file.

Further, on the first row, the logic is flawed. The part shown above is only 
for the row counts. Here is more of the logic:

{code}
  @Override
  public void finishRecord() {
recordStart = characterData;
...
int newOffset = ((int) (charLengthOffset - charLengthOffsetOriginal))/4;
PlatformDependent.putInt(repeatedOffset, newOffset);
repeatedOffset += 4;

// if there were no defined fields, skip.
if (fieldIndex > -1) {
  batchIndex++;
  recordCount++;
}
  }
{code}

Note that the underlying vector *is* bumped for the record, even though the row 
count is not. Later, this will cause the container to have one less record.

Suppose we have this data:
[]
[abc, def]

The above bug leaves the data in this form:

[]

Why? We counted only one record (the second), but we created an entry for the 
first (empty) record. When we set the row count, we set it to 1, which covers 
only the first record. The result is the loss of the second row.

Or rather, that would be true if the row count ({{batchIndex}}) were used as 
the batch row count. But it is not: the next level of code keeps a separate 
count, so the "skip empty row" logic causes the counts to get out of sync 
between {{RepeatedVarCharOutput}} and the next level of reader.

In a sense, there are multiple self-cancelling bugs:

* The skip-empty-row logic is not synchronized with the upper layer.
* That bug is partly masked because the logic is not triggered except on the 
first row.
* That bug is masked because we always set the row offset, even if we ignore 
the row in the row count.
* That bug is masked because the incorrect row count is ignored when setting 
the container row count.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (DRILL-4982) Hive Queries degrade when queries switch between different formats

2017-05-08 Thread Dechang Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dechang Gu closed DRILL-4982.
-

Verified and added testcase to perf test framework.

> Hive Queries degrade when queries switch between different formats
> --
>
> Key: DRILL-4982
> URL: https://issues.apache.org/jira/browse/DRILL-4982
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Chunhui Shi
>Assignee: Karthikeyan Manivannan
>Priority: Critical
> Fix For: 1.10.0
>
>
> We have seen degraded performance by doing these steps:
> 1) generate the repro data with the python script repro.py below:
> import string
> import random
>
> for i in range(3000):
>     x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     print "{0}".format(x1), "{0}".format(x2), "{0}".format(x3), "{0}".format(x4), "{0}".format(x5), "{0}".format(x6)
>
> python repro.py > repro.csv
> 2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under 
> hive prompt, use the following sql command to create an external table:
> CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, 
> `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION 
> '/tmp/hiveworkspace/plain'
> 3) create Hive's table of ORC|PARQUET format:
> CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT 
> id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT 
> id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> 4) Switch queries between these two tables; the query time on the same table 
> then lengthens significantly. On my setup, for ORC, it went from 15 sec to 
> 26 sec. Queries on a table of one format, after injecting a query against 
> another format, all show a significant slowdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-4824) JSON with complex nested data produces incorrect output with missing fields

2017-05-08 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-4824:

Issue Type: Improvement  (was: Bug)

> JSON with complex nested data produces incorrect output with missing fields
> ---
>
> Key: DRILL-4824
> URL: https://issues.apache.org/jira/browse/DRILL-4824
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.0.0
>Reporter: Roman
>Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in the case of a JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
> "Field1" : {
> }
> }
> {
> "Field1" : {
> "InnerField1": {"key1":"value1"},
> "InnerField2": {"key2":"value2"}
> }
> }
> {
> "Field1" : {
> "InnerField3" : ["value3", "value4"],
> "InnerField4" : ["value5", "value6"]
> }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------------------------------------------------------------------------------------+
> | Field1                                                                                                   |
> +---------------------------------------------------------------------------------------------------------+
> | {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}                                    |
> | {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}      |
> | {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}  |
> +---------------------------------------------------------------------------------------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested 
> structure, we would get an unreadable result for the user.
> _Correct result:_
> {code:none}
> +-----------------------------------------------------------------------+
> | Field1                                                                |
> +-----------------------------------------------------------------------+
> | {}                                                                    |
> | {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}     |
> | {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} |
> +-----------------------------------------------------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-4917) Unable to alias columns from subquery

2017-05-08 Thread sven0726 (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000383#comment-16000383
 ] 

sven0726 commented on DRILL-4917:
-

The same thing happened to me:

0: jdbc:drill:zk=local> SELECT `respCode`, SUM(`num`) AS `num` FROM  GROUP BY `respCode`;
+-----------+--------+
| respCode  |  $f1   |
+-----------+--------+

> Unable to alias columns from subquery
> -
>
> Key: DRILL-4917
> URL: https://issues.apache.org/jira/browse/DRILL-4917
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Dan Wild
>
> Column aliasing works in a simple query (without subqueries):
> {code}
> select 1 as myVal from (values(1))
> {code}
> My result set gives me one column called myVal as expected
> ||myVal||
> |1|
> However, when I run the query
> {code}
> select myVal as myValAlias FROM(select 1 as myVal from (values(1)))
> {code}
> the alias myValAlias is not applied, and the resulting column is still called 
> myVal
> ||myVal||
> |1|
> This is problematic because if my query instead looked like
> {code}
> select myVal, SUM(myVal) as mySum FROM(select 1 as myVal from (values(1))) 
> GROUP BY myVal
> {code}
> I would get a result set back that looks like this, with no way to alias the 
> second column:
> ||myVal||$f1||
> |1|1|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)