Vitalii Diravka created DRILL-5970:
--------------------------------------

             Summary: DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
                 Key: DRILL-5970
                 URL: https://issues.apache.org/jira/browse/DRILL-5970
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
    Affects Versions: 1.11.0
            Reporter: Vitalii Diravka
            Assignee: Vitalii Diravka
The root cause of the issue is that none of the MapWriters implements adding REQUIRED (non-nullable) data types to the container, so DrillParquetReader always materializes columns as OPTIONAL. This can lead to an invalid schema.
{code}
0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (2.376 seconds)
{code}
Run from the Drill unit test framework (to see the "data mode" of each column):
{code}
  @Test
  public void test() throws Exception {
    setColumnWidths(new int[] {25, 25});
    List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
    printResult(queryDataBatches);
  }

1 row(s):
-------------------------------------------------------
| MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
-------------------------------------------------------
| ["hello","hai"]          | Bucket1                  |
-------------------------------------------------------
Total record count: 1
{code}
The schema of the generated parquet file shows that Bucket is actually stored as REQUIRED, although Drill reports it as OPTIONAL:
{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
message root {
  repeated binary MYCOL (UTF8);
  required binary Bucket (UTF8);
}
{code}
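For reference, the problem boils down to mapping parquet-mr's Type.Repetition onto Drill's TypeProtos.DataMode. Below is a minimal sketch of the mapping a fix would need to apply; the class name RepetitionMapping and the helper toDataMode are hypothetical illustrations, not existing Drill API:
{code}
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.parquet.schema.Type.Repetition;

public class RepetitionMapping {

  // Hypothetical helper: maps a Parquet repetition level to a Drill data mode.
  // The bug reported here is that the complex reader's MapWriters never take
  // the REQUIRED branch, so REQUIRED columns come back as OPTIONAL.
  static DataMode toDataMode(Repetition repetition) {
    switch (repetition) {
      case REQUIRED: return DataMode.REQUIRED; // non-nullable
      case OPTIONAL: return DataMode.OPTIONAL; // nullable
      case REPEATED: return DataMode.REPEATED; // array
      default: throw new IllegalArgumentException("Unknown repetition: " + repetition);
    }
  }

  public static void main(String[] args) {
    // Prints REQUIRED; DrillParquetReader currently reports OPTIONAL instead.
    System.out.println(toDataMode(Repetition.REQUIRED));
  }
}
{code}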
To observe the wrong result this produces, run an aggregation query over files that are read by both the new parquet reader (used by default for complex data types) and the old parquet reader. A false "Hash aggregate does not support schema changes" error will happen.

1) Create two parquet tables.
{code}
0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (1.122 seconds)
0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (0.552 seconds)
{code}
2) Copy the parquet file from bof_repro_1 into bof_repro_2.
{code}
[root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
Found 1 items
-rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
[root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
Found 1 items
-rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
[root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
[root@naravm1 ~]#
{code}
3) Query the table.
{code}
0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
+-------+------------------------------------+
|  ok   |              summary               |
+-------+------------------------------------+
| true  | planner.enable_streamagg updated.  |
+-------+------------------------------------+
1 row selected (0.124 seconds)
0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
+------------------+----------+
|      MYCOL       |  Bucket  |
+------------------+----------+
| ["hello","hai"]  | Bucket1  |
| null             | Bucket1  |
+------------------+----------+
2 rows selected (0.247 seconds)
0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes

Fragment 0:0

[Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
{code}
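To confirm that the "schema change" is introduced by the reader rather than by the files themselves, the footers of both files can be inspected programmatically. A minimal sketch using the parquet-mr footer API, assuming parquet-mr 1.8.x on the classpath (older 1.6.x builds use the parquet.* package prefix instead of org.apache.parquet.*); the class name PrintRepetitions is just for illustration:
{code}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class PrintRepetitions {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    for (String file : Arrays.asList("/tmp/bof_repro_2/0_0_0.parquet",
                                     "/tmp/bof_repro_2/0_0_1.parquet")) {
      // Read only the footer metadata; no row data is touched.
      MessageType schema = ParquetFileReader.readFooter(conf, new Path(file))
          .getFileMetaData().getSchema();
      System.out.println(file);
      for (Type field : schema.getFields()) {
        // Both files should report Bucket as REQUIRED, so the OPTIONAL data
        // mode (and the resulting schema-change error) comes from the reader.
        System.out.println("  " + field.getName() + ": " + field.getRepetition());
      }
    }
  }
}
{code}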