Vitalii Diravka created DRILL-5970:
--------------------------------------

             Summary: DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
                 Key: DRILL-5970
                 URL: https://issues.apache.org/jira/browse/DRILL-5970
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
    Affects Versions: 1.11.0
            Reporter: Vitalii Diravka
            Assignee: Vitalii Diravka
The root cause of the issue is that none of the MapWriters implements adding REQUIRED (non-nullable) data types to the container, so DrillParquetReader always materializes columns as OPTIONAL. This can lead to an invalid schema.
{code}
0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (2.376 seconds)
{code}
Run from the Drill unit test framework (to see the "data mode" of each column):
{code}
  @Test
  public void test() throws Exception {
    setColumnWidths(new int[] {25, 25});
    List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
    printResult(queryDataBatches);
  }

1 row(s):
-------------------------------------------------------
| MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
-------------------------------------------------------
| ["hello","hai"]          | Bucket1                  |
-------------------------------------------------------
Total record count: 1
{code}
The schema of the generated parquet file shows that Bucket is actually stored as REQUIRED, although Drill reports it as OPTIONAL:
{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
message root {
  repeated binary MYCOL (UTF8);
  required binary Bucket (UTF8);
}
{code}
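For reference, the problem boils down to mapping parquet-mr's Type.Repetition onto Drill's TypeProtos.DataMode. Below is a minimal sketch of the mapping a fix would need to apply; the class name RepetitionMapping and the helper toDataMode are hypothetical illustrations, not existing Drill API:
{code}
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.parquet.schema.Type.Repetition;

public class RepetitionMapping {

  // Hypothetical helper: maps a Parquet repetition level to a Drill data mode.
  // The bug reported here is that the complex reader's MapWriters never take
  // the REQUIRED branch, so REQUIRED columns come back as OPTIONAL.
  static DataMode toDataMode(Repetition repetition) {
    switch (repetition) {
      case REQUIRED: return DataMode.REQUIRED; // non-nullable
      case OPTIONAL: return DataMode.OPTIONAL; // nullable
      case REPEATED: return DataMode.REPEATED; // array
      default: throw new IllegalArgumentException("Unknown repetition: " + repetition);
    }
  }

  public static void main(String[] args) {
    // Prints REQUIRED; DrillParquetReader currently reports OPTIONAL instead.
    System.out.println(toDataMode(Repetition.REQUIRED));
  }
}
{code}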
To observe the wrong result this produces, run an aggregation query over files that are read by both the new parquet reader (used by default for complex data types) and the old parquet reader. A false "Hash aggregate does not support schema changes" error will happen.

1) Create two parquet tables.
{code}
0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (1.122 seconds)
0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (0.552 seconds)
{code}
2) Copy the parquet file from bof_repro_1 into bof_repro_2.
{code}
[root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
Found 1 items
-rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
[root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
Found 1 items
-rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
[root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
[root@naravm1 ~]#
{code}
3) Query the table.
{code}
0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
+-------+------------------------------------+
|  ok   |              summary               |
+-------+------------------------------------+
| true  | planner.enable_streamagg updated.  |
+-------+------------------------------------+
1 row selected (0.124 seconds)
0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
+------------------+----------+
|      MYCOL       |  Bucket  |
+------------------+----------+
| ["hello","hai"]  | Bucket1  |
| null             | Bucket1  |
+------------------+----------+
2 rows selected (0.247 seconds)
0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes

Fragment 0:0

[Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
{code}
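To confirm that the "schema change" is introduced by the reader rather than by the files themselves, the footers of both files can be inspected programmatically. A minimal sketch using the parquet-mr footer API, assuming parquet-mr 1.8.x on the classpath (older 1.6.x builds use the parquet.* package prefix instead of org.apache.parquet.*); the class name PrintRepetitions is just for illustration:
{code}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class PrintRepetitions {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    for (String file : Arrays.asList("/tmp/bof_repro_2/0_0_0.parquet",
                                     "/tmp/bof_repro_2/0_0_1.parquet")) {
      // Read only the footer metadata; no row data is touched.
      MessageType schema = ParquetFileReader.readFooter(conf, new Path(file))
          .getFileMetaData().getSchema();
      System.out.println(file);
      for (Type field : schema.getFields()) {
        // Both files should report Bucket as REQUIRED, so the OPTIONAL data
        // mode (and the resulting schema-change error) comes from the reader.
        System.out.println("  " + field.getName() + ": " + field.getRepetition());
      }
    }
  }
}
{code}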