[ https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311417#comment-16311417 ]
Vitalii Diravka commented on DRILL-5970:
----------------------------------------

[~paul-rogers] I was not focused on the MYCOL column; it was used only to show the case where two readers serve one simple query. The point of the original issue is the schema change of the BUCKET column:
{code}
0: jdbc:drill:zk=local> select count(*) from dfs.tmp.bof_repro_2 group by bucket;
Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema change

Prior schema : BatchSchema [fields=[Bucket(VARCHAR:REQUIRED)], selectionVector=NONE]
New schema : BatchSchema [fields=[bucket(VARCHAR:OPTIONAL)], selectionVector=NONE]
{code}
I think this schema change should not happen here. There are performance and memory benefits from using required vectors where we can, so this could be one step toward improving the performance of Drill's new Parquet reader (possibly we will switch to a single reader at some point in the future). Besides that, the schema of Parquet files is well defined, so it looks better to use the Drill data modes that correspond to the Parquet repetition types. If this is the right approach, I will create new tickets for the cases where Drill data modes do not correspond to Parquet repetition types.

> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones
> -------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5970
>                 URL: https://issues.apache.org/jira/browse/DRILL-5970
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data types to the container in all the MapWriters is not implemented.
> This can lead to an invalid schema.
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from the Drill unit test framework (to see the "data mode"):
> {code}
> @Test
> public void test() throws Exception {
>   setColumnWidths(new int[] {25, 25});
>   List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>   printResult(queryDataBatches);
> }
>
> 1 row(s):
> -------------------------------------------------------
> | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
> -------------------------------------------------------
> | ["hello","hai"]          | Bucket1                  |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run the query with aggregation using the new Parquet reader (used by default for complex data types) and the old Parquet reader. A false "Hash aggregate does not support schema changes" error will occur.
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r--   3 mapr mapr   415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r--   3 mapr mapr   368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
> [root@naravm1 ~]#
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_streamagg updated.  |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> |      MYCOL       |  Bucket  |
> +------------------+----------+
> | ["hello","hai"]  | Bucket1  |
> | null             | Bucket1  |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
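The direction argued in the comment above — deriving the Drill data mode directly from the Parquet repetition type instead of always emitting OPTIONAL — can be sketched as follows. This is a minimal, self-contained illustration, not Drill's actual internal API: the enums and the {{toDataMode}} helper are hypothetical stand-ins for Parquet's repetition types ({{required}}/{{optional}}/{{repeated}}) and Drill's data modes.

```java
// Hypothetical sketch, NOT Drill's real code: map a Parquet repetition
// type to the Drill data mode it should produce. Today DrillParquetReader
// effectively returns OPTIONAL for REQUIRED columns, which triggers the
// false "schema change" between the two readers shown above.
public class RepetitionToDataMode {
    // Stand-in for Parquet's repetition types (required/optional/repeated).
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Stand-in for Drill's data modes on value vectors.
    enum DataMode { REQUIRED, OPTIONAL, REPEATED }

    static DataMode toDataMode(Repetition r) {
        switch (r) {
            case REQUIRED: return DataMode.REQUIRED; // non-nullable vector
            case REPEATED: return DataMode.REPEATED; // list-like vector
            default:       return DataMode.OPTIONAL; // nullable vector
        }
    }

    public static void main(String[] args) {
        // "required binary Bucket (UTF8)" in the parquet-tools output above
        // should yield a REQUIRED vector, matching the old reader's schema.
        System.out.println(toDataMode(Repetition.REQUIRED)); // REQUIRED
        System.out.println(toDataMode(Repetition.REPEATED)); // REPEATED
    }
}
```

With such a one-to-one mapping, both readers would build the same BatchSchema for {{Bucket(VARCHAR:REQUIRED)}}, and the hash aggregate would see no schema change.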