[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771971#comment-16771971 ]

Antal Sinkovits commented on HIVE-20523:
----------------------------------------

Hi [~george.pachitariu]

Thanks for the answer. I understood your code; this was the initial approach I 
was planning to take as well. :)
So the issue I see is that you only implemented the write (serialize) path, but 
the read part (deserialize) remains as is.
Let me give an example, which might shed some light on what I mean.

For the setup, I've applied your patch on top of master and nothing else.

create table case1 (col string) stored as parquet;
insert into case1 values("This is a test string"); // -> rawDataSize: 105   
analyze table case1 compute statistics; // -> rawDataSize: 144
analyze table case1 compute statistics for columns; // -> rawDataSize: 1

Now if I start to mix these, things get more interesting, because your change 
only calculates the size for the data it writes. So, for example, if I run 
these commands:
create table case2 (col string) stored as parquet;
insert into case2 values("This is a test string"); // -> rawDataSize: 105
analyze table case2 compute statistics for columns; // -> rawDataSize: 1
insert into case2 values("This is a test string"); // -> rawDataSize: 106 (1+105)

That's why I think there should be a single source of truth.
I've checked with the Parquet team, and unfortunately Parquet (unlike ORC) 
doesn't provide any API on the writer side to get the total size. It's only 
available on the reader side, because the value is internal to Parquet and 
only gets written when the file is closed.
So it makes sense to use this as our single source of truth. HIVE-20079 was 
done by [~aihuaxu], I don't want to take credit for it. That change moves the 
stat calculation from the serde to the writer: when the writer closes the file 
and Parquet writes the footer, it reads the stats back from the closed file 
and updates them.
This fixes the write path.
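
For illustration, here is a minimal sketch of how the footer of a closed 
Parquet file exposes these sizes, using the parquet-hadoop reader API 
(ParquetFileReader / HadoopInputFile). The class name and the way the path is 
passed in are just for the example, this is not the exact Hive code path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterStatsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]); // path to an already closed Parquet file

    long rowCount = 0;
    long totalUncompressedSize = 0;

    // The footer only exists once the file has been closed, which is why
    // the writer itself cannot compute these values while writing.
    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      ParquetMetadata footer = reader.getFooter();
      for (BlockMetaData block : footer.getBlocks()) {
        rowCount += block.getRowCount();
        // total uncompressed byte size of all columns in this row group
        totalUncompressedSize += block.getTotalByteSize();
      }
    }

    System.out.println("rows=" + rowCount
        + ", totalUncompressedSize=" + totalUncompressedSize);
  }
}

This is the kind of per-row-group size that the write path can only read back 
after close, and that the read path (analyze) can reuse directly.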

HIVE-21284 was done by me; it fixes the read path to use the same footer value 
on analyze compute statistics for columns.
This way, the calculated value stays consistent no matter which path you take.

Let me know if this makes sense or not. Thanks.

> Improve table statistics for Parquet format
> -------------------------------------------
>
>                 Key: HIVE-20523
>                 URL: https://issues.apache.org/jira/browse/HIVE-20523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>         Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for 
> the ORC format.



