[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772033#comment-16772033
 ] 

George Pachitariu commented on HIVE-20523:
--

Ok, I understand now (and you know much more about this than me :P). Thank you 
for the lesson.

I will close this task.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimate when 
> columns hold complex data structures, like arrays.
> Tables with an underestimated raw data size make Hive assign fewer containers 
> (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation can also make Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the ORC format.
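
To make the ORC-style idea above concrete, here is a minimal Java sketch, assuming a recursive walk over a row's ObjectInspector tree so lists, maps, and structs contribute the sizes of their elements instead of a flat per-row constant. The class name, the string sizing rule, and the DEFAULT_PRIMITIVE_SIZE constant are illustrative assumptions, not the actual patch code.

{code:java}
// Hypothetical sketch (not the HIVE-20523 patch itself): estimate the raw data
// size of one row by walking its ObjectInspector tree, so arrays, maps and
// structs are counted element by element instead of as a flat "1" per row.
import java.util.Map;

import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public final class RawDataSizeEstimator {

  // Rough per-value byte cost for fixed-width primitives; the real per-type
  // costs in Hive are more detailed, this constant is an assumption.
  private static final long DEFAULT_PRIMITIVE_SIZE = 8L;

  public static long estimate(Object value, ObjectInspector oi) {
    if (value == null) {
      return 0L;
    }
    switch (oi.getCategory()) {
      case PRIMITIVE:
        if (oi instanceof StringObjectInspector) {
          String s = ((StringObjectInspector) oi).getPrimitiveJavaObject(value);
          return s == null ? 0L : s.length();
        }
        return DEFAULT_PRIMITIVE_SIZE;
      case LIST: {
        ListObjectInspector loi = (ListObjectInspector) oi;
        long total = 0L;
        for (int i = 0; i < loi.getListLength(value); i++) {
          total += estimate(loi.getListElement(value, i), loi.getListElementObjectInspector());
        }
        return total;
      }
      case MAP: {
        MapObjectInspector moi = (MapObjectInspector) oi;
        long total = 0L;
        for (Map.Entry<?, ?> e : moi.getMap(value).entrySet()) {
          total += estimate(e.getKey(), moi.getMapKeyObjectInspector());
          total += estimate(e.getValue(), moi.getMapValueObjectInspector());
        }
        return total;
      }
      case STRUCT: {
        StructObjectInspector soi = (StructObjectInspector) oi;
        long total = 0L;
        for (StructField f : soi.getAllStructFieldRefs()) {
          total += estimate(soi.getStructFieldData(value, f), f.getFieldObjectInspector());
        }
        return total;
      }
      default: // e.g. UNION: treated like a primitive in this sketch
        return DEFAULT_PRIMITIVE_SIZE;
    }
  }

  private RawDataSizeEstimator() {
  }
}
{code}

With an estimate like this, a row holding an array of 1,000 strings would be counted roughly as the sum of the string lengths rather than as 1.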



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772032#comment-16772032
 ] 

George Pachitariu commented on HIVE-20523:
--

The implementation in the other tasks is better than this one.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771971#comment-16771971
 ] 

Antal Sinkovits commented on HIVE-20523:


Hi [~george.pachitariu]

Thanks for the answer. I understood your code; this was the initial approach I 
was planning to take as well. :)
The issue I see is that you only implemented the write (serialize) path, while 
the read part (deserialize) remains as is. 
Let me give an example, which might shed some light on what I mean.

For the setup, I've applied your patch on top of master and nothing else.

create table case1 (col string) stored as parquet;
insert into case1 values("This is a test string"); // -> rawDataSize: 105   
analyze table case1 compute statistics; // -> rawDataSize: 144
analyze table case1 compute statistics for columns; // -> rawDataSize: 1

Now if I start to mix these, things get more interesting, because your change 
only calculates the size of the data it writes. So for example, if I run these commands:
create table case2 (col string) stored as parquet;
insert into case2 values("This is a test string"); // -> rawDataSize: 105
analyze table case2 compute statistics for columns; // -> rawDataSize: 1
insert into case2 values("This is a test string"); // -> rawDataSize: 106 (1+105)

That's why I think there should be a single source of truth.
I've checked with the Parquet team, and unfortunately Parquet (unlike ORC) 
doesn't provide any API on the writer side to get the total size. It is only 
available on the reader side, because the value is internal to Parquet and only 
gets written when the file is closed.
So it makes sense to use this as our single source of truth. HIVE-20079 was 
done by [~aihuaxu], so I don't want to take credit for it. That change moves the 
stat calculation from the SerDe to the writer: when the writer closes the file 
and Parquet writes the footer, it reads from the closed file and updates the 
stats.
This fixes the write path.
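
As a rough illustration of "read the stats back from the footer once the file is closed", here is a hypothetical sketch using the parquet-mr API. It is not the HIVE-20079/HIVE-21284 code, and summing BlockMetaData.getTotalByteSize() into a rawDataSize-like figure is an assumption made only for this example.

{code:java}
// Hypothetical sketch: once the writer has closed a Parquet file, open its
// footer and derive row count and size figures from the row-group metadata.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public final class FooterStats {

  /** Sums rows and uncompressed byte sizes over all row groups of one file. */
  public static long[] rowCountAndTotalSize(Path file, Configuration conf) throws IOException {
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      ParquetMetadata footer = reader.getFooter();
      long rows = 0L;
      long bytes = 0L;
      for (BlockMetaData block : footer.getBlocks()) {
        rows += block.getRowCount();
        bytes += block.getTotalByteSize(); // uncompressed size of the row group
      }
      return new long[] {rows, bytes};
    }
  }

  private FooterStats() {
  }
}
{code}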

HIVE-21284 was done by me; it fixes the read portion so that analyze compute 
statistics for columns uses the same footer value.
This way, the calculated value stays consistent no matter which path you take.

Let me know if this makes sense. Thanks.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771853#comment-16771853
 ] 

George Pachitariu commented on HIVE-20523:
--

Hi [~asinkovits], thank you for having a look at my patch :).

Can you please give me a concrete example of when my solution would give less 
consistent results compared to yours?

My code gets called for each row, so it sees everything. 
The same is true when reading from the footer of Parquet files (you see all the rows).
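
As a hypothetical illustration of the "called for each row" point (again, not the actual patch code): a SerDe can keep a running total at serialize time and report it through SerDeStats, reusing an estimator like the RawDataSizeEstimator sketch shown earlier in the thread.

{code:java}
// Hypothetical sketch: accumulate a per-row estimate at serialize time and
// expose the running total through SerDeStats, so every written row
// contributes to rawDataSize. RawDataSizeEstimator refers to the earlier
// sketch, not a real Hive class.
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public class RowSizeAccumulator {

  private final SerDeStats stats = new SerDeStats();
  private long rawDataSize = 0L;

  /** Called once per row from the serialize path. */
  public void rowWritten(Object row, ObjectInspector rowInspector) {
    rawDataSize += RawDataSizeEstimator.estimate(row, rowInspector);
    stats.setRawDataSize(rawDataSize);
  }

  public SerDeStats getSerDeStats() {
    return stats;
  }
}
{code}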

I initially went with this approach because I wanted to be consistent with the 
ORC implementation (so that later we can merge the code together, instead of 
always implementing things the Parquet way and duplicating the effort to do it 
the ORC way).

Cheers.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-18 Thread Antal Sinkovits (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771127#comment-16771127
 ] 

Antal Sinkovits commented on HIVE-20523:


[~george.pachitariu] [~ashutoshc]

I've been working on the same issue, and I don't think we should go with this 
approach. The reason is that there would be a discrepancy between calculating 
the raw data size with this logic and using the footer scan that Parquet 
provides.

I think HIVE-20079 and HIVE-21284 would give a more consistent value on insert 
and analyze.

What do you think?



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-06 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762260#comment-16762260
 ] 

Ashutosh Chauhan commented on HIVE-20523:
-

+1 patch needs a refresh and rerun.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-22 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748601#comment-16748601
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12955749/HIVE-20523.12.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 15698 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=38)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] 
(batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=125)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=127)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=131)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/15729/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/15729/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-15729/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12955749 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-22 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748567#comment-16748567
 ] 

Hive QA commented on HIVE-20523:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
41s{color} | {color:blue} ql in master has 2310 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} ql: The patch generated 0 new + 10 unchanged - 4 
fixed = 10 total (was 14) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 26s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-15729/dev-support/hive-personality.sh
 |
| git revision | master / cb74a68 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-15729/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.





[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733187#comment-16733187
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12953626/HIVE-20523.10.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/15482/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/15482/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-15482/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2019-01-03 16:17:11.280
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-15482/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2019-01-03 16:17:11.283
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at 691c4cb HIVE-21044: Add SLF4J reporter to the metastore metrics 
system (Karthik Manamcheri, reviewed by Morio Ramdenbourg and Peter Vary)
+ git clean -f -d
Removing standalone-metastore/metastore-server/src/gen/
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at 691c4cb HIVE-21044: Add SLF4J reporter to the metastore metrics 
system (Karthik Manamcheri, reviewed by Morio Ramdenbourg and Peter Vary)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2019-01-03 16:17:12.026
+ rm -rf ../yetus_PreCommit-HIVE-Build-15482
+ mkdir ../yetus_PreCommit-HIVE-Build-15482
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-15482
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-15482/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
error: 
a/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java: 
does not exist in index
error: 
a/ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java: does 
not exist in index
error: 
a/ql/src/test/results/clientpositive/llap/vector_partitioned_date_time.q.out: 
does not exist in index
error: a/ql/src/test/results/clientpositive/nested_column_pruning.q.out: does 
not exist in index
error: a/ql/src/test/results/clientpositive/parquet_analyze.q.out: does not 
exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_complex_types_vectorization.q.out: 
does not exist in index
error: a/ql/src/test/results/clientpositive/parquet_join.q.out: does not exist 
in index
error: 
a/ql/src/test/results/clientpositive/parquet_map_type_vectorization.q.out: does 
not exist in index
error: a/ql/src/test/results/clientpositive/parquet_no_row_serde.q.out: does 
not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_struct_type_vectorization.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_types_non_dictionary_encoding_vectorization.q.out:
 does not exist in index
error: a/ql/src/test/results/clientpositive/parquet_types_vectorization.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_vectorization_decimal_date.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_vectorization_part_project.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/spark/spark_dynamic_partition_pruning.q.out:
 does not exist in index
error: 
a/ql/src/test/results/clientpositive/spark/vectorization_input_format_excludes.q.out:
 does not exist in index
error: 
a/ql/src/test/results/clientpositive/vectorization_numeric_overflows.q.out: 
does not exist in index
error: 

[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732878#comment-16732878
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12953586/HIVE-20523.9.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 15761 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
 (batchId=132)
org.apache.hadoop.hive.ql.lockmgr.TestDbTxnManager2.testLoadData (batchId=324)
org.apache.hive.jdbc.TestJdbcWithMiniLlapArrow.testComplexQuery (batchId=259)
org.apache.hive.jdbc.TestJdbcWithMiniLlapArrow.testKillQuery (batchId=259)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/15471/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/15471/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-15471/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12953586 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732823#comment-16732823
 ] 

Hive QA commented on HIVE-20523:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
39s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
38s{color} | {color:blue} ql in master has 2312 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
34s{color} | {color:green} ql: The patch generated 0 new + 10 unchanged - 4 
fixed = 10 total (was 14) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 30s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-15471/dev-support/hive-personality.sh
 |
| git revision | master / 691c4cb |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-15471/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.





[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732749#comment-16732749
 ] 

George Pachitariu commented on HIVE-20523:
--

I submitted patch ...8.patch (as .9.patch) again, to re-trigger the Jenkins 
PreCommit-HIVE-Build job.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-12-20 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725988#comment-16725988
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12952530/HIVE-20523.7.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/15404/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/15404/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-15404/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2018-12-20 16:14:40.671
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-15404/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2018-12-20 16:14:40.674
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at 1020be0 HIVE-20943: Handle Compactor transaction abort properly 
(Eugene Koifman, reviewed by Vaibhav Gumashta)
+ git clean -f -d
Removing standalone-metastore/metastore-server/src/gen/
Removing standalone-metastore/src/gen/
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at 1020be0 HIVE-20943: Handle Compactor transaction abort properly 
(Eugene Koifman, reviewed by Vaibhav Gumashta)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2018-12-20 16:14:41.758
+ rm -rf ../yetus_PreCommit-HIVE-Build-15404
+ mkdir ../yetus_PreCommit-HIVE-Build-15404
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-15404
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-15404/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
error: 
a/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java: 
does not exist in index
error: 
a/ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java: does 
not exist in index
error: 
a/ql/src/test/results/clientpositive/llap/vector_partitioned_date_time.q.out: 
does not exist in index
error: a/ql/src/test/results/clientpositive/nested_column_pruning.q.out: does 
not exist in index
error: a/ql/src/test/results/clientpositive/parquet_analyze.q.out: does not 
exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_complex_types_vectorization.q.out: 
does not exist in index
error: a/ql/src/test/results/clientpositive/parquet_join.q.out: does not exist 
in index
error: 
a/ql/src/test/results/clientpositive/parquet_map_type_vectorization.q.out: does 
not exist in index
error: a/ql/src/test/results/clientpositive/parquet_no_row_serde.q.out: does 
not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_struct_type_vectorization.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_types_non_dictionary_encoding_vectorization.q.out:
 does not exist in index
error: a/ql/src/test/results/clientpositive/parquet_types_vectorization.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_vectorization_decimal_date.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/parquet_vectorization_part_project.q.out: 
does not exist in index
error: 
a/ql/src/test/results/clientpositive/spark/spark_dynamic_partition_pruning.q.out:
 does not exist in index
error: 
a/ql/src/test/results/clientpositive/spark/vectorization_input_format_excludes.q.out:
 does not exist in index
error: 
a/ql/src/test/results/clientpositive/vectorization_numeric_overflows.q.out: 
does not exist in index
error: 

[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-10-18 Thread Zoltan Haindrich (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655826#comment-16655826
 ] 

Zoltan Haindrich commented on HIVE-20523:
-

+1, makes sense to me; it's a way better estimate than 1 :)
I think you will need to update the output of some other Parquet tests.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-10-13 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648897#comment-16648897
 ] 

George Pachitariu commented on HIVE-20523:
--

Hi, can anyone please have a look at this patch?

[~asherman] [~janulatha]

I'm not sure how I should proceed.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-27 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630911#comment-16630911
 ] 

George Pachitariu commented on HIVE-20523:
--

Hi Zoltan Haindrich, can you please have a look at this patch?

I have looked at the failed tests. They all failed because the expected raw 
data size has changed, which is the expected behaviour of this patch.



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-27 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630800#comment-16630800
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12941445/HIVE-20523.6.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 21 failed/errored test(s), 15000 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization]
 (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
 (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows]
 (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types]
 (batchId=71)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time]
 (batchId=179)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning]
 (batchId=188)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] 
(batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
 (batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=130)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/14084/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14084/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14084/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 21 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12941445 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-27 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630738#comment-16630738
 ] 

Hive QA commented on HIVE-20523:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m  
0s{color} | {color:blue} ql in master has 2325 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} ql: The patch generated 0 new + 10 unchanged - 4 
fixed = 10 total (was 14) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
13s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 47s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-14084/dev-support/hive-personality.sh
 |
| git revision | master / b382391 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14084/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.





[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-26 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629401#comment-16629401
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12941348/HIVE-20523.5.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 22 failed/errored test(s), 14999 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization]
 (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
 (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows]
 (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types]
 (batchId=71)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time]
 (batchId=179)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning]
 (batchId=188)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] 
(batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
 (batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=130)
org.apache.hadoop.hive.metastore.TestHiveMetaStoreSchemaMethods.schemaQuery 
(batchId=228)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/14064/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14064/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14064/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 22 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12941348 - PreCommit-HIVE-Build

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.
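
For readers who want to see the statistic being discussed, below is a small, hedged Java example that reads the rawDataSize and numRows entries from a table's parameters through the metastore client. The database and table names are placeholders, and this is only a sketch of one way to inspect the basic statistics, not part of the patch.

{code:java}
import java.util.Map;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public final class RawDataSizeReader {

  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // Hypothetical table; replace with a real database/table name.
      Table table = client.getTable("default", "parquet_table");
      Map<String, String> parameters = table.getParameters();
      // "rawDataSize" is the basic-statistics key whose per-row value the
      // issue description says was a flat 1 for Parquet tables.
      System.out.println("rawDataSize = " + parameters.get("rawDataSize"));
      System.out.println("numRows     = " + parameters.get("numRows"));
    } finally {
      client.close();
    }
  }

  private RawDataSizeReader() {
  }
}
{code}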



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-26 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629352#comment-16629352
 ] 

Hive QA commented on HIVE-20523:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
55s{color} | {color:blue} ql in master has 2326 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
38s{color} | {color:red} ql: The patch generated 1 new + 12 unchanged - 2 fixed 
= 13 total (was 14) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
13s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 20s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-14064/dev-support/hive-personality.sh
 |
| git revision | master / dca6ef0 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14064/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14064/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-25 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628149#comment-16628149
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12941276/HIVE-20523.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 24 failed/errored test(s), 14999 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[array_table_stats] 
(batchId=60)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization]
 (batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
 (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows]
 (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types]
 (batchId=71)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_partitioned_date_time]
 (batchId=179)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning]
 (batchId=188)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_join] 
(batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=124)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes]
 (batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_parquet_projection]
 (batchId=130)
org.apache.hadoop.hive.ql.io.parquet.TestParquetSerDe.testParquetHiveSerDe 
(batchId=287)
org.apache.hive.jdbc.miniHS2.TestHs2ConnectionMetricsBinary.testOpenConnectionMetrics
 (batchId=256)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/14049/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14049/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14049/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 24 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12941276 - PreCommit-HIVE-Build

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-25 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628108#comment-16628108
 ] 

Hive QA commented on HIVE-20523:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
 7s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
52s{color} | {color:blue} ql in master has 2326 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
40s{color} | {color:red} ql: The patch generated 1 new + 5 unchanged - 1 fixed 
= 6 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 17s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-14049/dev-support/hive-personality.sh
 |
| git revision | master / 4137c21 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14049/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14049/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625287#comment-16625287
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12940982/HIVE-20523.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 90 failed/errored test(s), 14993 tests 
executed
*Failed tests:*
{noformat}
TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=194)

[druidmini_masking.q,druidmini_test1.q,druidkafkamini_basic.q,druidmini_joins.q,druid_timestamptz.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[alter_table_update_status]
 (batchId=85)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_null_element]
 (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_null] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization]
 (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_read_backward_compatible_files]
 (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_schema_evolution]
 (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
 (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] 
(batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10]
 (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11]
 (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_13]
 (batchId=55)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_14]
 (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_15]
 (batchId=92)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_16]
 (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_17]
 (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_2] 
(batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_3] 
(batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_4] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_5] 
(batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_6] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_7] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_8] 
(batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_9] 
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_div0]
 (batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_limit]
 (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_nested_udf]
 (batchId=50)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_not]
 (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_offset_limit]
 (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_varchar]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_pushdown]
 (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_numeric_overflows]
 (batchId=73)

[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625274#comment-16625274
 ] 

Hive QA commented on HIVE-20523:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
 4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
52s{color} | {color:blue} ql in master has 2326 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
36s{color} | {color:red} ql: The patch generated 1 new + 5 unchanged - 1 fixed 
= 6 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 19s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-14007/dev-support/hive-personality.sh
 |
| git revision | master / 0f7163f |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14007/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14007/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625265#comment-16625265
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12940980/HIVE-20523.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 98 failed/errored test(s), 14995 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_map_emptynullvals]
 (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_null_element]
 (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_create] 
(batchId=94)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_decimal1] 
(batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_null] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_of_arrays_of_ints]
 (batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_of_maps] 
(batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization]
 (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_nested_complex] 
(batchId=94)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_read_backward_compatible_files]
 (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_schema_evolution]
 (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
 (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_type_promotion] 
(batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] 
(batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10]
 (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11]
 (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_13]
 (batchId=55)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_14]
 (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_15]
 (batchId=92)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_16]
 (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_17]
 (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_2] 
(batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_3] 
(batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_4] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_5] 
(batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_6] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_7] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_8] 
(batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_9] 
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_div0]
 (batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_limit]
 (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_nested_udf]
 (batchId=50)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_not]
 (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_offset_limit]
 (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=37)

[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625256#comment-16625256
 ] 

Hive QA commented on HIVE-20523:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
1s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
44s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
50s{color} | {color:blue} ql in master has 2326 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} ql: The patch generated 0 new + 5 unchanged - 1 
fixed = 5 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
13s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 22m 47s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-14006/dev-support/hive-personality.sh
 |
| git revision | master / 0f7163f |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-14006/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625059#comment-16625059
 ] 

Hive QA commented on HIVE-20523:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12940906/HIVE-20523.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 96 failed/errored test(s), 14994 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_analyze] 
(batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_map_emptynullvals]
 (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_array_null_element]
 (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_complex_types_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_create] 
(batchId=94)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_decimal1] 
(batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_join] 
(batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_null] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_of_arrays_of_ints]
 (batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_of_maps] 
(batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_map_type_vectorization]
 (batchId=89)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_nested_complex] 
(batchId=94)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_no_row_serde] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_read_backward_compatible_files]
 (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_schema_evolution]
 (batchId=80)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_struct_type_vectorization]
 (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_type_promotion] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=91)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] 
(batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_10]
 (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_11]
 (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_12]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_13]
 (batchId=55)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_14]
 (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_15]
 (batchId=92)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_16]
 (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_17]
 (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_1] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_2] 
(batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_3] 
(batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_4] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_5] 
(batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_6] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_7] 
(batchId=90)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_8] 
(batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_9] 
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_decimal_date]
 (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_div0]
 (batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_limit]
 (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_nested_udf]
 (batchId=50)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_not]
 (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_offset_limit]
 (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part]
 (batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_part_project]
 (batchId=37)

[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625048#comment-16625048
 ] 

Hive QA commented on HIVE-20523:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
54s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
53s{color} | {color:blue} ql in master has 2326 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
36s{color} | {color:red} ql: The patch generated 51 new + 6 unchanged - 0 fixed 
= 57 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 14s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-13997/dev-support/hive-personality.sh
 |
| git revision | master / c44f2b5 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-13997/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-13997/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row in 
> the Parquet format is 1, regardless of the column data types. This is an 
> underestimate when columns are complex data structures, like arrays.
> Having tables with an underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower.
> Heavy underestimation also makes Hive choose a MapJoin instead of a 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking 
> complex structures into account. I followed the Writer implementation for the 
> ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)