[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Improve table statistics for Parquet format
> -------------------------------------------
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
>
> Right now, in the table basic statistics, the *raw data size* for a row with
> any data type in the Parquet format is 1. This value is an underestimate
> when columns hold complex data structures such as arrays.
> An underestimated raw data size makes Hive assign fewer containers
> (mappers/reducers) to the table, which slows the overall query.
> Heavy underestimation also makes Hive choose MapJoin instead of ShuffleJoin,
> and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking
> complex structures into account. I followed the Writer implementation for
> the ORC format.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
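To make the idea in the description concrete, here is a minimal sketch of a raw-size estimate that recurses into complex values instead of counting every row as 1. It is not the code from the attached patches and not the ORC Writer's logic; the function name `raw_size` and the per-type byte counts are hypothetical choices made only for illustration.

```python
def raw_size(value) -> int:
    """Estimate the raw size in bytes of one value, recursing into nested data.

    The byte counts below are illustrative assumptions, not the values
    used by Hive, Parquet, or ORC.
    """
    if value is None:
        return 0
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return 1
    if isinstance(value, (int, float)):
        return 8
    if isinstance(value, bytes):
        return len(value)
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, dict):  # map or struct: sum over keys and values
        return sum(raw_size(k) + raw_size(v) for k, v in value.items())
    if isinstance(value, (list, tuple)):  # array: sum over the elements
        return sum(raw_size(v) for v in value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


# A row with a scalar column and an array column.
row = (42, [1.5, 2.5, 3.5])
print(raw_size(row))
```

With a fixed estimate of 1 per row, the array column above would contribute nothing extra, however many elements it holds; the recursive estimate grows with the array length.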
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Summary: Improve table statistics for Parquet format (was: Improve table statistics when the table contains arrays)

> Improve table statistics for Parquet format
> -------------------------------------------
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.patch
>
> By default, when the table has table statistics, the value of *rawDataSize*
> is used to estimate the table data size in the execution plan.
> The problem is that *rawDataSize* does not include the data size of arrays,
> so the table size is underestimated when arrays make up most of the table.
> In those cases, *totalSize* is much closer to the truth.
> In this task I propose to take the *max* of *rawDataSize* and the product of
> *totalSize* and *deserializationFactor*.
> I don't know whether this proposal will backfire in specific cases by
> overestimating the size of tables.
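The proposal in this message amounts to a one-line calculation. The sketch below is an illustration only: the function name `estimated_data_size` is hypothetical, the numbers in the usage lines are made up, and it does not reflect how Hive's planner actually reads these statistics.

```python
def estimated_data_size(raw_data_size: int, total_size: int,
                        deserialization_factor: float) -> int:
    """Return max(rawDataSize, totalSize * deserializationFactor).

    total_size is the on-disk size; the factor scales it up to approximate
    the size of the data once it is deserialized in memory.
    """
    return max(raw_data_size, int(total_size * deserialization_factor))


# Array-heavy table: rawDataSize misses the arrays, so the scaled totalSize wins.
print(estimated_data_size(raw_data_size=100, total_size=5000,
                          deserialization_factor=3.0))
# Table where rawDataSize is already the larger figure: it is kept unchanged.
print(estimated_data_size(raw_data_size=20000, total_size=5000,
                          deserialization_factor=3.0))
```

The second call shows the risk raised in the message from the other side: because the result is a maximum, the estimate can only grow, so a large factor may overestimate tables whose *rawDataSize* was already accurate.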
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Description:
Right now, in the table basic statistics, the *raw data size* for a row with any data type in the Parquet format is 1. This value is an underestimate when columns hold complex data structures such as arrays.
An underestimated raw data size makes Hive assign fewer containers (mappers/reducers) to the table, which slows the overall query.
Heavy underestimation also makes Hive choose MapJoin instead of ShuffleJoin, and the MapJoin can then fail with OOM errors.
In this patch, I compute the column data sizes more accurately, taking complex structures into account. I followed the Writer implementation for the ORC format.

    was:
By default, when the table has table statistics, the value of *rawDataSize* is used to estimate the table data size in the execution plan.
The problem is that *rawDataSize* does not include the data size of arrays, so the table size is underestimated when arrays make up most of the table.
In those cases, *totalSize* is much closer to the truth.
In this task I propose to take the *max* of *rawDataSize* and the product of *totalSize* and *deserializationFactor*.
I don't know whether this proposal will backfire in specific cases by overestimating the size of tables.

> Attachments: HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.1.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.2.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.3.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.4.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.5.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.6.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.7.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.8.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.9.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.10.patch
    Status: Patch Available (was: Open)

> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)

> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Pachitariu updated HIVE-20523:
    Attachment: (was: HIVE-20523.11.patch)
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.11.patch
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.11.patch
    Status: Patch Available (was: Open)
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Pachitariu updated HIVE-20523:
    Attachment: HIVE-20523.12.patch
    Status: Patch Available (was: Open)
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format
[ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Pachitariu updated HIVE-20523:
    Status: Open (was: Patch Available)
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch