[jira] [Updated] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez

2020-07-21 Thread George Pachitariu (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-23891:
-
Description: 
Hello, 

the specific scenario when this can happen:
 - the execution engine is Tez;
 - speculative execution is on;
 - the query inserts into a table and the last step is a UNION sql clause;

The problem is that Tez creates an extra layer of subdirectories when there is 
a UNION. Later, when deduplicating, Hive doesn't take that into account and 
only deduplicates folders but not the files inside.

So for a query like this:
{code:sql}
insert overwrite table union_all
select * from union_first_part
union all
select * from union_second_part;
{code}
The folder structure afterwards will be like this (a possible example):
{code:java}
.../union_all/HIVE_UNION_SUBDIR_1/00_0
.../union_all/HIVE_UNION_SUBDIR_1/00_1
.../union_all/HIVE_UNION_SUBDIR_2/00_1
{code}
The attached patch increases the number of folder levels that Hive will check 
recursively for duplicates when we have a UNION in Tez.

Feel free to reach out if you have any questions :).
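
To make the failure mode concrete, here is a minimal sketch of per-directory deduplication of speculative task outputs. It is not Hive's code: the class and method names are hypothetical, file names are assumed to follow the `<taskId>_<attempt>` shape from the listing above, and keeping the highest attempt number is an arbitrary choice made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; this is not Hive's implementation.
class UnionDedupSketch {

    // The attempt number is the part after the last '_' in the file name.
    static int attemptOf(String path) {
        return Integer.parseInt(path.substring(path.lastIndexOf('_') + 1));
    }

    // Keeps one file per (directory, task id), so files inside each
    // HIVE_UNION_SUBDIR_* folder are deduplicated, not just the folders.
    static List<String> dedupe(List<String> paths) {
        Map<String, String> kept = new TreeMap<>();
        for (String path : paths) {
            // Everything before the attempt suffix: "<dir>/<taskId>".
            String key = path.substring(0, path.lastIndexOf('_'));
            String previous = kept.get(key);
            // Assumption: prefer the highest attempt number.
            if (previous == null || attemptOf(previous) < attemptOf(path)) {
                kept.put(key, path);
            }
        }
        return new ArrayList<>(kept.values());
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "union_all/HIVE_UNION_SUBDIR_1/00_0",
            "union_all/HIVE_UNION_SUBDIR_1/00_1",
            "union_all/HIVE_UNION_SUBDIR_2/00_1");
        dedupe(files).forEach(System.out::println);
    }
}
```

Because the key includes the subdirectory, the two attempts under HIVE_UNION_SUBDIR_1 collapse into one file, while the file under HIVE_UNION_SUBDIR_2 is left alone.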

  was:
Hello, 

the specific scenario when this can happen:
 - the execution engine is Tez;
 - speculative execution is on;
 - the query inserts into a table and the last step is a UNION sql clause;

The problem is that Tez creates an extra layer of subdirectories when there is 
a UNION. Later, when deduplicating, Hive doesn't take that into account and 
only deduplicates folders but not the files inside.

So for a query like this:
{code:sql}
insert overwrite table union_all
select * from union_first_part
union all
select * from union_second_part;
{code}
The folder structure afterwards will be like this (a possible example):
{code:java}
.../union_all/HIVE_UNION_SUBDIR_1/00_0
.../union_all/HIVE_UNION_SUBDIR_1/00_1
.../union_all/HIVE_UNION_SUBDIR_2/00_1
{code}
The attached patch increases the number of folder levels that Hive will check 
recursively for duplicates (recursively) when we have a UNION in Tez.

Feel free to reach out if you have any questions :).


> Using UNION sql clause and speculative execution can cause file duplication 
> in Tez
> --
>
> Key: HIVE-23891
> URL: https://issues.apache.org/jira/browse/HIVE-23891
> Project: Hive
> Issue Type: Bug
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Major
> Attachments: HIVE-23891.1.patch



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez

2020-07-21 Thread George Pachitariu (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-23891:
-
Attachment: HIVE-23891.1.patch
Status: Patch Available  (was: Open)

> Using UNION sql clause and speculative execution can cause file duplication 
> in Tez
> --
>
> Key: HIVE-23891
> URL: https://issues.apache.org/jira/browse/HIVE-23891
> Project: Hive
> Issue Type: Bug
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Major
> Attachments: HIVE-23891.1.patch


[jira] [Assigned] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez

2020-07-21 Thread George Pachitariu (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu reassigned HIVE-23891:



> Using UNION sql clause and speculative execution can cause file duplication 
> in Tez
> --
>
> Key: HIVE-23891
> URL: https://issues.apache.org/jira/browse/HIVE-23891
> Project: Hive
> Issue Type: Bug
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Major


[jira] [Commented] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-13 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816918#comment-16816918
 ] 

George Pachitariu commented on HIVE-21100:
--

Sorry for taking so long to fix these minor errors.

I think the failing test above 
(TestHCatMutableNonPartitioned.testHCatNonPartitionedTable) is not related to 
my patch.

Hi [~ekoifman], can you please review this?

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.1.patch, HIVE-21100.2.patch, 
> HIVE-21100.3.patch, HIVE-21100.patch
>
>
> Right now, when writing data into a table with the Tez engine, if a UNION ALL 
> clause is the last step of the query, Hive on Tez creates a subdirectory for 
> each branch of the UNION ALL.
> With this patch, the subdirectories are removed, and the files are renamed and 
> moved to the parent directory.
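
The flattening described above can be sketched as a rename plan computed from the files found under the table directory. This is only an illustration under stated assumptions: the class name, the method name, and the `_union<n>` suffix used to avoid name collisions are hypothetical and are not taken from the patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the naming scheme is an assumption, not the patch's.
class FlattenPlanSketch {
    static final String SUBDIR_PREFIX = "HIVE_UNION_SUBDIR_";

    // Maps each "HIVE_UNION_SUBDIR_<n>/<file>" path (relative to the table
    // directory) to a new name directly under the table directory.
    static Map<String, String> planFlatten(List<String> relativePaths) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (String path : relativePaths) {
            int slash = path.indexOf('/');
            if (slash < 0 || !path.startsWith(SUBDIR_PREFIX)) {
                continue; // already at the top level; nothing to do
            }
            String branch = path.substring(SUBDIR_PREFIX.length(), slash);
            String file = path.substring(slash + 1);
            // Append the branch number so files from different branches
            // with the same name do not collide after the move.
            renames.put(path, file + "_union" + branch);
        }
        return renames;
    }
}
```

Once every file in the plan has been moved, the now-empty HIVE_UNION_SUBDIR_* directories can be deleted, which leaves a flat table directory that non-recursive readers can list.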



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-12 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Attachment: HIVE-21100.3.patch
Status: Patch Available  (was: Open)

I fixed: "First sentence should end with a period."

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.1.patch, HIVE-21100.2.patch, 
> HIVE-21100.3.patch, HIVE-21100.patch


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-12 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Status: Open  (was: Patch Available)

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.1.patch, HIVE-21100.2.patch, HIVE-21100.patch


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-12 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Attachment: HIVE-21100.2.patch
Status: Patch Available  (was: Open)

I fixed the checkstyle errors.

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.1.patch, HIVE-21100.2.patch, HIVE-21100.patch


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-12 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Status: Open  (was: Patch Available)

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.1.patch, HIVE-21100.patch


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-12 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Attachment: HIVE-21100.1.patch
Status: Patch Available  (was: Open)

I resubmitted the same patch to see the checkstyle errors, and added the Apache 
License header to the newly created test file.

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.1.patch, HIVE-21100.patch


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-04-12 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Status: Open  (was: Patch Available)

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.patch


[jira] [Issue Comment Deleted] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Comment: was deleted

(was: The implementation in the other tasks is better than this one.)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimate 
> when columns are complex data structures, like arrays.
> Tables with an underestimated raw data size make Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data size better, taking into account 
> complex structures. I followed the Writer implementation for the ORC format.
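
As an illustration of what "taking into account complex structures" can mean, the sketch below estimates a value's raw size by recursing into lists and maps instead of returning a constant. It is not the patch's code or the ORC writer's: the class name, the method name, and the per-type byte counts are assumptions chosen only to keep the example short.

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical sketch; the per-type sizes are illustrative assumptions.
class RawSizeSketch {

    // Estimates the raw size of a value, descending into nested structures.
    static long rawSize(Object value) {
        if (value == null) return 0;
        if (value instanceof Boolean || value instanceof Byte) return 1;
        if (value instanceof Short) return 2;
        if (value instanceof Integer || value instanceof Float) return 4;
        if (value instanceof Long || value instanceof Double) return 8;
        if (value instanceof CharSequence) return ((CharSequence) value).length();
        if (value instanceof Collection) {
            long total = 0;
            for (Object element : (Collection<?>) value) {
                total += rawSize(element); // arrays contribute every element
            }
            return total;
        }
        if (value instanceof Map) {
            long total = 0;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                total += rawSize(entry.getKey()) + rawSize(entry.getValue());
            }
            return total;
        }
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }
}
```

With these assumed sizes, an array column holding three ints is estimated at 12 rather than 1, which is the kind of gap that grows with the nesting and length of complex columns.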





[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772033#comment-16772033
 ] 

George Pachitariu commented on HIVE-20523:
--

Ok, I understand now (and you know much more about this than me :P). Thank you 
for the lesson.

I will close this task.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Resolved] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu resolved HIVE-20523.
--
Resolution: Duplicate

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772032#comment-16772032
 ] 

George Pachitariu commented on HIVE-20523:
--

The implementation in the other tasks is better than this one.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-02-19 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771853#comment-16771853
 ] 

George Pachitariu commented on HIVE-20523:
--

Hi [~asinkovits], thank you for having a look at my patch :).

Can you please give me a concrete example of when my solution would give less 
consistent results than yours?

My code gets called for each row, so it sees everything. 
The same holds when reading from the footer of Parquet files (you see all the rows).

I initially went with this approach because I wanted to be consistent with the 
ORC implementation (so that later we can merge the code together, instead of 
always implementing things the Parquet way and then duplicating the effort to do 
it the ORC way).

Cheers.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-22 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-22 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.12.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.12.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Commented] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-01-09 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737980#comment-16737980
 ] 

George Pachitariu commented on HIVE-21100:
--

Hello [~ekoifman], thanks for commenting.

The motivation is that some systems in the Hadoop ecosystem, like Impala, cannot 
read directories recursively. If a table is created by Hive with subdirectories 
and is later queried by Impala, the table will look empty.

I know that this patch will only benefit a few people; that's why it is 
disabled by default and I added an option to turn it on.

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.patch


[jira] [Updated] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-01-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-21100:
-
Attachment: HIVE-21100.patch
Status: Patch Available  (was: Open)

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-21100.patch


[jira] [Assigned] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-01-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu reassigned HIVE-21100:



> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
> Issue Type: Improvement
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-04 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, 
> HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, 
> HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
>
>
> Currently, in the basic table statistics, the *raw data size* of a row in the 
> Parquet format is 1, regardless of its data types. This underestimates the size 
> when columns are complex data structures, such as arrays.
> Tables with an underestimated raw data size make Hive assign fewer 
> containers (mappers/reducers) to them, which slows down the overall query. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data size more accurately, taking complex 
> structures into account. I followed the Writer implementation for the ORC format.
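
To illustrate the idea of sizing complex structures, here is a minimal sketch of a recursive estimate. It is not the code from the patch and does not reproduce the ORC Writer: the constant `PRIMITIVE_SIZE`, the function name `estimate_raw_size`, and the choice to count nulls as 0 are assumptions made for the example. A row is modelled as a tuple of column values, arrays as lists, and maps as dicts.

```python
from typing import Any

# Hypothetical size in bytes for a numeric or boolean value.
# Hive's actual constants may differ.
PRIMITIVE_SIZE = 8


def estimate_raw_size(value: Any) -> int:
    """Recursively estimate the raw data size of a value in bytes."""
    if value is None:
        return 0  # assumption: nulls contribute nothing
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, (bytes, bytearray)):
        return len(value)
    if isinstance(value, (bool, int, float)):
        return PRIMITIVE_SIZE
    if isinstance(value, dict):
        # Maps: both keys and values carry data.
        return sum(estimate_raw_size(k) + estimate_raw_size(v)
                   for k, v in value.items())
    if isinstance(value, (list, tuple)):
        # Arrays, structs and whole rows: sum over the elements.
        return sum(estimate_raw_size(v) for v in value)
    raise TypeError(f"unsupported value type: {type(value).__name__}")


# One row with an int column, an array<string> column and a map column.
row = (7, ["a", "bb", "ccc"], {"k": 1.5})
row_size = estimate_raw_size(row)  # 8 + (1 + 2 + 3) + (1 + 8)
```

Under these assumptions the example row is sized at 23 bytes rather than 1, so the estimate grows with the content of arrays and maps instead of staying constant per row.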





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-04 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.11.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.11.patch, HIVE-20523.2.patch, HIVE-20523.3.patch, 
> HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch, 
> HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-04 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.11.patch

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, 
> HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, 
> HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-04 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: (was: HIVE-20523.11.patch)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, 
> HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, 
> HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.10.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, 
> HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, 
> HIVE-20523.5.patch, HIVE-20523.6.patch, HIVE-20523.7.patch, 
> HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.9.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732749#comment-16732749
 ] 

George Pachitariu commented on HIVE-20523:
--

I submitted patch ...8.patch again (as .9.patch) to retrigger the Jenkins 
PreCommit-Hive-Build.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, 
> HIVE-20523.9.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-03 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-02 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2019-01-02 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.8.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-12-20 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-12-20 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.7.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.7.patch, HIVE-20523.patch


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-10-13 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648897#comment-16648897
 ] 

George Pachitariu commented on HIVE-20523:
--

Hi, can anyone please have a look at this patch?

[~asherman] [~janulatha]

I'm not sure how I should proceed.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.patch


[jira] [Comment Edited] (HIVE-20523) Improve table statistics for Parquet format

2018-09-27 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630911#comment-16630911
 ] 

George Pachitariu edited comment on HIVE-20523 at 9/27/18 7:04 PM:
---

Hi [~kgyrtkirk] , can you please have a look at this patch?

I have looked at the failed tests. They all failed because the expected raw 
data size has changed, which is the expected behaviour of this patch.


was (Author: george.pachitariu):
Hi Zoltan Haindrich, can you please have a look at this patch?

I have looked at the failed tests. They all failed because the expected raw 
data size has changed, which is the expected behaviour of this patch.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.patch


[jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format

2018-09-27 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630911#comment-16630911
 ] 

George Pachitariu commented on HIVE-20523:
--

Hi Zoltan Haindrich, can you please have a look at this patch?

I have looked at the failed tests. They all failed because the expected raw 
data size has changed, which is the expected behaviour of this patch.

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-26 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.6.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, 
> HIVE-20523.6.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-26 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-26 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.5.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-26 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.patch


[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-25 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.4.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-25 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.3.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, 
> HIVE-20523.3.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.2.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.2.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-23 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-22 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.1.patch
Status: Patch Available  (was: Open)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.1.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-22 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Status: Open  (was: Patch Available)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-22 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Description: 
Right now, in the table basic statistics, the *raw data size* for a row with 
any data type in the Parquet format is 1. This is an underestimated value when 
columns are complex data structures, like arrays.

Having tables with underestimated raw data size makes Hive assign fewer 
containers (mappers/reducers) to them, making the overall query slower. 
Heavy underestimation also makes Hive choose MapJoin instead of ShuffleJoin, 
and the MapJoin can then fail with OOM errors.

In this patch, I compute the column data sizes more accurately, taking into account 
complex structures. I followed the Writer implementation for the ORC format.

  was:
By default, when the table contains table-stats, the value of *rawDataSize* is 
taken to estimate the table data size in the execution plan.

The problem is that rawDataSize does not include the data size of arrays. As a 
result, the table size is underestimated when arrays make up most of the table size.

In those specific cases, the value of the *totalSize* is much closer to the 
truth.

In this task I propose to take the *max* value between *rawDataSize* and 
*totalSize*deserializationFactor*.

I don't know if this proposal will backfire in any specific cases 
(overestimating the size of tables).


> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with 
> any data type in the Parquet format is 1. This is an underestimated value 
> when columns are complex data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer 
> containers (mappers/reducers) to them, making the overall query slower. 
> Heavy underestimation also makes Hive choose MapJoin instead of 
> ShuffleJoin, and the MapJoin can then fail with OOM errors.
> In this patch, I compute the column data sizes more accurately, taking into 
> account complex structures. I followed the Writer implementation for the ORC format.
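
The idea in the updated description, sizing complex columns by walking into 
them instead of counting each value as 1, can be illustrated with a minimal 
sketch. This is not the code from the attached patches, nor Hive's Parquet 
SerDe or ORC Writer: the function names and the per-type byte sizes below are 
assumptions chosen only for illustration.

```python
# Illustrative sketch only. The byte sizes and helper names are assumptions,
# not values or APIs taken from Hive.
PRIMITIVE_SIZES = {bool: 1, int: 8, float: 8}  # assumed size per primitive value


def raw_data_size(value):
    """Estimate the raw size of one column value, recursing into nested data."""
    if value is None:
        return 0
    if isinstance(value, (str, bytes)):
        # Length is used as a stand-in for the serialized size.
        return len(value)
    if isinstance(value, (list, tuple)):
        # Array column: sum the sizes of all elements.
        return sum(raw_data_size(v) for v in value)
    if isinstance(value, dict):
        # Map or struct column: sum the sizes of keys and values.
        return sum(raw_data_size(k) + raw_data_size(v) for k, v in value.items())
    return PRIMITIVE_SIZES.get(type(value), 1)


def row_raw_data_size(row):
    """Estimate the raw size of a row as the sum of its column values."""
    return sum(raw_data_size(col) for col in row)


# A hypothetical row: an int, a string, an array of doubles, a map with an array.
row = [42, "abc", [1.0, 2.0, 3.0], {"k": [1, 2]}]
flat_estimate = len(row)                  # counting every column value as 1
nested_estimate = row_raw_data_size(row)  # recursing into complex values
```

With these assumed sizes, the flat estimate for the sample row is 4 while the 
recursive estimate is 52, which shows how quickly the gap grows once arrays 
and maps hold most of the data.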





[jira] [Updated] (HIVE-20523) Improve table statistics for Parquet format

2018-09-22 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Summary: Improve table statistics for Parquet format  (was: Improve table 
statistics when the table contains arrays)

> Improve table statistics for Parquet format
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> By default, when the table contains table-stats, the value of *rawDataSize* 
> is taken to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. 
> As a result, the table size is underestimated when arrays make up most of 
> the table size.
> In those specific cases, the value of the *totalSize* is much closer to the 
> truth.
> In this task I propose to take the *max* value between *rawDataSize* and 
> *totalSize*deserializationFactor*.
> I don't know if this proposal will backfire in any specific cases 
> (overestimating the size of tables).





[jira] [Comment Edited] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-09-10 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609305#comment-16609305
 ] 

George Pachitariu edited comment on HIVE-20480 at 9/10/18 2:44 PM:
---

Hi [~kgyrtkirk], thanks. I re-uploaded the patch with the null check. The 
hiveqa comments above are from the second upload.

After the first version of the patch, there were no comments from hiveqa. I 
asked about PTest2 so that I know what to do in the future.


was (Author: george.pachitariu):
Hi [~kgyrtkirk], thanks. I re-uploaded the patch with the null check. The 
hiveqa comments above are from the second upload.

After the first version of the patch, there were no comments from hiveqa.

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20480.1.patch, HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]





[jira] [Commented] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-09-10 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609305#comment-16609305
 ] 

George Pachitariu commented on HIVE-20480:
--

Hi [~kgyrtkirk], thanks. I re-uploaded the patch with the null check. The 
hiveqa comments above are from the second upload.

After the first version of the patch, there were no comments from hiveqa.

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20480.1.patch, HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]





[jira] [Commented] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-09-09 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608424#comment-16608424
 ] 

George Pachitariu commented on HIVE-20480:
--

Hello [~maheshk114] and [~ashutoshc], I am sorry for breaking the ptest build 
earlier. 
I will follow the Jenkins PreCommit-HIVE-Build from now on.

Related question: Should I invest the time to set up the "Hive PTest2" 
infrastructure locally?

Thank you for your time.

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20480.1.patch, HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]





[jira] [Issue Comment Deleted] (HIVE-20523) Improve table statistics when the table contains arrays

2018-09-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Comment: was deleted

(was: This is my understanding of why the original behaviour happens (and 
please correct me if I'm wrong):

rawDataSize is computed from the schema in the objectInspector (one example is 
[here|https://github.com/apache/hive/blob/rel/release-3.1.0/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L166]).
 And the inspector for an array is UnionStructObjectInspector which will return 
size = 1.)

> Improve table statistics when the table contains arrays
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> By default, when the table contains table-stats, the value of *rawDataSize* 
> is taken to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. 
> As a result, the table size is underestimated when arrays make up most of 
> the table size.
> In those specific cases, the value of the *totalSize* is much closer to the 
> truth.
> In this task I propose to take the *max* value between *rawDataSize* and 
> *totalSize*deserializationFactor*.
> I don't know if this proposal will backfire in any specific cases 
> (overestimating the size of tables).





[jira] [Commented] (HIVE-20523) Improve table statistics when the table contains arrays

2018-09-08 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608183#comment-16608183
 ] 

George Pachitariu commented on HIVE-20523:
--

This is my understanding of why the original behaviour happens (and please 
correct me if I'm wrong):

rawDataSize is computed from the schema in the objectInspector (one example is 
[here|https://github.com/apache/hive/blob/rel/release-3.1.0/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L166]).
 And the inspector for an array is UnionStructObjectInspector which will return 
size = 1.

> Improve table statistics when the table contains arrays
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> By default, when the table contains table-stats, the value of *rawDataSize* 
> is taken to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. 
> As a result, the table size is underestimated when arrays make up most of 
> the table size.
> In those specific cases, the value of the *totalSize* is much closer to the 
> truth.
> In this task I propose to take the *max* value between *rawDataSize* and 
> *totalSize*deserializationFactor*.
> I don't know if this proposal will backfire in any specific cases 
> (overestimating the size of tables).





[jira] [Updated] (HIVE-20523) Improve table statistics when the table contains arrays

2018-09-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Attachment: HIVE-20523.patch
Status: Patch Available  (was: Open)

> Improve table statistics when the table contains arrays
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20523.patch
>
>
> By default, when the table contains table-stats, the value of *rawDataSize* 
> is taken to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. 
> As a result, the table size is underestimated when arrays make up most of 
> the table size.
> In those specific cases, the value of the *totalSize* is much closer to the 
> truth.
> In this task I propose to take the *max* value between *rawDataSize* and 
> *totalSize*deserializationFactor*.
> I don't know if this proposal will backfire in any specific cases 
> (overestimating the size of tables).





[jira] [Updated] (HIVE-20523) Improve table statistics when the table contains arrays

2018-09-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20523:
-
Description: 
By default, when the table contains table-stats, the value of *rawDataSize* is 
taken to estimate the table data size in the execution plan.

The problem is that rawDataSize does not include the data size of arrays. As a 
result, the table size is underestimated when arrays make up most of the table size.

In those specific cases, the value of the *totalSize* is much closer to the 
truth.

In this task I propose to take the *max* value between *rawDataSize* and 
*totalSize*deserializationFactor*.

I don't know if this proposal will backfire in any specific cases 
(overestimating the size of tables).

  was:
By default, when the table contains table-stats, the value of *rawDataSize* is 
taken to estimate the table data size in the execution plan.

The problem is that rawDataSize does not include the data size of arrays. As a 
result, the table size is underestimated when arrays make up most of the table size.

In those specific cases, the value of the *totalSize* is much closer to the 
truth.

In this task I propose to take the max value between *rawDataSize* and 
*totalSize*deserializationFactor*.

I don't know if this proposal will backfire in any specific cases 
(overestimating the size of tables).


> Improve table statistics when the table contains arrays
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
>
> By default, when the table contains table-stats, the value of *rawDataSize* 
> is taken to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. 
> As a result, the table size is underestimated when arrays make up most of 
> the table size.
> In those specific cases, the value of the *totalSize* is much closer to the 
> truth.
> In this task I propose to take the *max* value between *rawDataSize* and 
> *totalSize*deserializationFactor*.
> I don't know if this proposal will backfire in any specific cases 
> (overestimating the size of tables).
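
The proposal in this description amounts to a one-line rule, sketched below 
under stated assumptions. The function name and the sample numbers are 
hypothetical; only the names rawDataSize, totalSize and deserializationFactor 
come from the description. Whether deserializationFactor corresponds to a 
specific Hive setting such as hive.stats.deserialization.factor is an 
assumption here and not confirmed by this thread.

```python
def estimated_data_size(raw_data_size, total_size, deserialization_factor):
    """Return the larger of rawDataSize and totalSize * deserializationFactor."""
    scaled_total = int(total_size * deserialization_factor)
    return max(raw_data_size, scaled_total)


# Hypothetical table dominated by arrays: rawDataSize is far too small,
# so the scaled on-disk size wins.
array_heavy = estimated_data_size(
    raw_data_size=1_000, total_size=50_000_000, deserialization_factor=3.0
)

# Hypothetical table where rawDataSize is already the larger value,
# so the estimate is unchanged.
already_large = estimated_data_size(
    raw_data_size=200_000_000, total_size=50_000_000, deserialization_factor=1.0
)
```

Because the rule only ever raises the estimate, it cannot make an 
underestimate worse. The risk noted in the description is the opposite one: 
a large factor could overestimate tables whose rawDataSize was already 
accurate.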





[jira] [Assigned] (HIVE-20523) Improve table statistics when the table contains arrays

2018-09-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu reassigned HIVE-20523:



> Improve table statistics when the table contains arrays
> ---
>
> Key: HIVE-20523
> URL: https://issues.apache.org/jira/browse/HIVE-20523
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
>
> By default, when the table contains table-stats, the value of *rawDataSize* 
> is taken to estimate the table data size in the execution plan.
> The problem is that rawDataSize does not include the data size of arrays. 
> As a result, the table size is underestimated when arrays make up most of 
> the table size.
> In those specific cases, the value of the *totalSize* is much closer to the 
> truth.
> In this task I propose to take the max value between *rawDataSize* and 
> *totalSize*deserializationFactor*.
> I don't know if this proposal will backfire in any specific cases 
> (overestimating the size of tables).





[jira] [Updated] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-09-08 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20480:
-
Attachment: HIVE-20480.1.patch
Status: Patch Available  (was: Reopened)

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20480.1.patch, HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]





[jira] [Updated] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-08-29 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20480:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]





[jira] [Updated] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-08-29 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20480:
-
Description:  Implementing the rule for column stats: Follow-up task for 
[HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]  (was:  
Implementing the rule for column stats: Follow-up task for 
[HIVE-20262|http://example.com/])

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|https://issues.apache.org/jira/browse/HIVE-20262]





[jira] [Commented] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-08-28 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595588#comment-16595588
 ] 

George Pachitariu commented on HIVE-20480:
--

Hi [~ashutoshc],

This is the follow-up task with a patch for your comment here: HIVE-20262

Is this what you meant?

Can you also please give me an idea on how I could test this?

 

Sorry for taking this long to come back to you.

George :)

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|http://example.com/]





[jira] [Updated] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-08-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20480:
-
Description:  Implementing the rule for column stats: Follow-up task for 
[HIVE-20262|http://example.com/]  (was:  

Implementing the rule for column stats: Follow up task for 
[HIVE-20262|http://example.com])

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20480.patch
>
>
>  Implementing the rule for column stats: Follow-up task for 
> [HIVE-20262|http://example.com/]





[jira] [Updated] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-08-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20480:
-
Attachment: HIVE-20480.patch
Status: Patch Available  (was: Open)

> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20480.patch
>
>
>  
> Implementing the rule for column stats: Follow up task for 
> [HIVE-20262|http://example.com]





[jira] [Assigned] (HIVE-20480) Implement column stats annotation rules for the UDTFOperator: Follow up for HIVE-20262

2018-08-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu reassigned HIVE-20480:



> Implement column stats annotation rules for the UDTFOperator: Follow up for 
> HIVE-20262
> --
>
> Key: HIVE-20480
> URL: https://issues.apache.org/jira/browse/HIVE-20480
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
>
>  
> Implementing the rule for column stats: Follow up task for 
> [HIVE-20262|http://example.com]





[jira] [Commented] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-31 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563481#comment-16563481
 ] 

George Pachitariu commented on HIVE-20262:
--

Yes, I will. :)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20262.1.patch, HIVE-20262.2.patch, HIVE-20262.patch
>
>
> User Defined Table Functions (UDTFs) change the number of rows of the output. 
> A common UDTF is the explode() function, which creates a row for each element 
> of each array in the input column.
>  
> Right now, the estimated number of output rows is equal to the number of 
> input rows. But if the average number of output rows per input row is bigger 
> than 1, the resulting number of rows is underestimated in the execution plan.
>  
> Implement a rule that takes a factor X as a parameter and, for each UDTF, 
> predicts that:
>  
> {code:java}
> number of output rows = X * number of input rows{code}
>  
>  
>  
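
As a rough illustration of the rule proposed above (this is not the code from the attached patches), the estimate could be computed as in the sketch below. The class name `UdtfRowEstimate`, the method `estimateOutputRows`, and the parameter name `factor` (the X in the description) are hypothetical. Rounding up and clamping to `Long.MAX_VALUE` are assumptions of this sketch, not behaviour stated in the issue.

```java
public class UdtfRowEstimate {

    // Hypothetical helper, not taken from the HIVE-20262 patch:
    // estimates the number of UDTF output rows as factor * inputRows.
    public static long estimateOutputRows(long inputRows, double factor) {
        if (inputRows < 0 || factor < 0 || Double.isNaN(factor)) {
            throw new IllegalArgumentException("inputRows and factor must be non-negative");
        }
        double estimate = factor * inputRows;
        // Assumption: clamp instead of overflowing for very large inputs.
        if (estimate >= (double) Long.MAX_VALUE) {
            return Long.MAX_VALUE;
        }
        // Assumption: round up so a fractional estimate is never reported as fewer rows.
        return (long) Math.ceil(estimate);
    }

    public static void main(String[] args) {
        // A factor of 1.0 reproduces the current estimate (output rows == input rows).
        System.out.println(estimateOutputRows(1000L, 1.0));
        // An explode() over arrays that average 3.5 elements per input row.
        System.out.println(estimateOutputRows(1000L, 3.5));
    }
}
```

With a factor of 1.0 the sketch returns the input row count unchanged, which matches the behaviour the description calls an underestimate; any larger factor scales the estimate linearly.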





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-29 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Attachment: HIVE-20262.2.patch
Status: Patch Available  (was: Open)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.1.patch, HIVE-20262.2.patch, HIVE-20262.patch





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-29 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Status: Open  (was: Patch Available)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.1.patch, HIVE-20262.2.patch, HIVE-20262.patch





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-29 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Attachment: HIVE-20262.1.patch
Status: Patch Available  (was: Open)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.1.patch, HIVE-20262.patch





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-29 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Status: Open  (was: Patch Available)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.patch





[jira] [Commented] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-28 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560881#comment-16560881
 ] 

George Pachitariu commented on HIVE-20262:
--

Can someone please review this code?

Thank you.

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.patch





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Attachment: HIVE-20262.patch
Status: Patch Available  (was: Open)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.patch





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Attachment: (was: HIVE-20262.patch)

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.patch





[jira] [Updated] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu updated HIVE-20262:
-
Attachment: HIVE-20262.patch

> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor
> Attachments: HIVE-20262.patch





[jira] [Assigned] (HIVE-20262) Implement stats annotation rule for the UDTFOperator

2018-07-28 Thread George Pachitariu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pachitariu reassigned HIVE-20262:



> Implement stats annotation rule for the UDTFOperator
> 
>
> Key: HIVE-20262
> URL: https://issues.apache.org/jira/browse/HIVE-20262
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Minor





[jira] [Commented] (HIVE-17379) Null Pointer Exception in WHERE clause when using aggregate function as a filter

2018-07-16 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544972#comment-16544972
 ] 

George Pachitariu commented on HIVE-17379:
--

No, [~kgyrtkirk]. Same scenario, and also on Hive 2.1.1.

> Null Pointer Exception in WHERE clause when using aggregate function as a 
> filter  
> --
>
> Key: HIVE-17379
> URL: https://issues.apache.org/jira/browse/HIVE-17379
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 2.1.1
> Reporter: Sharanya Santhanam
> Priority: Major
>
> Sample query: 
> with tableAAlias as (
>   select a, count(z) as acount
>   from tableA
>   group by a
> )
> select a.a, b.b 
> from tableB as b JOIN 
> tableAAlias a
> on a.a=b.a
> where a.acount > 10 
> FAILED: NullPointerException null
> java.lang.NullPointerException
> at org.apache.hadoop.hive.ql.optimizer.ColumnPrunerProcFactory$ColumnPrunerFilterProc.process(ColumnPrunerProcFactory.java:103)
> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
> at org.apache.hadoop.hive.ql.optimizer.ColumnPruner$ColumnPrunerWalker.walk(ColumnPruner.java:176)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
> at org.apache.hadoop.hive.ql.optimizer.ColumnPruner.transform(ColumnPruner.java:136)
> at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:246)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11149)
> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:246)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:264)
> at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:80)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:264)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:490)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1270)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1412)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1199)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1189)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:265)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:210)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:444)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:514)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:882)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:836)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:732)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:223)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> The above query succeeds if it is modified as: 
> select a.a, b.b, *a.acount*
> from tableB as b JOIN 
> tableAAlias a
> on a.a=b.a
> where a.acount > 10 
> Please note that the original query worked on Hive 1.2 and breaks on Hive 2.1.1.





[jira] [Comment Edited] (HIVE-17379) Null Pointer Exception in WHERE clause when using aggregate function as a filter

2018-06-25 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522603#comment-16522603
 ] 

George Pachitariu edited comment on HIVE-17379 at 6/25/18 5:51 PM:
---

Hi [~sharanya], I had the same error and solved it with:
{code:java}
set hive.ppd.remove.duplicatefilters=false;
{code}


was (Author: george.pachitariu):
Hi [~sharanya] , I had the same error and I solved it with:
{code:java}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=false;
{code}






[jira] [Comment Edited] (HIVE-17379) Null Pointer Exception in WHERE clause when using aggregate function as a filter

2018-06-25 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522603#comment-16522603
 ] 

George Pachitariu edited comment on HIVE-17379 at 6/25/18 5:50 PM:
---

Hi [~sharanya], I had the same error and solved it with:
{code:java}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=false;
{code}


was (Author: george.pachitariu):
Hi [~sharanya] , I had the same error and I solved it with:

```

set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=false;

```






[jira] [Commented] (HIVE-17379) Null Pointer Exception in WHERE clause when using aggregate function as a filter

2018-06-25 Thread George Pachitariu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522603#comment-16522603
 ] 

George Pachitariu commented on HIVE-17379:
--

Hi [~sharanya], I had the same error and solved it with:

```

set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=false;

```



