[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-12-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-12491:

Attachment: HIVE-12491.5.patch

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>  Components: Statistics
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-12491.2.patch, HIVE-12491.3.patch, 
> HIVE-12491.4.patch, HIVE-12491.5.patch, HIVE-12491.WIP.patch, HIVE-12491.patch
>
>
> The eased out denominator has to detect duplicate row-stats from different 
> attributes.
> {code}
> select account_id from customers c,  customer_activation ca
>   where c.customer_id = ca.customer_id
>   and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
>   and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
> {code}
> {code}
>   private Long getEasedOutDenominator(List<Long> distinctVals) {
>     // Exponential back-off for NDVs.
>     // 1) Descending order sort of NDVs
>     // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ...
>     Collections.sort(distinctVals, Collections.reverseOrder());
>     long denom = distinctVals.get(0);
>     for (int i = 1; i < distinctVals.size(); i++) {
>       denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
>     }
>     return denom;
>   }
> {code}
> This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns, 
> 2 of which are derived from the same column.
> {code}
> Reduce Output Operator (RS_12)
>   key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>   sort order: +++
>   Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>   value expressions: _col1 (type: bigint)
>   Join Operator (JOIN_13)
>     condition map:
>          Inner Join 0 to 1
>     keys:
>       0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
>       1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
>     outputColumnNames: _col3
> {code}
> So the eased out denominator is off by a factor of 30,000 or so, causing OOMs 
> in map-joins.
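
To make the arithmetic in the description concrete, here is a minimal Java sketch that runs the quoted back-off on the reported NDV list and compares it with a variant that keeps one NDV per source column. The class name, the `easedOutPerSourceColumn` helper, and the column labels `customer_id` / `dt` are assumptions made for illustration; this is not the code from any of the attached patches. In the raw form the second copy of 821974390 enters as its square root, about 28,670, which is consistent with the "factor of 30,000 or so" mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EasedDenominatorSketch {

  // Same exponential back-off as the quoted method, applied to a copy so the
  // caller's list is not reordered. Expects a non-empty list.
  static long easedOut(List<Long> ndvs) {
    List<Long> sorted = new ArrayList<>(ndvs);
    Collections.sort(sorted, Collections.reverseOrder());
    long denom = sorted.get(0);
    for (int i = 1; i < sorted.size(); i++) {
      denom = (long) (denom * Math.pow(sorted.get(i), 1.0 / (1 << i)));
    }
    return denom;
  }

  // Hypothetical de-duplication: keep a single (largest) NDV per source
  // column before easing, so year(dt) and month(dt) count once.
  static long easedOutPerSourceColumn(List<String> sourceColumns, List<Long> ndvs) {
    Map<String, Long> perColumn = new LinkedHashMap<>();
    for (int i = 0; i < ndvs.size(); i++) {
      perColumn.merge(sourceColumns.get(i), ndvs.get(i), Long::max);
    }
    return easedOut(new ArrayList<>(perColumn.values()));
  }

  public static void main(String[] args) {
    // NDVs as reported in the issue; the column labels are assumed.
    List<Long> ndvs = Arrays.asList(8007986L, 821974390L, 821974390L);
    List<String> cols = Arrays.asList("customer_id", "dt", "dt");
    long raw = easedOut(ndvs);
    long deduped = easedOutPerSourceColumn(cols, ndvs);
    System.out.println("raw denominator     = " + raw);
    System.out.println("deduped denominator = " + deduped);
    System.out.println("ratio               = " + (raw / deduped));
  }
}
```

Grouping by source column is only one possible way to detect the duplicate; it still treats year(dt) and month(dt) as if they had the NDV of dt itself.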



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-12-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-12491:

Attachment: HIVE-12491.4.patch

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>  Components: Statistics
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-12491.2.patch, HIVE-12491.3.patch, 
> HIVE-12491.4.patch, HIVE-12491.WIP.patch, HIVE-12491.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-12-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-12491:

Attachment: HIVE-12491.3.patch

We can exploit semantic information about known udfs which have bounded NDVs.
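
As a rough illustration of that idea, the sketch below caps the NDV of an expression when the wrapping UDF can only produce a small, known number of distinct values. The class, the lookup table, and the `boundedNdv` method are hypothetical names chosen for this example; how the attached patch actually encodes these bounds is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class BoundedNdvSketch {

  // Hypothetical table of upper bounds on distinct outputs per UDF name.
  private static final Map<String, Long> MAX_NDV = new HashMap<>();
  static {
    MAX_NDV.put("quarter", 4L);      // 1..4
    MAX_NDV.put("month", 12L);       // 1..12
    MAX_NDV.put("dayofmonth", 31L);  // 1..31
    MAX_NDV.put("hour", 24L);        // 0..23
  }

  // Returns the input column's NDV, capped by the UDF's bound when one is known.
  static long boundedNdv(String udfName, long inputNdv) {
    Long cap = MAX_NDV.get(udfName.toLowerCase(Locale.ROOT));
    return cap == null ? inputNdv : Math.min(inputNdv, cap);
  }

  public static void main(String[] args) {
    long dtNdv = 821974390L; // NDV reported for the date column in the issue
    System.out.println("month(dt) NDV <= " + boundedNdv("month", dtNdv));
    System.out.println("year(dt)  NDV <= " + boundedNdv("year", dtNdv));
  }
}
```

Functions without a known bound, such as year() here, simply fall back to the input column's NDV.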

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>  Components: Statistics
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-12491.2.patch, HIVE-12491.3.patch, 
> HIVE-12491.WIP.patch, HIVE-12491.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-12-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-12491:

Component/s: Statistics

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>  Components: Statistics
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-12491.2.patch, HIVE-12491.WIP.patch, 
> HIVE-12491.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-12-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-12491:

Attachment: HIVE-12491.2.patch

Addressed comments and little bit of refactoring in StatsRuleProcFactory (no 
logic change there) for better readability.

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-12491.2.patch, HIVE-12491.WIP.patch, 
> HIVE-12491.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-12-01 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-12491:
-
Assignee: Ashutosh Chauhan  (was: Prasanth Jayachandran)

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-12491.WIP.patch, HIVE-12491.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-11-30 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-12491:

Attachment: HIVE-12491.patch

[~prasanth_j] Could you take a look?

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-12491.WIP.patch, HIVE-12491.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-11-25 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-12491:
---
Description: 
The eased out denominator has to detect duplicate row-stats from different 
attributes.

{code}
select account_id from customers c,  customer_activation ca
  where c.customer_id = ca.customer_id
  and year(ca.dt) = year(c.dt) and month(ca.dt) = month(c.dt)
  and year(ca.dt) between year('2013-12-26') and year('2013-12-26')
{code}

{code}
  private Long getEasedOutDenominator(List<Long> distinctVals) {
    // Exponential back-off for NDVs.
    // 1) Descending order sort of NDVs
    // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ...
    Collections.sort(distinctVals, Collections.reverseOrder());

    long denom = distinctVals.get(0);
    for (int i = 1; i < distinctVals.size(); i++) {
      denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
    }

    return denom;
  }
{code}

This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns, 2 of 
which are derived from the same column.

{code}
Reduce Output Operator (RS_12)
  key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
  sort order: +++
  Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
  value expressions: _col1 (type: bigint)
  Join Operator (JOIN_13)
    condition map:
         Inner Join 0 to 1
    keys:
      0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
      1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
    outputColumnNames: _col3
{code}

So the eased out denominator is off by a factor of 30,000 or so, causing OOMs 
in map-joins.

  was:
The eased out denominator has to detect duplicate row-stats from different 
attributes.

{code}
  private Long getEasedOutDenominator(List<Long> distinctVals) {
    // Exponential back-off for NDVs.
    // 1) Descending order sort of NDVs
    // 2) denominator = NDV1 * (NDV2 ^ (1/2)) * (NDV3 ^ (1/4)) * ...
    Collections.sort(distinctVals, Collections.reverseOrder());

    long denom = distinctVals.get(0);
    for (int i = 1; i < distinctVals.size(); i++) {
      denom = (long) (denom * Math.pow(distinctVals.get(i), 1.0 / (1 << i)));
    }

    return denom;
  }
{code}

This gets {{[8007986, 821974390, 821974390]}}, which is actually 3 columns, 2 of 
which are derived from the same column.

{code}
Reduce Output Operator (RS_12)
  key expressions: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
  sort order: +++
  Map-reduce partition columns: _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
  value expressions: _col1 (type: bigint)
  Join Operator (JOIN_13)
    condition map:
         Inner Join 0 to 1
    keys:
      0 _col0 (type: bigint), year(_col1) (type: int), month(_col1) (type: int)
      1 _col0 (type: bigint), year(_col2) (type: int), month(_col2) (type: int)
    outputColumnNames: _col3
{code}

So the eased out denominator is off by a factor of 30,000 or so, causing OOMs 
in map-joins.


> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-12491.WIP.patch

[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-11-22 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-12491:
---
Attachment: HIVE-12491.WIP.patch

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-12491.WIP.patch


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-11-21 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-12491:
---
Affects Version/s: 2.0.0
   1.3.0

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Gopal V
>Assignee: Prasanth Jayachandran


[jira] [Updated] (HIVE-12491) Column Statistics: 3 attribute join on a 2-source table is off

2015-11-21 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-12491:
---
Summary: Column Statistics: 3 attribute join on a 2-source table is off  
(was: Statistics: 3 attribute join on a 2-source table is off)

> Column Statistics: 3 attribute join on a 2-source table is off
> --
>
> Key: HIVE-12491
> URL: https://issues.apache.org/jira/browse/HIVE-12491
> Project: Hive
>  Issue Type: Bug
>Reporter: Gopal V
>Assignee: Prasanth Jayachandran