[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase snapshots is extremely inefficient.

2020-03-18 Thread Xu Cang (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Cang updated PHOENIX-5774:
-
External issue ID: PHOENIX-4997

> Phoenix Mapreduce job over hbase snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Assignee: Xu Cang
>Priority: Major
>
> Internally we have a tenant-estimation framework that calculates the number 
> of rows each tenant occupies in the cluster. The framework launches a 
> MapReduce (MR) job per table that runs the query "SELECT tenant_id FROM 
> <table-name>", and we count rows per tenant_id in the reducer phase.
> Earlier we ran this query against the live table, but the meta table was 
> getting hammered for as long as the job ran, so we moved the MR job onto 
> HBase snapshots instead of the live table, taking advantage of this 
> feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> When querying the live table, the MR job for one of the biggest tables in 
> our sandbox cluster took around 2.5 hours.
> After we switched to HBase snapshots, the MR job for the same table took 
> 135 hours. We cap concurrently running mappers at 15 to avoid hammering the 
> meta table when querying live tables, and we did not remove that cap after 
> moving to snapshots, so the job would take well under 135 hours without it; 
> even so, the per-mapper slowdown described below is the real problem.
> Some statistics about that table:
> Size: 2.70 TB (corrected from 578 GB); number of regions in that table: 670 
> (corrected from 161)
> The average map task took 3 minutes 11 seconds when querying the live table.
> The average map task took 5 hours 33 minutes when querying HBase snapshots.
> The issue is that we do not consider the snapshot's regions while 
> generating splits. During the map phase, each map task has to walk all 
> regions in the snapshot to determine which regions cover the start and end 
> keys assigned to that task. After determining those regions, it has to open 
> each one and scan all the HFiles in it. In one such map task, the split's 
> key range was distributed across 289 regions (of the snapshot, not the live 
> table). Reading each region took an average of 90 seconds, so 289 regions 
> took approximately 7 hours (a configuration sketch follows below).
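
For reference, a minimal sketch of how such a job can be pointed at an HBase 
snapshot instead of the live table, assuming the snapshot-aware 
PhoenixMapReduceUtil.setInput overload that PHOENIX-3744 added (the snapshot 
name, table name, restore directory, mapper-cap property, and the 
TenantIdWritable class below are illustrative assumptions, not names taken 
from this issue):

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil;

    public class TenantRowCountJob {

      // Minimal DBWritable that captures the tenant_id column.
      public static class TenantIdWritable implements DBWritable {
        String tenantId;
        public void readFields(ResultSet rs) throws SQLException {
          tenantId = rs.getString(1);
        }
        public void write(PreparedStatement ps) throws SQLException {
          ps.setString(1, tenantId);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The 15-mapper cap mentioned above (assumes the Hadoop 2.7+
        // mapreduce.job.running.map.limit property).
        conf.setInt("mapreduce.job.running.map.limit", 15);
        Job job = Job.getInstance(conf, "tenant-row-count");

        // Live-table variant (what the job originally did):
        // PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class,
        //     "MY_TABLE", "SELECT TENANT_ID FROM MY_TABLE");

        // Snapshot variant (PHOENIX-3744): reads HFiles from the restored
        // snapshot directly, bypassing region servers and the meta table.
        PhoenixMapReduceUtil.setInput(job, TenantIdWritable.class,
            "MY_TABLE_SNAPSHOT", "MY_TABLE",
            new Path("/tmp/snapshot-restore"),
            "SELECT TENANT_ID FROM MY_TABLE");

        // The mapper would emit (tenant_id, 1) and the reducer would sum the
        // counts per tenant; those classes are omitted from this sketch.
        job.waitForCompletion(true);
      }
    }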



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Summary: Phoenix Mapreduce job over hbase snapshots is extremely 
inefficient.  (was: Phoenix Mapreduce job over hbase Snapshots is extremely 
inefficient.)

> Phoenix Mapreduce job over hbase snapshots is extremely inefficient.
> 
>
> Key: PHOENIX-5774
> URL: https://issues.apache.org/jira/browse/PHOENIX-5774
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.13.1
>Reporter: Rushabh Shah
>Priority: Major
>
> Internally we have a tenant-estimation framework that calculates the number 
> of rows each tenant occupies in the cluster. The framework launches a 
> MapReduce (MR) job per table that runs the query "SELECT tenant_id FROM 
> <table-name>", and we count rows per tenant_id in the reducer phase.
> Earlier we ran this query against the live table, but the meta table was 
> getting hammered for as long as the job ran, so we moved the MR job onto 
> HBase snapshots instead of the live table, taking advantage of this 
> feature: https://issues.apache.org/jira/browse/PHOENIX-3744
> When querying the live table, the MR job for one of the biggest tables in 
> our sandbox cluster took around 2.5 hours.
> After we switched to HBase snapshots, the MR job for the same table took 
> 135 hours. We cap concurrently running mappers at 15 to avoid hammering the 
> meta table when querying live tables, and we did not remove that cap after 
> moving to snapshots, so the job would take well under 135 hours without it; 
> even so, the per-mapper slowdown described below is the real problem.
> Some statistics about that table:
> Size: 2.70 TB (corrected from 578 GB); number of regions in that table: 670 
> (corrected from 161)
> The average map task took 3 minutes 11 seconds when querying the live table.
> The average map task took 5 hours 33 minutes when querying HBase snapshots.
> The issue is that we do not consider the snapshot's regions while 
> generating splits. During the map phase, each map task has to walk all 
> regions in the snapshot to determine which regions cover the start and end 
> keys assigned to that task. After determining those regions, it has to open 
> each one and scan all the HFiles in it. In one such map task, the split's 
> key range was distributed across 289 regions (of the snapshot, not the live 
> table). Reading each region took an average of 90 seconds, so 289 regions 
> took approximately 7 hours (a worked cost sketch follows below).
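
To make the arithmetic in the description concrete, here is a small 
self-contained sketch (illustrative only, not Phoenix source) of the cost 
model it describes: splits are generated without consulting the snapshot's 
region boundaries, so one split's key range can cover hundreds of snapshot 
regions, each of which must be opened and scanned in full:

    public class SnapshotSplitCost {
      public static void main(String[] args) {
        // Figures reported in the description above.
        int regionsCoveredByOneSplit = 289; // snapshot regions overlapping one split
        double secondsPerRegion = 90.0;     // average open-and-scan time per region

        double hours = regionsCoveredByOneSplit * secondsPerRegion / 3600.0;
        System.out.printf("One map task: %d regions x %.0f s = %.1f hours%n",
            regionsCoveredByOneSplit, secondsPerRegion, hours);
        // Prints roughly 7.2 hours, matching the ~7 hours observed above. If
        // splits were instead derived from the snapshot's own region
        // boundaries, each map task would open exactly one region.
      }
    }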



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 2.71 TB (corrected from 578 GB); number of regions in that table: 670 
(corrected from 161)

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.

  was:
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 2.71 TB (corrected from 578 GB); number of regions in that table: 671 
(corrected from 161)

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.



[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 2.70 TB (corrected from 578 GB); number of regions in that table: 670 
(corrected from 161)

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.

  was:
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 2.71 TB (corrected from 578 GB); number of regions in that table: 670 
(corrected from 161)

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.



[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 2.71 TB (corrected from 578 GB); number of regions in that table: 671 
(corrected from 161)

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.

  was:
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 578 GB; number of regions in that table: 161

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.



[jira] [Updated] (PHOENIX-5774) Phoenix Mapreduce job over hbase Snapshots is extremely inefficient.

2020-03-12 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated PHOENIX-5774:
--
Description: 
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 578 GB; number of regions in that table: 161

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions (of the snapshot, not the live table). Reading 
each region took an average of 90 seconds, so 289 regions took approximately 
7 hours.

  was:
Internally we have a tenant-estimation framework that calculates the number of 
rows each tenant occupies in the cluster. The framework launches a MapReduce 
(MR) job per table that runs the query "SELECT tenant_id FROM <table-name>", 
and we count rows per tenant_id in the reducer phase.
Earlier we ran this query against the live table, but the meta table was 
getting hammered for as long as the job ran, so we moved the MR job onto HBase 
snapshots instead of the live table, taking advantage of this feature: 
https://issues.apache.org/jira/browse/PHOENIX-3744

When querying the live table, the MR job for one of the biggest tables in our 
sandbox cluster took around 2.5 hours.
After we switched to HBase snapshots, the MR job for the same table took 135 
hours. We cap concurrently running mappers at 15 to avoid hammering the meta 
table when querying live tables, and we did not remove that cap after moving 
to snapshots, so the job would take well under 135 hours without it; even so, 
the per-mapper slowdown described below is the real problem.

Some statistics about that table:
Size: 578 GB; number of regions in that table: 161

The average map task took 3 minutes 11 seconds when querying the live table.
The average map task took 5 hours 33 minutes when querying HBase snapshots.

The issue is that we do not consider the snapshot's regions while generating 
splits. During the map phase, each map task has to walk all regions in the 
snapshot to determine which regions cover the start and end keys assigned to 
that task. After determining those regions, it has to open each one and scan 
all the HFiles in it. In one such map task, the split's key range was 
distributed across 289 regions. Reading each region took an average of 90 
seconds, so 289 regions took approximately 7 hours.

