[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2021-02-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283461#comment-17283461
 ] 

ASF subversion and git services commented on IMPALA-9382:
-

Commit 6fe3466d4d0a88941206aed8d265d9663758fec2 in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=6fe3466 ]

IMPALA-9382: part 3/3 clean up runtime profile v2 text output

Eliminated some of the noisy per-instance counters
from DEFAULT verbosity.

Testing:
* Updated impala-profile-tool test with new output
* Added new impala-profile-tool test for v2 profile.

Change-Id: I277a0da749bcda4ecca574257b5aaacbcf222491
Reviewed-on: http://gerrit.cloudera.org:8080/17050
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins 


> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the thrift representation and 
> in-memory representation to get significant gains.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2021-02-09 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281873#comment-17281873
 ] 

Tim Armstrong commented on IMPALA-9382:
---

Actually I should reduce the verbosity of the default option a bit as part 3

> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the thrift representation and 
> in-memory representation to get significant gains.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2021-01-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265777#comment-17265777
 ] 

ASF subversion and git services commented on IMPALA-9382:
-

Commit 5fddbb569a68bd438d9f4847481c8db1fce5b67c in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5fddbb5 ]

IMPALA-9865: part 2/2: add verbosity to profile tool

Adds a --profile_verbosity option for impala-profile-tool with
the following levels:
* 0: minimal
* 1: legacy - matches old output, this is the default still
* 2: default - basic descriptive stats, used for V2 profile.
* 3: extended
* 4: full

This will help with transition to the V2 profile because we
can have a nice, high-level, readable text profile by default
with the option to produce more detailed profiles and alternate
views of the profile from the thrift profile.

Use the profile version in impala-profile-tool to dump the
more verbose output for the V2 profile while preserving the
same output for the legacy profile.

Reduce verbosity of v2 profile output - only include mean/min/max
by default. I intend to refine the output at the different
verbosity levels for the v2 profiles further as part of IMPALA-9382,
it is still fairly noisy.

Fix output with/without gen_experimental_profile - there
was a small difference in that the summary stats were not
output in the averaged profile.

Testing:
* Add an end-to-end test that generates output for a small
  profile log and compares against expected files.
* Tweak other profile tests to reflect changes to output.

Change-Id: I82618a813e29af7996dfaed78873b2a73bc0231d
Reviewed-on: http://gerrit.cloudera.org:8080/16881
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the thrift representation and 
> in-memory representation to get significant gains.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2020-12-15 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249938#comment-17249938
 ] 

Tim Armstrong commented on IMPALA-9382:
---

Avoiding the virtual function calls while decoding thrift could help a lot - 
see IMPALA-1638

> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the thrift representation and 
> in-memory representation to get significant gains.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2020-11-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239822#comment-17239822
 ] 

ASF subversion and git services commented on IMPALA-9382:
-

Commit 9429bd779de986d3e61858bef7e258bd73a2cacd in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9429bd7 ]

IMPALA-9382: part 2/3: aggregate profiles sent to coordinator

This reworks the status reporting so that serialized
AggregatedRuntimeProfile objects are sent from executors
to coordinators. These profiles are substantially denser
and faster to process for higher mt_dop values. The aggregation
is also done in a single step, merging the aggregated thrift
profile from the executor directly into the final aggregated
profile, instead of converting it to an unaggregated profile
first.

The changes required were:
* A new Update() method for AggregatedRuntimeProfile that
  updates the profile from a serialised AggregateRuntimeProfile
  for a subset of the instances. The code is generalized from the
  existing InitFromThrift() code path.
* Per-fragment reports included in the status report protobuf
  when --gen_experimental_profile=true.
* Logic on the coordinator that either consumes serialized
  AggregatedRuntimeProfile per fragment, when
  --gen_experimental_profile=true, or consumes a serialized
  RuntimeProfile per finstance otherwise.

This also adds support for event sequences and time series
in the aggregated profile, so the amount of information
in the aggregated profile is now on par with the basic profile.

We also finish off support for JSON profile. The JSON profile is
more stripped down because we do not need to round-trip profiles
via JSON and it is a much less dense profile representation.

Part 3 will clean up and improve the display of the profile.

Testing:
* Add sanity tests for aggregated runtime profile.
* Add unit tests to exercise aggregation of the various counter types
* Ran core tests.

Change-Id: Ic680cbfe94c939c2a8fad9d0943034ed058c6bca
Reviewed-on: http://gerrit.cloudera.org:8080/16057
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the thrift representation and 
> in-memory representation to get significant gains.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2020-10-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206847#comment-17206847
 ] 

ASF subversion and git services commented on IMPALA-9382:
-

Commit e60292fb3bd71f25b90119d0d48292f4c49e158f in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e60292f ]

IMPALA-9711: incrementally update aggregate profile

In order to not cause additional work in the default mode,
we still only compute the average once per instance,
when it completes or when the query finishes.

When --gen_experimental_profile=true, we update the aggregated
profile for each status report, so that the live profile
can be viewed as the query executes.

The implications of this are as follows:
* More work is done on the KRPC control service RPC thread
  (although this is largely moot after part 2 of IMPALA-9382
   where we merge into the aggregated profile directly,
   so avoid the extra update).
* For complex multi-stage queries, the profile merging
  work is done earlier as each stage completes, therefore
  the critical path of the query is shortened
* Multiple RPC threads may be merging profiles concurrently
* Multiple threads may be calling AggregatedRuntimeProfile::Update()
  on the same profile, whereas previously all merging was done by
  a single thread. I looked through the locking in that function to
  check correctness.

Testing:
Ran core tests.

Ran a subset of the Python tests under TSAN, confirmed no races
were introduced in this code.

Change-Id: Ib03e79a40a33d8e74464640ae5f95a1467a6713a
Reviewed-on: http://gerrit.cloudera.org:8080/15931
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins 


> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the thrift representation and 
> in-memory representation to get significant gains.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2020-09-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199700#comment-17199700
 ] 

ASF subversion and git services commented on IMPALA-9382:
-

Commit 79b17db83a2f97a409e468c5dd5cfaeb9454f5dd in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=79b17db ]

IMPALA-9382: part 1: transposed profile prototype

This adds an experimental profile representation
that is denser than the traditional representation.
Counters, info strings and other information for all
instances of a fragment are merged into a single
tree. Descriptive stats (min, max, mean) are shown for
each counter, along with the values for each instance. It
can be enabled by setting --gen_experimental_profile=true.
The default behaviour is unchanged, aside from including
a few extra counters in existing profiles.

An example of the pretty-printed profile is attached
to the JIRA.

The thrift representation of the profile is extended
so that all instances of a fragment can be merged
together into a single "aggregated" fragment, with
vectors of counters.

The in-memory representation is transformed in
a similar way. The RuntimeProfile class is
restructured so that there is a common RuntimeProfileBase
class, with RuntimeProfile and AggregatedRuntimeProfile
subclasses. Execution fills in counters in RuntimeProfile
for each instances, then these are aggregated together into
an AggregatedRuntimeProfile on the coordinator. This replaces
the "averaged" profile concept with an abstraction that
more clearly distinguishes what operations apply to aggregated
and unaggregated profiles.

In a future change, we could use AggregatedRuntimeProfile
for status reports so that less data needs to be sent to
the coordinator, and the coordinator needs to do less
processing.

The new profile removes the bad practice of including aggregated
stats as strings from the new profile. These stats can now be
automatically as aggregations of counters. The legacy uses of
InfoString are preserved so as to not lose information but
can be removed when we switch to the transposed profile.

Also make TotalTime and InactiveTime behave like other counters -
they are pretty-printed the same as other counters. Inactive time
is also now subtracted from local time in the averaged profile,
which fixes IMPALA-2794.

TODO in later patches for IMPALA-9382:
These will need to be fixed before this can be considered production
ready.
* The JSON profile generation is not fully implemented for aggregated
  profiles.
* Not all counter times are included in aggregated profile, e.g. time
  series counters.
* The pretty-printing of the various profile counters will need to be
  improved to be more readable, e.g. grouping by host, improving
  formatting.
* The aggregated profile is only updated at the end of the query.
  We need to support live updating.
* Consider how to show local time per instance - make it a first-class
  counter in the profile?

Possible extensions:
* We could better highlight outliers when pretty-printing the profile.

Testing:
* I diffed the text profile of TPC-DS Q1 to make sure there were no
  unexpected changes.
* Added unit test for stats computation in AveragedCounter.
* Passed core tests.
* exhaustive tests
* ASAN tests
* Ran some tests locally with TSAN

Change-Id: I0838c6a0872f57c696267ff4e92d29c08748eb7a
Reviewed-on: http://gerrit.cloudera.org:8080/15798
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the th

[jira] [Commented] (IMPALA-9382) Prototype denser runtime profile implementation

2020-09-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199702#comment-17199702
 ] 

ASF subversion and git services commented on IMPALA-9382:
-

Commit 79b17db83a2f97a409e468c5dd5cfaeb9454f5dd in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=79b17db ]

IMPALA-9382: part 1: transposed profile prototype

This adds an experimental profile representation
that is denser than the traditional representation.
Counters, info strings and other information for all
instances of a fragment are merged into a single
tree. Descriptive stats (min, max, mean) are shown for
each counter, along with the values for each instance. It
can be enabled by setting --gen_experimental_profile=true.
The default behaviour is unchanged, aside from including
a few extra counters in existing profiles.

An example of the pretty-printed profile is attached
to the JIRA.

The thrift representation of the profile is extended
so that all instances of a fragment can be merged
together into a single "aggregated" fragment, with
vectors of counters.

The in-memory representation is transformed in
a similar way. The RuntimeProfile class is
restructured so that there is a common RuntimeProfileBase
class, with RuntimeProfile and AggregatedRuntimeProfile
subclasses. Execution fills in counters in RuntimeProfile
for each instances, then these are aggregated together into
an AggregatedRuntimeProfile on the coordinator. This replaces
the "averaged" profile concept with an abstraction that
more clearly distinguishes what operations apply to aggregated
and unaggregated profiles.

In a future change, we could use AggregatedRuntimeProfile
for status reports so that less data needs to be sent to
the coordinator, and the coordinator needs to do less
processing.

The new profile removes the bad practice of including aggregated
stats as strings from the new profile. These stats can now be
automatically as aggregations of counters. The legacy uses of
InfoString are preserved so as to not lose information but
can be removed when we switch to the transposed profile.

Also make TotalTime and InactiveTime behave like other counters -
they are pretty-printed the same as other counters. Inactive time
is also now subtracted from local time in the averaged profile,
which fixes IMPALA-2794.

TODO in later patches for IMPALA-9382:
These will need to be fixed before this can be considered production
ready.
* The JSON profile generation is not fully implemented for aggregated
  profiles.
* Not all counter times are included in aggregated profile, e.g. time
  series counters.
* The pretty-printing of the various profile counters will need to be
  improved to be more readable, e.g. grouping by host, improving
  formatting.
* The aggregated profile is only updated at the end of the query.
  We need to support live updating.
* Consider how to show local time per instance - make it a first-class
  counter in the profile?

Possible extensions:
* We could better highlight outliers when pretty-printing the profile.

Testing:
* I diffed the text profile of TPC-DS Q1 to make sure there were no
  unexpected changes.
* Added unit test for stats computation in AveragedCounter.
* Passed core tests.
* exhaustive tests
* ASAN tests
* Ran some tests locally with TSAN

Change-Id: I0838c6a0872f57c696267ff4e92d29c08748eb7a
Reviewed-on: http://gerrit.cloudera.org:8080/15798
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Prototype denser runtime profile implementation
> ---
>
> Key: IMPALA-9382
> URL: https://issues.apache.org/jira/browse/IMPALA-9382
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Attachments: profile_504b379400cba9f2_2d2cf007, 
> tpcds_q10_profile_v1.txt, tpcds_q10_profile_v2.txt, tpcds_q10_profile_v2.txt
>
>
> RuntimeProfile trees can potentially stress the memory allocator and use up a 
> lot more memory and cache than is really necessary:
> * std::map is used throughout, and allocates a node per map entry. We do 
> depend on the counters being displayed in-order, but we would probably be 
> better of storing the counters in a vector and lazily sorting when needed 
> (since the set of counters is generally static after Prepare()).
> * We store the same counter names redundantly all over the place. We'd 
> probably be best off using a pool of constant counter names (we could just 
> require registering them upfront).
> There may be a small gain from switching thrift to using unordered_map, e.g. 
> for the info strings that appear with some frequency in profiles.
> However, I think we need to restructure the th