Michael Rose created CRUNCH-602:
-----------------------------------
Summary: Combiner initialization repeatedly retrieves RT nodes
from DistCache, leading to high NN load
Key: CRUNCH-602
URL: https://issues.apache.org/jira/browse/CRUNCH-602
Project: Crunch
Issue Type: Improvement
Components: Core
Affects Versions: 0.13.0, 0.12.0, 0.14.0
Environment: Crunch 0.14-SNAPSHOT, CDH5.6.0
Reporter: Michael Rose
Assignee: Josh Wills
When running one of our Crunch pipelines, we noticed our NameNode under very
heavy load. We run our masters on pretty light hardware, so our NN was sitting
at 100% CPU.
Crunch reads the RTNodes during creation of a CrunchTaskContext. These are
created when Mappers and Reducers are created. Importantly, a CrunchCombiner is
a subclass of a Reducer, so each mapper will create R combiners where R is the
number of reducers and thus R CrunchTaskContexts. Consequently in highly
parallel jobs, this means M*R semi-expensive calls to the NameNode.
In the constructor for CrunchTaskContext, this is the read to the DistCache:
this.nodes = (List<RTNode>) DistCache.read(conf, path);
Which then leads to a read into the NN + deserialization.
For now, we took the overly simplistic approach of caching the results of the
DistCache read in a Guava cache. The cache ensures combiners reuse RTNodes with
only the overhead of deserialization which is somewhat unavoidable as RTNodes
are stateful and not reusable. However, it's not configurable except by
modifying code.
I'll attach the patch, but given that it's not yet configurable I wouldn't call
it a "fix available." There may be much better ways of fixing this issue as
well -- if you have some guidance I'd be happy to do the legwork on a patch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)