I am training a boosted trees model on a couple million input samples (with
around 300 features) and am noticing that the input size of each stage is
increasing each iteration. For each new tree, the first step seems to be
building the decision tree metadata, which does a .count() on the input;
that input of ~20K records was growing each iteration by ~15M records. I
could not see why at first glance.
@jkbradley I know it's not much info but does that ring any bells? I
think Johannes even has an instance of this up and running for
examination.
On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes
mfor
I have multiple input paths which each contain data that need to be mapped
in a slightly different way into a common data structure. My approach boils
down to:
RDD<T> rdd = null;
for (Configuration conf : configurations) {
  RDD<T> nextRdd = loadFromConfiguration(conf);
  rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
}
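For reference, the same null-seeded fold-and-union pattern can be sketched without Spark using plain Java lists; the `combine` and `concat` names here are hypothetical stand-ins for the RDD loop above, and appending lists plays the role of `rdd.union(nextRdd)`:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionPattern {
    // Null-seeded fold: mirrors the RDD loop, accumulating each batch into one result.
    static List<String> combine(List<List<String>> sources) {
        List<String> combined = null;
        for (List<String> source : sources) {
            // Stand-in for loadFromConfiguration(conf): each source yields one batch.
            List<String> next = new ArrayList<>(source);
            combined = (combined == null) ? next : concat(combined, next);
        }
        return combined;
    }

    // Analogous to rdd.union(nextRdd): builds a new list rather than mutating either input.
    static List<String> concat(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<String> combined = combine(Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f")));
        System.out.println(combined); // [a, b, c, d, e, f]
    }
}
```

Note that with RDDs, unlike plain lists, each union also extends the lineage graph, which is why long chains of unions in iterative jobs are typically broken up with periodic checkpointing.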