Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Matt Forbes
I am training a boosted trees model on a couple million input samples (with around 300 features) and am noticing that the input size of each stage is increasing each iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Matt Forbes
of ~20K records was growing each iteration by ~15M records. I could not see why on a first look. @jkbradley I know it's not much info but does that ring any bells? I think Johannes even has an instance of this up and running for examination. On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes mfor

Union of many RDDs taking a long time

2015-06-17 Thread Matt Forbes
I have multiple input paths which each contain data that need to be mapped in a slightly different way into a common data structure. My approach boils down to: RDD<T> rdd = null; for (Configuration conf : configurations) { RDD<T> nextRdd = loadFromConfiguration(conf); rdd = (rdd == null) ?
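
The snippet above is cut off, so the following is only a sketch under an assumption: that the loop goes on to union each newly loaded RDD into the accumulated one, pair by pair. The code does not use Spark at all. LineageNode, chainedUnion, flatUnion and leaves are hypothetical names invented here to model the shape of the dependency graph. It compares the nesting depth produced by unioning inside the loop with the depth produced by collecting all inputs first and unioning them once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a node in an RDD dependency graph. This is NOT Spark's API.
final class LineageNode {
    final List<LineageNode> parents;

    LineageNode(List<LineageNode> parents) {
        this.parents = parents;
    }

    // A leaf models one input path that has been loaded and mapped.
    static LineageNode leaf() {
        return new LineageNode(new ArrayList<LineageNode>());
    }

    // Length of the longest path from this node down to a leaf.
    int depth() {
        int max = 0;
        for (LineageNode parent : parents) {
            max = Math.max(max, 1 + parent.depth());
        }
        return max;
    }
}

final class UnionLineageSketch {
    // Assumed shape of the loop in the post: union pair by pair while iterating.
    // Returns null for an empty input list, as the original null-initialised loop would.
    static LineageNode chainedUnion(List<LineageNode> inputs) {
        LineageNode acc = null;
        for (LineageNode next : inputs) {
            acc = (acc == null) ? next : new LineageNode(Arrays.asList(acc, next));
        }
        return acc;
    }

    // Alternative: gather every input first, then build a single union node.
    static LineageNode flatUnion(List<LineageNode> inputs) {
        return new LineageNode(new ArrayList<LineageNode>(inputs));
    }

    // Builds n independent leaves.
    static List<LineageNode> leaves(int n) {
        List<LineageNode> result = new ArrayList<LineageNode>();
        for (int i = 0; i < n; i++) {
            result.add(LineageNode.leaf());
        }
        return result;
    }

    public static void main(String[] args) {
        List<LineageNode> inputs = leaves(50);
        System.out.println("chained union depth: " + chainedUnion(inputs).depth());
        System.out.println("flat union depth:    " + flatUnion(inputs).depth());
    }
}
```

In this toy model the chained form nests one level deeper for every extra input, while the single union stays one level deep regardless of how many inputs there are. Spark's SparkContext.union accepts a whole sequence of RDDs and corresponds to the flat form. Whether the nesting is what made the poster's job slow cannot be determined from the truncated snippet.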