I actually tried that first. I moved away from it because the algorithm needs every model to score every record; for instance, a model trained on the pair (2,4) still needs to be evaluated on a record whose true label is 8. I found that if I apply the filter in the label-creation transformer, a record whose label is not 2 or 4 is dropped and never scored. I'd be curious to know whether there's a way to make that approach work, however.
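
To make the constraint concrete, here is a minimal plain-Python sketch (not Spark code) of the split I'm describing: the filter applies only to the rows used for fitting, while scoring still sees every record. The names `pair_label`, `training_rows`, and `vote` are hypothetical, and the majority vote is only an assumption about how the 45 predictions might be combined.

```python
from collections import Counter
from itertools import combinations

# All unordered digit pairs: (0,1), (0,2), ..., (8,9).
PAIRS = list(combinations(range(10), 2))


def pair_label(digit, lower, upper):
    """Return 0 for `lower`, 1 for `upper`, and None for any other digit."""
    if digit == lower:
        return 0
    if digit == upper:
        return 1
    return None


def training_rows(rows, lower, upper):
    """Fit-time filter: keep only rows whose digit belongs to the pair.

    `rows` is a list of (features, digit) tuples. Scoring does not call
    this function, so every record is still evaluated by every model.
    """
    labeled = ((features, pair_label(digit, lower, upper)) for features, digit in rows)
    return [(features, label) for features, label in labeled if label is not None]


def vote(pair_predictions):
    """Combine one record's per-pair 0/1 predictions into a single digit.

    `pair_predictions` maps (lower, upper) -> 0 or 1. Majority vote is an
    assumed aggregation rule; ties go to the smallest digit (arbitrary).
    """
    votes = Counter(
        upper if pred == 1 else lower
        for (lower, upper), pred in pair_predictions.items()
    )
    best = max(votes.values())
    return min(digit for digit, count in votes.items() if count == best)
```

A record whose true label is 8 is excluded by `training_rows(rows, 2, 4)`, but the (2,4) model still contributes one entry to that record's `pair_predictions` when `vote` is called.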
On Thu, Jan 10, 2019 at 12:20 PM Xiangrui Meng <men...@gmail.com> wrote:

> In your custom transformer that produces labels, can you filter null
> labels? A transformer doesn't always need to do 1:1 mapping.
>
> On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy
> <pmccar...@dstillery.com.invalid> wrote:
>
>> I'm trying to implement an algorithm on the MNIST digits that runs like
>> so:
>>
>>    - for every pair of digits (0,1), (0,2), (0,3)... assign a 0/1 label
>>    to the digits and build a LogisticRegression Classifier -- 45 in total
>>    - Fit every classifier on the test set separately
>>    - Aggregate the results per record of the test set and compute a
>>    prediction from the 45 predictions
>>
>> I tried implementing this with a Pipeline, composed of
>>
>>    - stringIndexer
>>    - a custom transformer which accepts a lower-digit and upper-digit
>>    argument, producing the 0/1 label
>>    - a custom transformer to assemble the indexed strings to VectorUDT
>>    - LogisticRegression
>>
>> fed by a list of paramMaps. It failed because the fit() method of
>> logistic couldn't handle cases of null labels, i.e. a case where my 0/1
>> transformer found neither the lower nor the upper digit label. I fixed this
>> by extending the LogisticRegression class and overriding the fit() method
>> to include a filter for labels in (0,1) -- I didn't want to alter the
>> transform method.
>>
>> Now, I'd like to tune these models using CrossValidator with an estimator
>> of pipeline but when I run either fitMultiple on my paramMap or I loop over
>> the paramMaps, I get arcane Scala errors.
>>
>>
>> Is there a better way to build this procedure? Thanks!
>>
>