Hi,

Is my understanding correct that TF-IDF is currently computed in three steps?

1) Apply HashingTF to the records to generate TF vectors.
2) Fit an IDF model on those TF vectors, which computes the document frequency (DF) of each term.
3) Transform the TF vectors into TF-IDF vectors by passing them to the IDF model.

    // 1
    textVectorPairs = PairRDD<text, Vector>; // text and its TF vector
    IDF idf = new IDF();

    // 2
    IDFModel idfModel = idf.fit(textVectorPairs.values());

    // 3
    textVectorPairs.mapValues(v -> idfModel.transform(v));

Is this correct? And can it be optimized by folding step 2 into step 1, that is, by computing the document frequencies (DF) in the same pass that computes TF?
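
To make the idea concrete, here is a minimal plain-Java sketch (no Spark involved) of what I mean by fusing steps 1 and 2. Everything in it is an assumption made for illustration: the class name `TfDfOnePass`, the `bucket` hashing helper, and the smoothed IDF formula log((m + 1) / (df + 1)), which is the form I believe MLlib uses. It is not how HashingTF or IDF are actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfDfOnePass {
    /** Holds one sparse TF map per document plus the shared DF counts. */
    public static final class Result {
        public final List<Map<Integer, Double>> tf = new ArrayList<>();
        public final long[] df;
        public long numDocs = 0;

        Result(int numFeatures) {
            df = new long[numFeatures];
        }
    }

    // Hypothetical hashing step: non-negative hash code modulo numFeatures.
    static int bucket(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Steps 1 and 2 fused: one pass builds each TF vector and updates DF.
    public static Result termAndDocFrequencies(List<List<String>> docs, int numFeatures) {
        Result r = new Result(numFeatures);
        for (List<String> doc : docs) {
            Map<Integer, Double> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(bucket(term, numFeatures), 1.0, Double::sum);
            }
            // Each bucket present in this document counts once toward DF.
            for (int idx : tf.keySet()) {
                r.df[idx]++;
            }
            r.tf.add(tf);
            r.numDocs++;
        }
        return r;
    }

    // Step 3: needs the final DF counts, so it stays a separate pass.
    public static List<Map<Integer, Double>> toTfIdf(Result r) {
        List<Map<Integer, Double>> out = new ArrayList<>();
        for (Map<Integer, Double> tf : r.tf) {
            Map<Integer, Double> scaled = new HashMap<>();
            for (Map.Entry<Integer, Double> e : tf.entrySet()) {
                // Assumed smoothed IDF: log((m + 1) / (df + 1)).
                double idf = Math.log((r.numDocs + 1.0) / (r.df[e.getKey()] + 1.0));
                scaled.put(e.getKey(), e.getValue() * idf);
            }
            out.add(scaled);
        }
        return out;
    }
}
```

In this sketch the DF counts come for free while the TF vectors are built, but the final scaling still has to wait until every document has been seen, so only steps 1 and 2 can be merged, not step 3.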