MLlib: TF-IDF Computation Improvement

Reth RM Wed, 14 Dec 2016 23:08:46 -0800

Hi,

Is my understanding correct that TF-IDF is currently computed in three
steps?
1) Apply HashingTF to the records to generate TF vectors.
2) Fit an IDF model on those TF vectors, which computes the document
frequency (DF) of each term.
3) Transform the TF vectors to TF-IDF by passing them to the fitted
IDF model.


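// 1) each text paired with its TF vector (produced by HashingTF)
JavaPairRDD<String, Vector> textVectorPairs = ...;
// 2) fit the IDF model; this pass computes the document frequencies
IDF idf = new IDF();
IDFModel idfModel = idf.fit(textVectorPairs.values());
// 3) rescale every TF vector to TF-IDF
JavaPairRDD<String, Vector> tfidfPairs =
    textVectorPairs.mapValues(v -> idfModel.transform(v));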

Is this correct? Could it be optimized by folding step 2 into step 1,
that is, computing the document frequencies (DF) along with the TF vectors?
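
To make the idea concrete, here is a minimal sketch in plain Java (no Spark) of gathering DF in the same pass that builds the TF vectors. The class and method names (SinglePassTfDf, tfAndDf, tfIdf) are hypothetical, and bucketing by String.hashCode() is an assumption made for brevity, not necessarily the hash that HashingTF uses. The IDF weight uses the smoothed formula log((m + 1) / (df + 1)) given in the MLlib documentation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Spark's implementation.
class SinglePassTfDf {

    /** TF vectors plus the document frequencies collected in the same pass. */
    static final class Result {
        final double[][] tf; // tf[d][i] = occurrences of bucket i in document d
        final long[] df;     // df[i]    = number of documents containing bucket i
        Result(double[][] tf, long[] df) { this.tf = tf; this.df = df; }
    }

    // Hashing trick: map a term to a bucket in [0, numFeatures).
    // Assumption: String.hashCode() stands in for the real hash function.
    static int bucket(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Steps 1 and 2 combined: one traversal yields both TF and DF.
    static Result tfAndDf(List<List<String>> docs, int numFeatures) {
        double[][] tf = new double[docs.size()][numFeatures];
        long[] df = new long[numFeatures];
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                int i = bucket(term, numFeatures);
                if (tf[d][i] == 0.0) {
                    df[i]++; // first time this bucket appears in document d
                }
                tf[d][i] += 1.0;
            }
        }
        return new Result(tf, df);
    }

    // Step 3: rescale TF by idf = log((m + 1) / (df + 1)), m = number of documents.
    static double[][] tfIdf(Result r) {
        int m = r.tf.length;
        double[][] out = new double[m][];
        for (int d = 0; d < m; d++) {
            out[d] = new double[r.df.length];
            for (int i = 0; i < r.df.length; i++) {
                out[d][i] = r.tf[d][i] * Math.log((m + 1.0) / (r.df[i] + 1.0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("b", "c"));
        Result r = tfAndDf(docs, 16);
        System.out.println(Arrays.toString(r.df));
        System.out.println(Arrays.deepToString(tfIdf(r)));
    }
}
```

Note that step 3 cannot be merged as well: a vector can only be rescaled once the DF counts for the whole corpus are known, so the saving would be one traversal of the data, not two. In Spark the same idea would presumably need the DF counts aggregated across partitions while the TF vectors are produced; that part is not shown here.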
