I am using the wholeTextFiles API to load a bunch of text files. I cache the resulting object, map each file's text content to a TF-IDF vector, and then run k-means on these vectors. After training, the k-means model predicts the cluster ID of each training document when given the list of training vectors. My question is: how do I map these predictions back to the entries of the wholeTextFiles object?
Use case: the input is a set of text files in a directory. I process the files and cluster them with k-means. As output, I want the cluster membership of each text file, so that I can take its content (already available from wholeTextFiles) and write it to a directory named after its cluster ID.
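To make the goal concrete, here is a minimal sketch in plain Python (not Spark) of the behaviour I am after: the file path stays attached to its content and vector at every step, so the predicted cluster ID can be matched to the file without relying on ordering. The functions `featurize` and `predict` are hypothetical stand-ins for the TF-IDF step and the trained k-means model, and the paths and contents are made up for illustration. In Spark I assume the equivalent would be to transform only the values of the (path, content) pairs and predict per vector, but that mapping is an assumption, not something shown here.

```python
import os
from collections import Counter


def featurize(text):
    """Hypothetical stand-in for the TF-IDF step: plain term counts."""
    return Counter(text.lower().split())


def predict(vector):
    """Hypothetical stand-in for the trained model's per-vector prediction."""
    return 0 if vector["spark"] >= vector["python"] else 1


def assign_clusters(files):
    """Take (path, content) pairs, the shape wholeTextFiles produces,
    and return (path, content, cluster_id) triples.
    The path is carried through every step and never dropped."""
    keyed = [(path, content, featurize(content)) for path, content in files]
    return [(path, content, predict(vec)) for path, content, vec in keyed]


def write_by_cluster(assigned, out_root):
    """Write each file's content to <out_root>/<cluster_id>/<file name>.
    Files sharing a base name within one cluster would overwrite each other."""
    for path, content, cluster_id in assigned:
        cluster_dir = os.path.join(out_root, str(cluster_id))
        os.makedirs(cluster_dir, exist_ok=True)
        target = os.path.join(cluster_dir, os.path.basename(path))
        with open(target, "w") as handle:
            handle.write(content)


# Made-up example input standing in for the loaded directory.
files = [
    ("/data/in/a.txt", "spark spark rdd"),
    ("/data/in/b.txt", "python pandas python"),
]
assigned = assign_clusters(files)
```

Calling `write_by_cluster(assigned, "some/output/dir")` would then create one subdirectory per cluster ID and place each file's content inside it.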