Hi all, I'm working on text mining using Mahout algorithms. I'm calculating the similarity between text documents. First I computed the TF-IDF vectors for all documents (SequenceFile format). While computing the similarity, I found that a lot of documents do not have any similar documents, so I would like to remove those documents from the SequenceFile vectors.
Any idea how to do that? Thanks in advance, Donni.