Hello everybody,

I'd like to discuss some issues with you regarding the third layer of
our proposed tuwoc architecture: feature extraction from the
preprocessed, crawled blog entries.

Currently we follow a rather simple process: for each document we
compute the TF-IDF of all terms in the corpus. This is implemented in
a straightforward way as a sequence of map/reduce jobs. First, a map
job computes (and serializes to HBase) a TF histogram for each
document. Then a reduce job computes the IDF of all terms occurring in
the corpus and serializes the list of term/IDF pairs to HDFS. Finally,
a third map job uses the serialized term/IDF pairs and TF histograms
to compute a feature vector for each document. So basically, our
feature space is the set of all term/IDF pairs.
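
Just so we are all talking about the same computation, here is a
minimal in-memory sketch of the three steps in plain Java, ignoring
all of the Hadoop/HBase plumbing (class and method names are made up,
and it uses the simple log(N/df) variant of IDF):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TfIdfSketch {

        // Step 1 (first map job): raw term counts ("TF histogram") per document.
        static List<Map<String, Integer>> termFrequencies(List<List<String>> docs) {
            List<Map<String, Integer>> tf = new ArrayList<>();
            for (List<String> doc : docs) {
                Map<String, Integer> counts = new HashMap<>();
                for (String term : doc) {
                    counts.merge(term, 1, Integer::sum);
                }
                tf.add(counts);
            }
            return tf;
        }

        // Step 2 (reduce job): idf(t) = log(N / df(t)), where df(t) is the
        // number of documents containing term t and N is the corpus size.
        static Map<String, Double> inverseDocumentFrequencies(List<Map<String, Integer>> tf) {
            Map<String, Integer> df = new HashMap<>();
            for (Map<String, Integer> doc : tf) {
                for (String term : doc.keySet()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
            Map<String, Double> idf = new HashMap<>();
            for (Map.Entry<String, Integer> e : df.entrySet()) {
                idf.put(e.getKey(), Math.log((double) tf.size() / e.getValue()));
            }
            return idf;
        }

        // Step 3 (second map job): one sparse TF-IDF vector per document.
        static List<Map<String, Double>> tfIdfVectors(List<Map<String, Integer>> tf,
                                                      Map<String, Double> idf) {
            List<Map<String, Double>> vectors = new ArrayList<>();
            for (Map<String, Integer> doc : tf) {
                Map<String, Double> vec = new HashMap<>();
                for (Map.Entry<String, Integer> e : doc.entrySet()) {
                    vec.put(e.getKey(), e.getValue() * idf.get(e.getKey()));
                }
                vectors.add(vec);
            }
            return vectors;
        }

        public static void main(String[] args) {
            List<List<String>> docs = Arrays.asList(
                    Arrays.asList("hadoop", "cluster", "blog"),
                    Arrays.asList("blog", "entry", "blog"));
            List<Map<String, Integer>> tf = termFrequencies(docs);
            System.out.println(tfIdfVectors(tf, inverseDocumentFrequencies(tf)));
        }
    }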

I currently see one major issue with this approach: our feature space
- and thus our feature vectors - will probably grow very large once
many documents have been scanned, which will obviously make the
clustering very slow. We will probably have to perform some kind of
feature reduction during the feature extraction to get smaller - but
still expressive - feature vectors. One idea would be, for example, to
perform PCA on the "complete" feature vectors in order to identify
dimensions that can be pruned. However, this might be computationally
too expensive. Since I am not very experienced in this field, I was
hoping that some of you could share your thoughts or suggestions on
the issue.
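
To make "some kind of feature reduction" a bit more concrete, one
cheap option - much cheaper than PCA - might be to prune terms by
document frequency before building the vectors, i.e. drop terms that
occur in almost no or in almost all documents. A rough sketch, working
on the same per-document TF maps as above (names made up, thresholds
chosen arbitrarily):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class DfPruningSketch {

        // Keep only terms whose document frequency lies in
        // [minDf, maxDfFraction * N]. Very rare terms add dimensions without
        // helping the clustering, and terms occurring in nearly every
        // document carry little discriminative information.
        static Set<String> selectTerms(List<Map<String, Integer>> tfPerDoc,
                                       int minDf, double maxDfFraction) {
            Map<String, Integer> df = new HashMap<>();
            for (Map<String, Integer> doc : tfPerDoc) {
                for (String term : doc.keySet()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
            int n = tfPerDoc.size();
            Set<String> keep = new HashSet<>();
            for (Map.Entry<String, Integer> e : df.entrySet()) {
                if (e.getValue() >= minDf && e.getValue() <= maxDfFraction * n) {
                    keep.add(e.getKey());
                }
            }
            return keep;
        }
    }

The terms returned by selectTerms would then be the only ones that
make it into the feature space; whether such a pruned space is still
expressive enough for the clustering is exactly the kind of thing I am
unsure about.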

Cheers,
Max
