It could also be due to the way in which the Pearson correlation is
calculated in both implementations.
The distributed implementation centers all item vectors, scales them to
unit length and computes dot products afterwards.
The single-machine implementation centers only based on the common
intersection.
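To make the difference concrete, here is a small sketch (illustrative only,
not Mahout's actual code; the class and method names are invented) that
contrasts the two centering strategies for two item vectors stored as sparse
user-to-rating maps:

import java.util.HashMap;
import java.util.Map;

// Illustrative only: contrasts full-vector centering (distributed style)
// with centering over the co-rated users only (single-machine style).
public class PearsonVariants {

  // "Distributed" style: center each item vector over all of its ratings,
  // normalize to unit length, then take the dot product.
  static double centeredCosine(Map<Long,Double> a, Map<Long,Double> b) {
    Map<Long,Double> ca = center(a);
    Map<Long,Double> cb = center(b);
    double dot = 0.0;
    for (Map.Entry<Long,Double> e : ca.entrySet()) {
      Double other = cb.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
    }
    return dot / (norm(ca) * norm(cb));
  }

  // "Single-machine" style: restrict both vectors to the users they have
  // in common first, and center/normalize over that intersection only.
  static double pearsonOverIntersection(Map<Long,Double> a, Map<Long,Double> b) {
    double sumA = 0, sumB = 0;
    int n = 0;
    for (Long user : a.keySet()) {
      if (b.containsKey(user)) {
        sumA += a.get(user);
        sumB += b.get(user);
        n++;
      }
    }
    if (n == 0) {
      return Double.NaN;
    }
    double meanA = sumA / n, meanB = sumB / n;
    double dot = 0, normA = 0, normB = 0;
    for (Long user : a.keySet()) {
      if (!b.containsKey(user)) {
        continue;
      }
      double da = a.get(user) - meanA;
      double db = b.get(user) - meanB;
      dot += da * db;
      normA += da * da;
      normB += db * db;
    }
    return dot / Math.sqrt(normA * normB);
  }

  // Helpers assume non-empty vectors.
  private static Map<Long,Double> center(Map<Long,Double> v) {
    double mean = 0;
    for (double x : v.values()) {
      mean += x;
    }
    mean /= v.size();
    Map<Long,Double> out = new HashMap<>();
    for (Map.Entry<Long,Double> e : v.entrySet()) {
      out.put(e.getKey(), e.getValue() - mean);
    }
    return out;
  }

  private static double norm(Map<Long,Double> v) {
    double sum = 0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return Math.sqrt(sum);
  }
}

For item pairs where one item has ratings outside the overlap, the two
quantities can differ noticeably, which would explain small discrepancies
between the similarities the two implementations produce.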
Raju,
like Sebastian said, it is probably due to the default sampling restriction
of the Hadoop-based implementation:

  maxPrefsPerUserInItemSimilarity -- "max number of preferences to consider
  per user in the item similarity computation phase, users with more
  preferences will be sampled down"
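Just to illustrate what that restriction does (a schematic sketch only, not
Mahout's actual sampling code; the names below are made up): a user with more
than the configured maximum keeps only a random subset of that size, and the
rest of the preferences are dropped before the similarity computation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Schematic illustration only, not Mahout's code.
public class PreferenceSamplingSketch {

  // Keep at most maxPrefsPerUser preferences, chosen at random;
  // users at or below the cap are left untouched.
  static <T> List<T> sampleDown(List<T> prefs, int maxPrefsPerUser, Random random) {
    if (prefs.size() <= maxPrefsPerUser) {
      return prefs;
    }
    List<T> copy = new ArrayList<>(prefs);
    Collections.shuffle(copy, random);
    return copy.subList(0, maxPrefsPerUser);
  }
}

If the sampling turns out to be the cause, raising that option (at the cost
of a longer-running job) should bring the distributed similarities closer to
the non-distributed ones.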
Hello,
I want to cluster documents. I see that Lucene can be of great help for the
pre-processing, e.g. 'StandardAnalyzer' to remove stop words, do stemming,
etc., since Mahout requires input in its vector format.
Mahout provides the following utilities for this, which I have used:
org.apache.mahout.utils.vectors.lucene.
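The pre-processing step I have in mind looks roughly like the sketch below
(written against the Lucene 3.x API; newer versions drop the Version argument
and change a few details). As far as I understand, StandardAnalyzer
tokenizes, lowercases and removes common English stop words, but it does not
stem by itself; stemming would need an extra filter such as PorterStemFilter.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Minimal sketch: run one document's text through StandardAnalyzer
// and print the resulting tokens.
public class AnalyzeExample {
  public static void main(String[] args) throws IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    TokenStream stream =
        analyzer.tokenStream("text", new StringReader("Some document text to cluster."));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
    analyzer.close();
  }
}

If I remember correctly, the Mahout utilities in that package read term
vectors out of a Lucene index, so the field to be vectorized needs to be
indexed with term vectors enabled.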
No, I suppose it just was never written back in the day. The way it is
structured now, it creates a test split for each user, which is also
slow, and may run into memory limitations, as that's N data
models in memory. You could take a crack at a patch.
When I rewrote this aspect in a separate
Hello,
Is there any good reason why the *GenericRecommenderIRStatsEvaluator* does
not support parallel (multi-CPU) evaluation? Today it is quite common to have
CPUs with more than one core, and IR evaluation on any reasonably sized
data set takes forever to finish. I'm wondering whether we could parallelize
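Since the per-user metrics are independent of one another, my rough idea
would be to fan the per-user work out over a thread pool and aggregate the
results at the end. A sketch, where evaluateUser() is just a hypothetical
stand-in for the per-user split/recommend/score step the evaluator currently
does sequentially:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Rough sketch only: per-user IR evaluation fanned out over a thread pool.
// UserResult and evaluateUser() are placeholders for the real computation.
public class ParallelIREvaluationSketch {

  static final class UserResult {
    final double precision;
    final double recall;
    UserResult(double precision, double recall) {
      this.precision = precision;
      this.recall = recall;
    }
  }

  static UserResult evaluateUser(long userID) {
    // placeholder: build the per-user test split, recommend, score
    return new UserResult(0.0, 0.0);
  }

  public static void main(String[] args) throws Exception {
    long[] userIDs = {1L, 2L, 3L};
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<UserResult>> futures = new ArrayList<>();
    for (long userID : userIDs) {
      futures.add(pool.submit(() -> evaluateUser(userID)));
    }
    double precisionSum = 0, recallSum = 0;
    for (Future<UserResult> f : futures) {
      UserResult r = f.get();
      precisionSum += r.precision;
      recallSum += r.recall;
    }
    pool.shutdown();
    System.out.printf("precision=%.4f recall=%.4f%n",
        precisionSum / userIDs.length, recallSum / userIDs.length);
  }
}

The main catch is the memory point raised above: each worker would hold its
own per-user test split, so the pool size might have to be bounded by memory
rather than by the number of cores.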