Assuming task memory x number of cores does not exceed ~5g, and the block cache
manager ratio does not have some really weird setting, the next best thing
to look at is the initial task split size. I don't think the driver in the
release you are looking at manages initial off-DFS splits satisfactorily
I’m trying to run Mahout 0.10 with Spark 1.1.1.
I have input files of 8k, 10M, 20M, and 25M.
So far I have run the following configurations:
8k with 1, 2, 3 slaves
10M with 1, 2, 3 slaves
20M with 1, 2, 3 slaves
But when I try to run
bin/mahout spark-itemsimilarity --master spark://node1:7077 --input
Oh, I thought kmeans gave me a point vector as a centroid, not a calculated
point central to a cluster. I guess in this case I would be looking for the
most central point vector (from the index) that I can use as a
representative of the cluster.
On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman
The most central point in a cluster is often referred to as a medoid
(similar to median, but multi-dimensional).
The Mahout code does not compute medoids. In general, they are difficult
to compute and implementing a full k-medoid clustering algorithm even more
so.
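For a single cluster that has already been formed, a brute-force medoid is easy to sketch, even though a full k-medoid clustering algorithm is much harder. The Python below is not Mahout code: the function name `medoid` and the sample points are made up for illustration, and it assumes Euclidean distance and a cluster small enough for an O(n^2) scan.

```python
import math

def medoid(points):
    """Return the index of the point with the smallest total distance to all others."""
    best_index, best_cost = None, math.inf
    for i, p in enumerate(points):
        # Total Euclidean distance from p to every point in the cluster
        cost = sum(math.dist(p, q) for q in points)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Hypothetical 2-D cluster; real document vectors would have many more dimensions
cluster = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, 9.0)]
print(cluster[medoid(cluster)])
```

The quadratic scan is the reason this only makes sense as a per-cluster step on modest cluster sizes.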
On Mon, Jul 20, 2015 at 6:25
That kind of puts me in a tough position. I was planning to use kmeans as a
method for aggregating similar articles from multiple news sources, and
then getting a representative article from those. Here I mean similar as in
the articles are from different news sources but are about the exact same
It's possible you could write a post-processing step to find the closest
point to the centroid based on the distance property if I'm recalling it
correctly.
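A post-processing step of that kind might look roughly like the sketch below. It is plain Python rather than Mahout code: the document ids, the vectors, and the function name `closest_to_centroid` are all hypothetical, and it assumes Euclidean distance. If the clustering output already records each point's distance to its centroid, that stored value could be used instead of recomputing it.

```python
import math

def closest_to_centroid(centroid, vectors):
    """Return (doc_id, distance) for the document vector nearest the centroid."""
    return min(
        ((doc_id, math.dist(centroid, vec)) for doc_id, vec in vectors.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical cluster: document ids mapped to their feature vectors
cluster_vectors = {
    "article-a": (0.9, 0.1),
    "article-b": (0.5, 0.5),
    "article-c": (0.1, 0.8),
}
# Hypothetical centroid, as k-means would report it for this cluster
centroid = (0.5, 0.4667)

doc_id, distance = closest_to_centroid(centroid, cluster_vectors)
print(doc_id, round(distance, 4))
```

This is a single linear pass per cluster, so it is much cheaper than computing a true medoid.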
On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel ankitgoel2...@gmail.com wrote:
That kind of puts me in a tough position. I was planning to use
I'm not sure centroid id is even a defined thing, especially since the
centroid, in my understanding, is just a point in space, not necessarily a
point in your data.
Are you trying to find the most-central point in a given cluster?
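To illustrate why a centroid has no document id, here is a small Python sketch (not Mahout code, with made-up points and a hypothetical helper named `centroid`) that computes a centroid as the component-wise mean and checks whether it coincides with any input point.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Four hypothetical data points at the corners of a square
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
c = centroid(points)

# The mean lies in the middle of the square, which is not one of the inputs
print(c, c in points)
```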
On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel
Hi,
I was wondering if it's possible to use only a partial Solr index for
clustering. For example, my crawler updates my Solr index every hour with
new documents, and I just want to cluster those new documents, not the old
ones. If I was programming normally, I could query Solr for the latest
Hmm, kmeans algorithmically is supposed to only anoint existing
vectors (documents) as the centroid for a cluster every step (or so I
believe). If Mahout is generating a non-document vector as a centroid, it
changes a lot of things.
That would also explain the -distanceMeasure option in
You can always just pick the article closest to the centroid.
But I think that you may find that with simple k-means the clusters are
going to be about more than one thing.
On Mon, Jul 20, 2015 at 8:21 PM, Ankit Goel ankitgoel2...@gmail.com wrote:
Hmm, kmeans algorithmically is supposed to
True that. Kmeans is just a first step anyways. Definitely needs tuning.
Thanks guys
On Tue, Jul 21, 2015 at 9:46 AM, Ted Dunning ted.dunn...@gmail.com wrote:
You can always just pick the article closest to the centroid.
But I think that you may find that with simple k-means the clusters are