Re: Submitting mahout jobs to map/reduce cluster with fair scheduling

2012-11-08 Thread Jeff Eastman
That Job extends org.apache.mahout.common.AbstractJob, so it probably will accept a -D argument to set "mapred.fairscheduler.pool=..." . Have you tried this? On 11/8/12 3:41 PM, Yazan Boshmaf wrote: Hello, I'm trying to run the ASF Email example here: https://cwiki.apache.org/confluence/disp

Re: Introduction to Apache Mahout K-means clustering

2012-11-13 Thread Jeff Eastman
See the response "Re: Clustering without hadoop" by Johannes Schulte two postings earlier than yours on user@m.a.o. The driver functions can also be run in sequential mode from a local file system and do not require Hadoop. There are several examples of Java invocation in the unit tests. TestCl

Re: Issue: Canopy is processing extremly slow, what goes wrong?

2012-11-13 Thread Jeff Eastman
Canopy is very sensitive to the value of T2: Too small a value will cause the creation of very many canopies in each mapper and these will swamp the reducer. I suggest you begin with T1=T2= until you get enough canopies. With CosineDistanceMeasure, a value of 1 ought to produce only a single

Re: Issue: Canopy is processing extremly slow, what goes wrong?

2012-11-14 Thread Jeff Eastman
implementation of the code or i am setting the way too off-set value for parameters? is there any more info that i could provide to help you to help me analyze the issue? if i set t3,t4, would it help? thanks On Tue, Nov 13, 2012 at 10:01 PM, Jeff Eastman wrote: Canopy is very sensitive to the value

Re: Empty clusteredPoints after Dirichlet clustering

2012-11-28 Thread Jeff Eastman
The classification phase of Dirichlet uses a most-likely assignment of points to clusters by default. This means that, unlike the training phase where points are assigned statistically to likely clusters, the classification may result in empty clusters even though those clusters have nonzero co
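The difference between the two assignment styles can be sketched in a few lines of plain Python. This is not Mahout code: the function names and the probabilities are made up for illustration, and only the contrast between "most likely" and "drawn in proportion to probability" comes from the post above.

```python
import random

def most_likely(pdfs):
    """Classification-style choice: the single most probable cluster."""
    return max(range(len(pdfs)), key=lambda k: pdfs[k])

def sampled(pdfs, rng):
    """Training-style choice: draw a cluster in proportion to its probability."""
    return rng.choices(range(len(pdfs)), weights=pdfs, k=1)[0]

# Hypothetical probabilities of one point under three cluster models,
# repeated for 200 similar points.
points = [[0.6, 0.3, 0.1]] * 200

rng = random.Random(42)
train_counts = [0, 0, 0]
classify_counts = [0, 0, 0]
for pdfs in points:
    train_counts[sampled(pdfs, rng)] += 1
    classify_counts[most_likely(pdfs)] += 1

print("training-style counts:", train_counts)
print("most-likely counts:   ", classify_counts)
```

With most-likely assignment every point lands in the first cluster, so the other two come out empty even though the sampled assignment gave them points.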

Re: Is the implementation of CIMapper thread safe ?

2012-12-21 Thread Jeff Eastman
Hi Yunming, The problem I see with what you are proposing is that Hadoop only gives you a single input vector per call of CIMapper.map(). Using multiple threads to perform the body of that method would not be of benefit. If you want to experiment with thread-based concurrent execution of that

Re: Is the implementation of CIMapper thread safe ?

2012-12-22 Thread Jeff Eastman
ism within a mapper, Thanks again everyone! Yunming On Sat, Dec 22, 2012 at 12:48 AM, Jeff Eastman wrote: Hi Yunming, The problem I see with what you are proposing is that Hadoop only gives you a single input vector per call of CIMapper.map(). Using multiple threads to perform the body of t

Re: About Dirichlet clustering's threshold

2012-12-25 Thread Jeff Eastman
Here's a response to a similar question from a couple of months ago: The classification phase of Dirichlet uses a most-likely assignment of points to clusters by default. This means that, unlike the training phase where points are assigned statistically to likely clusters, the classification m

Re: About Dirichlet clustering's threshold

2012-12-26 Thread Jeff Eastman
on. I may be wrong but this is bug? Thanks, Yoshihiro. 2012/12/26 Jeff Eastman Here's a response to a similar question from a couple of months ago: The classification phase of Dirichlet uses a most-likely assignment of points to clusters by default. This means that, unlike the trai

Re: Seeding k-means with canopy clustering / Filter canopies

2013-01-05 Thread Jeff Eastman
Depending upon your data, 0.7 Canopy can be extremely sensitive to the value you specify for T2. Somewhere between the larger T2 value that yields 1 canopy and the smaller T2 value that yields "the wrong number of [i.e. too many] centroids" lies a value that will give you fewer centroids. You c

Re: machine learning algorithm giving wrong results

2013-01-10 Thread Jeff Eastman
To unsubscribe from this list, send an email to user-unsubscr...@mahout.apache.org On 1/10/13 11:14 AM, Walshe, Maurice (RBI-UK) wrote: unsubscribe -Original Message- From: akshay shetye [mailto:akshay.she...@gmail.com] Sent: 10 January 2013 14:40 To: user@mahout.apache.org Subject: Re

Re: Figuring out good values for t1 and t2 for canopy

2013-02-01 Thread Jeff Eastman
I know of no reliable way to avoid some iteration in setting the T values for Canopy, but T1 really has no impact on the number of clusters, so setting T1==T2 and experimenting with T2 will reduce your search space. On 2/1/13 6:29 AM, Chris Harrington wrote: Seems my lack of any clusters what
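A toy single-pass canopy in plain Python shows why only T2 matters for the count. This is a sketch on 1-D points, not Mahout's implementation: `canopy_centers` is a hypothetical helper, and each canopy simply keeps its first point as its center.

```python
def canopy_centers(points, t1, t2):
    """Single-pass canopy sketch on 1-D points; returns the canopy centers."""
    centers = []   # first point of each canopy
    members = []   # points within T1 of each center (loosely bound)
    for p in points:
        strongly_bound = False
        for i, c in enumerate(centers):
            d = abs(p - c)
            if d < t1:
                members[i].append(p)
            if d < t2:
                strongly_bound = True   # within T2: cannot start a new canopy
        if not strongly_bound:
            centers.append(p)
            members.append([p])
    return centers

points = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

for t2 in (0.6, 1.1, 5.0):
    print("T2 =", t2, "->", len(canopy_centers(points, t2, t2)), "canopies")

# Changing T1 while holding T2 fixed leaves the count unchanged.
print(len(canopy_centers(points, 1.1, 1.1)), len(canopy_centers(points, 3.0, 1.1)))
```

T1 only changes which points are loosely attached to each canopy, which is why fixing T1==T2 and varying T2 is enough when the goal is a particular number of canopies.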

Re: intial centriods for fuzzy k means algorithm

2013-02-01 Thread Jeff Eastman
If you don't specify a -k value but specify a -ci directory that contains clusters you want to use for the prior then the ClusterIterator will use them for kmeans and fuzzyk. You will need to create one or more sequence files containing ClusterWritables to do this. On 2/1/13 9:08 AM, sri krish

Re: intial centriods for fuzzy k means algorithm

2013-02-04 Thread Jeff Eastman
) to Cluster type ? From: Jeff Eastman To: user@mahout.apache.org Sent: Saturday, 2 February 2013 12:18 AM Subject: Re: intial centriods for fuzzy k means algorithm If you don't specify a -k value but specify a -ci directory that contains clusters you

Re: Does something like an "explain" feature exist in Mahout for clustering.

2013-02-04 Thread Jeff Eastman
That's a really good question. Mahout does not have an "explain" feature; however, you can use the ClusterDumper to print out the cluster centers and vectors clustered within each cluster. Output is pretty verbose and, with large text vectors being truncated, might not be that useful. You might

Re: Clustering error

2013-02-04 Thread Jeff Eastman
Kinda looks like you didn't specify the right input file. That job expects the delimited values from the synthetic control download, converts them to vectors and clusters them. The vectors are of cardinality 60 but somehow your input data generated 1151 elements. I'd look there. On 2/4/13 2:

Re: ClusterOutputPostProcessorDriver - strange numbering of generated output foldersas

2013-02-04 Thread Jeff Eastman
Maybe a typo? I would expect folder 16 to follow folder 15. For many reasons, though, the cluster numbers may not be monotonic. I suggest you just iterate over the directories that are presented; their names should correspond to the clusterIds that exist in your clusters-final directory. On 2/4/13

Re: Regarding mahout clustering algorithms

2013-02-06 Thread Jeff Eastman
Note that the old clustering algorithms also run without Hadoop in sequential execution mode from the local file system. On 2/6/13 11:04 AM, Tanguy tlrx wrote: Thanks! -- Tanguy 2013/2/6 Ted Dunning https://github.com/tdunning/knn/ especially the docs directory On Wed, Feb 6, 2013 at 7:5

Re: Regarding mahout clustering algorithms

2013-02-08 Thread Jeff Eastman
ingle process in memory. On 2/7/13 4:32 AM, vivek bairathi wrote: Hi Jeff, Can you name some? On Thu, Feb 7, 2013 at 12:33 AM, Jeff Eastman wrote: Note that the old clustering algorithms also run without Hadoop in sequential execution mode from the local file system. On 2/6/13 11:04 AM, Tanguy

Re: Dirichlet process clustering

2013-02-08 Thread Jeff Eastman
What kind of data are you clustering? Which model distribution are you using? How many iterations are you running? How do the cluster n= values change as you increase the number of iterations? On 2/7/13 11:35 AM, Aysu Ezen wrote: Hello, I am having difficulty with Dirichlet process clusteri

Re: how to use a custom distance measure with kmeans?

2013-02-12 Thread Jeff Eastman
You also need to specify a fully-qualified class name On 2/12/13 11:48 AM, Dan Filimon wrote: You need to add the JAR containing the distance measure you want to the classpath. By default the CLASSPATH is set in line 120 of the mahout script. (the script itself is in the bin/ folder of your Maho

Re: How to pick t1 and t2 in canopy

2013-03-07 Thread Jeff Eastman
This is a common question and you can search the email archives for more discussion. Start by setting t1 == t2 as t2 is the variable that controls the number of clusters produced. Then iterate to find a value that gives you the "right" number of clusters. Smaller values of t2 yield larger numbe
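The iteration on t2 can be automated roughly as a bisection. The sketch below is plain Python on 1-D points, with hypothetical helper names, and it assumes the canopy count shrinks as t2 grows, which usually holds but is not guaranteed for every data set.

```python
def count_canopies(points, t2):
    """Number of canopies from a single pass with t1 == t2 (1-D sketch)."""
    centers = []
    for p in points:
        if all(abs(p - c) >= t2 for c in centers):
            centers.append(p)
    return len(centers)

def find_t2(points, target, lo, hi, steps=30):
    """Bisect on t2 until the canopy count matches target, or give up."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        n = count_canopies(points, mid)
        if n == target:
            return mid
        if n > target:
            lo = mid   # too many canopies: raise t2
        else:
            hi = mid   # too few canopies: lower t2
    return None

points = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
t2 = find_t2(points, target=3, lo=0.1, hi=5.0)
print("t2 =", t2, "gives", count_canopies(points, t2), "canopies")
```

On real data each probe is a full canopy run, so it pays to start from a sample of the input.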

Re: KMeans Throwing Hadoop write errors for large values of K

2013-03-08 Thread Jeff Eastman
I don't know where the timeout is happening, but each mapper and each reducer writes all its clusters out at the end of its run. With a large number of clusters, and with the non-sparse center and radius vectors that tend to accumulate, this could take a while... On 3/8/13 9:46 AM, Colum Foley

Re: KMean cluster produces more clusters then requested

2013-03-09 Thread Jeff Eastman
Unfortunately, all attachments are stripped by the Apache mail server. You will need to open a JIRA to get those attachments to us. You could, however, also tell us a bit more about your example: which algorithm are you running and what is your command line? I cannot think of any way that the 0

Re: Retrieving Fuzzy Cluster Probabilities

2013-03-22 Thread Jeff Eastman
On 3/22/13 10:39 AM, Sebastian Briesemeister wrote: Dear all, I am facing troubles when retrieving the cluster probabilities of instances: I am clustering instances using the FuzzyKMeansDriver. Afterwards, I am reading instances of WeightedVectorWritable from the clusteredPoints file (e.g. part

Re: Fuzyy Clustering accumulates lots of memory

2013-03-29 Thread Jeff Eastman
Fuzzy KMeans will use a lot of heap memory because every vector is observed (with weighting) by every cluster. This will make the cluster centers (and other vectors) much more dense than with any of the other clustering algorithms. Figure you are storing 90k doubles in each vector and each clus
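A back-of-the-envelope version of that estimate, in Python. Only the 90k doubles per vector comes from the post; the number of dense vectors kept per cluster and the number of clusters are hypothetical values chosen to make the arithmetic concrete.

```python
BYTES_PER_DOUBLE = 8

def dense_cluster_bytes(cardinality, vectors_per_cluster, num_clusters):
    """Rough heap needed for fully dense cluster vectors, ignoring object overhead."""
    per_vector = cardinality * BYTES_PER_DOUBLE
    return per_vector * vectors_per_cluster * num_clusters

per_vector = 90_000 * BYTES_PER_DOUBLE
# Assumed for illustration: 4 dense vectors per cluster, 100 clusters.
total = dense_cluster_bytes(90_000, vectors_per_cluster=4, num_clusters=100)

print("bytes per dense vector:", per_vector)
print("bytes for all clusters:", total)
```

Plug in the real cluster count and per-cluster vector count to see whether the result fits in the task heap.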

Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.

2013-05-14 Thread Jeff Eastman
Hi Erinn, The radius calculation in KMeans and other clustering algorithms uses a running sums algorithm (see RunningSumsGaussianAccumulator) and the radius is really the standard deviation produced by this method. In this method (as you likely know) s0 is the number of points observed, s1 is
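For reference, a running-sums standard deviation looks roughly like this in plain Python. This is a 1-D sketch, not the Mahout class: s0 as the point count is from the post, while treating s1 as the sum of values and s2 as the sum of squares is the usual convention and an assumption here.

```python
import math

class RunningSums:
    """1-D running-sums accumulator (sketch)."""

    def __init__(self):
        self.s0 = 0.0   # number of points observed
        self.s1 = 0.0   # sum of the values (assumed meaning)
        self.s2 = 0.0   # sum of the squared values (assumed meaning)

    def observe(self, x):
        self.s0 += 1.0
        self.s1 += x
        self.s2 += x * x

    def mean(self):
        return self.s1 / self.s0

    def std(self):
        # Guard against a tiny negative value caused by rounding.
        scaled_var = max(self.s0 * self.s2 - self.s1 * self.s1, 0.0)
        return math.sqrt(scaled_var) / self.s0

acc = RunningSums()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    acc.observe(x)
print(acc.mean(), acc.std())

single = RunningSums()
single.observe(3.0)
print(single.std())
```

A cluster that has observed a single point, or several identical points, gets a standard deviation of exactly zero from this formula.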

Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.

2013-05-15 Thread Jeff Eastman
classification results. *From:*Jeff Eastman [mailto:jeast...@windwardsolutions.com] *Sent:* Tuesday, May 14, 2013 11:10 AM *To:* Erinn Schorsch *Cc:* user@mahout.apache.org *Subject:* Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases. Hi Erinn, The

Re: Mahout Cluster attributes

2013-05-24 Thread Jeff Eastman
Another option would be to add a new command line option to the ClusterDumper to produce the abbreviated output you desire. Then you could submit it as a patch and everybody could benefit. Offhand, this seems like a useful output representation. Jeff On 5/24/13 6:57 AM, Rajesh Nikam wrote:

Re: k-means issues

2013-08-01 Thread Jeff Eastman
The clustering arguments are usually directories, not files. Try: mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints On 8/1/13 2:51 PM, Marco wrote: mahout

Re: CDBw Usage

2013-09-08 Thread Jeff Eastman
Hi Pablo, Look in the CDBw unit tests for examples of invoking it from Java code. Jeff On Sep 6, 2013, at 5:56 PM, Pablo Andretta Jaskowiak wrote: > Hello, > > I'm trying to use the CDBw implementation from Mahout. Given that I > have a dataset in CSV format and a clustering solution how ca

Re: dirichlet clustering with named vectors

2013-10-07 Thread Jeff Eastman
I've not tried this particular use case personally but it should be possible. NamedVectors flow transparently through k-means and Dirichlet, indeed they should be transparent in all Vector clustering. When you run the final clustering (-cl) step with Dirichlet, this will use the final iteration'

Re: AbstractCluster() constructor

2013-11-04 Thread Jeff Eastman
I can't think of a good reason not to do this, but I've not been in that code for a while. If you make that change, do all of the unit tests still run? Can you write another test that uses DenseVectors? If yes, then please submit a JIRA patch. Jeff On Nov 3, 2013, at 1:30 PM, "DeBarr, Dave"

RE: Problems running examples

2011-08-31 Thread Jeff Eastman
, August 31, 2011 12:16 PM To: Jeff Eastman Cc: user@mahout.apache.org Subject: Re: Problems running examples On 10 June 2011 18:34, Jeff Eastman wrote: > I'm still trying to figure out why reuters-0.5 does not work on either of my > clusters. The scripts themselves have no diff and the

RE: why so many place does`t set job.setNumReduceTasks

2011-09-13 Thread Jeff Eastman
You can use -Dmapred.reduce.tasks=n to set the number of reducers for most Mahout CLI jobs. Just be sure it is the first argument. -Original Message- From: myn [mailto:m...@163.com] Sent: Tuesday, September 13, 2011 9:15 AM To: user@mahout.apache.org Subject: why so many place does`t set

RE: Error while running any clustering tasks

2011-09-13 Thread Jeff Eastman
Looks like your input data is not numeric: "2Yu�3.1_0�osLinux". The InputMapper is barfing trying to convert this into a double. -Original Message- From: Varun Thacker [mailto:varunthacker1...@gmail.com] Sent: Tuesday, September 13, 2011 10:09 AM To: user@

RE: Making Mahout cluster results more like Cluto's?

2011-09-16 Thread Jeff Eastman
See inline comments -Original Message- From: Nivlem Trahm [mailto:nivle...@yahoo.com] Sent: Friday, September 16, 2011 12:32 PM To: user@mahout.apache.org Subject: Making Mahout cluster results more like Cluto's? Hi, I evaluated Cluto some time ago, and the results I was getting from

RE: Clustering : Number of Reducers

2011-09-19 Thread Jeff Eastman
Actually, most of the clustering jobs (including DirichletDriver) accept the -Dmapred.reduce.tasks=n argument as noted below. Canopy is the only job which forces n=1 and this is so the reducer will see all of the mapper outputs. Generally, by adjusting T2 & T1 to suitably-large values you can ge

RE: Clustering : Number of Reducers

2011-09-20 Thread Jeff Eastman
write canopies in a filesystem. This is done as a mapreduce job. Then the KMeansDriver needs these canopy points as input to run KMeans. On 20-09-2011 01:39, Jeff Eastman wrote: > Actually, most of the clustering jobs (including DirichletDriver) accept the > -Dmapred.reduce.tasks=n ar

RE: Clustering : Number of Reducers

2011-09-20 Thread Jeff Eastman
nique would you suggest to cluster really big data ( considering performance and big size as parameters )? Thanks and Regards, Paritosh Ranjan On 20-09-2011 21:35, Jeff Eastman wrote: > Well, while it is true that the CanopyDriver writes all its canopies to the > file system, they are written at

RE: Clustering : Number of Reducers

2011-09-20 Thread Jeff Eastman
g single vector ). On 20-09-2011 22:56, Jeff Eastman wrote: > I guess it depends upon what you expect from your HUGE data set: How many > clusters do you believe it contains? A hundred? A thousand? A million? A > billion? With the right T-values I believe Canopy can handle the first t

RE: Clustering : Number of Reducers

2011-09-20 Thread Jeff Eastman
ersion of clustering to a "persisted" one. The current implementation is not scalable. I have a valid business scenario with 5 million clusters, and I think there would be more users with bigger datasets/cluster numbers. Thanks and Regards, Paritosh Ranjan On 20-09-2011 23:35, Jeff Eastman wr

RE: Clustering : Number of Reducers

2011-09-20 Thread Jeff Eastman
would be more users with bigger > datasets/cluster numbers. > > > Thanks and Regards, > Paritosh Ranjan > > On 20-09-2011 23:35, Jeff Eastman wrote: > >> As all the Mahout clustering implementations keep their clusters in >> memory, I don't believe any of t

RE: How much memory do I need? : Clustering : Hadoop

2011-09-26 Thread Jeff Eastman
This is a common problem with canopy, since it is single-pass and uses a single reducer that must see the outputs of all mappers. You can adjust T2 upward and that will reduce the number of canopies produced by each mapper. T1 does not affect the number of canopies, only their centroid calculati

RE: Clustering : Number of Reducers

2011-09-26 Thread Jeff Eastman
>>> I think, two improvements, can be applied to the current algorithm. >>> >>> 1) To ask for minimum number of vectors to be inside a >>> canopy/cluster, or >>> the cluster is discarded. >>> 2) To change this "in memory" version of clus

RE: Clustering based on Similarity matrix

2011-09-30 Thread Jeff Eastman
The spectral clustering uses similarity matrices. We have no hierarchical implementation. -Original Message- From: prasenjit mukherjee [mailto:prasen@gmail.com] Sent: Thursday, September 29, 2011 10:44 AM To: user@mahout.apache.org Subject: Clustering based on Similarity matrix I ha

RE: Difference in results : Clustering : sequential and MapReduce

2011-10-03 Thread Jeff Eastman
The sequential and mapreduce implementations do not produce the same results, as the sequential implementation runs canopy once and the mapreduce implementation twice: in each mapper and in the reducer. This is documented in https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering (s
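The effect of running canopy once versus twice can be seen on a toy example. The Python sketch below uses 1-D points, T1 == T2, and keeps each canopy's first point as its center; the split into two "mappers" is made up for illustration and none of this is Mahout code.

```python
def canopy_centers(points, t2):
    """Single pass with T1 == T2; each canopy keeps its first point as center."""
    centers = []
    for p in points:
        if all(abs(p - c) >= t2 for c in centers):
            centers.append(p)
    return centers

points = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
t2 = 1.1

# Sequential: one pass over all the points.
sequential = canopy_centers(points, t2)

# MapReduce-style: one pass per input split, then a second pass
# over the centers the "mappers" produced.
splits = [points[:4], points[4:]]
mapper_out = [c for split in splits for c in canopy_centers(split, t2)]
mapreduce = canopy_centers(mapper_out, t2)

print("sequential:", sequential)
print("mapreduce: ", mapreduce)
```

The two runs agree on the number of canopies here but not on where they sit, and with other splits the counts can differ as well.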

RE: Difference in results : Clustering : sequential and MapReduce

2011-10-03 Thread Jeff Eastman
Well, the default clusterFilter == 0, so this is not the difference between the implementations. When you talk about distributing similar vectors to each mapper, you are really moving into a hierarchical clustering method where you cluster your input points into a few large clusters and then clu

RE: Dirichlet Process Clustering not working

2011-10-18 Thread Jeff Eastman
Check out TestClusterDumper.testDirichlet2&3 for an example of text clustering using DPC. It produces reasonable looking clusters when compared with k-means and the other algorithms, but on a small vocabulary. Also check out DisplayDirichlet, which does a great job of clustering some random 2-d

RE: Dirichlet Process Clustering not working

2011-10-18 Thread Jeff Eastman
ative values are coming from. This deserves further exploration, which I am doing... -Original Message- From: Jeff Eastman [mailto:jeast...@narus.com] Sent: Tuesday, October 18, 2011 9:24 AM To: user@mahout.apache.org Subject: RE: Dirichlet Process Clustering

RE: Dirichlet Process Clustering not working

2011-10-19 Thread Jeff Eastman
I agree something is amiss here, but it could be the model is just not suitable for this problem. Running with the Reuters dataset, I see all the points being assigned to C-0 in the very first iteration as you do. I think the problem is with the pdf() calculations in the mapper for very wide vec

RE: Dirichlet Process Clustering not working

2011-10-19 Thread Jeff Eastman
tQuick(i), getCenter().getQuick(i), getRadius().getQuick(i) + 0.01); } return pdf; } -Original Message- From: Jeff Eastman [mailto:jeast...@narus.com] Sent: Wednesday, October 19, 2011 9:04 AM To: user@mahout.apache.org Subject: RE: Dirichlet Process Clustering not wo

RE: Dirichlet Process Clustering not working

2011-10-19 Thread Jeff Eastman
-mp org.apache.mahout.math.DenseVector \ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -Original Message- From: Jeff Eastman [mailto:jeast...@narus.com] Sent: Wednesday, October 19, 2011 9:53 AM To: user@mahout.apache.org Subject: RE: Dirichlet Process Clustering not working Th

RE: Dirichlet Process Clustering not working

2011-10-31 Thread Jeff Eastman
ave played around with Reuters set. Ed ps. The runtime has indeed reduced significantly!!! Possibly 100 times faster as you said. Loved it!! 2011/10/20 Jeff Eastman > R1186452 commits two small changes that seem to do much better with Reuters > than before: > - fixed Dista

RE: ClusterDumper issue

2011-10-31 Thread Jeff Eastman
Unfortunately, the cluster dumper loads all the points into memory so it can sort them by cluster for display. What are you trying to do with the 20M points? Certainly not display them! A better step for subsequent processing would be for you to write a short MR program to read in the clusteredP

RE: does anyone use the "row label bindings" stuff in Vector / Matrix?

2011-11-02 Thread Jeff Eastman
+1 from me too. IIRC this all got added when we were annotating Vectors too and there we ended up with NamedVector as a wrapper. If this Matrix annotation is not being used then let's clean it up. -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Wednesday, Novem

RE: meanshift clustering

2011-11-09 Thread Jeff Eastman
See inline, Jeff -Original Message- From: gaurav redkar [mailto:gauravred...@gmail.com] Sent: Wednesday, November 09, 2011 4:09 AM To: user@mahout.apache.org Subject: meanshift clustering Hi.. I am unable to identify where is the clusterPoints() function in the MeanShiftCanopyClusterer.j

RE: incosistent output while using clusterdumper

2011-11-14 Thread Jeff Eastman
Check out AbstractCluster.formatVector(). If the vector is sparse or if bindings are present it uses the index:value notation, else it uses the more compact notation. -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Friday, November 11, 2011 7:22 AM To: user@m

RE: meanshift clustering

2011-11-14 Thread Jeff Eastman
rt-m-* correct..?? On Wed, Nov 9, 2011 at 11:27 PM, Jeff Eastman wrote: > See inline, > Jeff > > -Original Message- > From: gaurav redkar [mailto:gauravred...@gmail.com] > Sent: Wednesday, November 09, 2011 4:09 AM > To: user@mahout.apache.org > Subject: me

RE: Dirichlet Clustering Output

2011-11-14 Thread Jeff Eastman
Sorry for the delay in responding. By now you may have already figured this out. If not: 1. Did you specify the -cl option on Dirichlet to emit the clusteredPoints directory? The default is not to do so. 2. Did you specify the -p option on ClusterDumper to use that directory? 3. Which model are

RE: Exceptions when running kmeans from the mahout launcher

2011-11-16 Thread Jeff Eastman
Usually we see this error when the expected input vectors are not present. Often it is a configuration issue. Verify your paths exist and that there are vectors where you expect them. -Original Message- From: Ahmad Ammari [mailto:ammari...@gmail.com] Sent: Wednesday, November 16, 2011 1

RE: NewsKMeansClustering does not find any clusters!

2011-11-16 Thread Jeff Eastman
K-means is attempting to load your initial clusters and is not finding any. Have you checked your -c path? You can also add -xm sequential so you can run the sequential algorithm. This allows you to use a debugger to verify your paths. -Original Message- From: Ahmad Ammari [mailto:ammar

RE: OutofMemoryError when running kmeans or fuzzykmeans cluster method

2011-11-17 Thread Jeff Eastman
How did you set the heap sizes? If you are running on a cluster you need to add properties to your mapred-site.xml. Something like this: mapred.map.child.java.opts -Xmx1500m Java opts for the map tasks. MapR: Default heapsize(-Xmx) is determined by memory reserved for mapreduc

Back On The Grid

2011-12-01 Thread Jeff Eastman
Hi to all of you who may have been wondering what ever happened to Jeff. I've unpacked most of my boxes in Northglenn, CO and am now, mostly, back on line. Of course, there have been over a thousand Mahout postings during my two week absence, so it may take me a while to wade through all of the

Re: Word and Phrase Clustering

2011-12-01 Thread Jeff Eastman
Could you elaborate a bit on what you mean by "cluster a collection of words and phrases by syntactic similarity over a distributed environment "? If you can describe your collection in terms of a set of (sparse or dense) term vectors then you should be able to use Mahout clustering directly. T

Re: DisplayKMean

2011-12-03 Thread Jeff Eastman
I usually run it from Eclipse too, but ./bin/mahout org.apache.mahout.clustering.display.DisplayKMeans just ran fine for me on trunk On 12/2/11 3:47 PM, Grant Ingersoll wrote: You should just be able to run it. It doesn't take in any input. "sample" and "output" are just the names of the d

Re: DisplayKMean

2011-12-03 Thread Jeff Eastman
You also want to run locally. unset HADOOP_HOME unset HADOOP_CONF_DIR On 12/3/11 11:03 AM, Jeff Eastman wrote: I usually run it from Eclipse too, but ./bin/mahout org.apache.mahout.clustering.display.DisplayKMeans just ran fine for me on trunk On 12/2/11 3:47 PM, Grant Ingersoll wrote

Re: MeanShiftCanopyDriver Output

2011-12-06 Thread Jeff Eastman
You will need to wrap your input vectors in a NamedVector, using your document ids as the names. These will pass through the clustering process and you will be able to map each clustered vector back to your input that way. On 12/5/11 5:02 PM, Neil Chaudhuri wrote: I am attempting to programma

Re: Clustering - k-means as a search

2011-12-19 Thread Jeff Eastman
The KMeansDriver has a method (clusterData) which you can invoke from a Java program to cluster (classify) your new data with the old clusters. You need to be sure the vectors are the same size (and the elements denote the same attributes) for this to work. There is currently no CLI to invoke t
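Conceptually, classifying new data against existing clusters is a nearest-center lookup with a size check. The Python sketch below illustrates that idea only: it is not the KMeansDriver API, and the cluster ids, centers and squared-Euclidean distance are assumptions for the example.

```python
def nearest_cluster(vector, centers):
    """Return the id of the closest center; all vectors must have the same size."""
    best_id, best_dist = None, float("inf")
    for cluster_id, center in centers.items():
        if len(center) != len(vector):
            raise ValueError(
                "cardinality mismatch: %d vs %d" % (len(vector), len(center)))
        dist = sum((x - y) ** 2 for x, y in zip(vector, center))
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id

# Hypothetical centers from an earlier clustering run.
centers = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}

print(nearest_cluster([1.0, 2.0], centers))
print(nearest_cluster([9.0, 8.0], centers))
```

The size check is the part that bites in practice: new vectors built with a different dictionary will have the wrong cardinality, or the right cardinality with elements that mean something else.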

Re: KMeans - getting gibrish output and running options

2011-12-27 Thread Jeff Eastman
Mahout in general uses sequence files for input and output. These are binary encoded files that can only be read by a compatible program. If you are trying to e.g. less .../part-xxx then you won't see much that is human readable. You can run the ClusterDumper to get human readable output from r

Re: number of clusters (Canopy Clustering)

2012-01-08 Thread Jeff Eastman
I'm almost certain there is no current way to do this from the command line. You could write a small utility to do this (see CanopyClusterer.buildClustersSeq() for a simple skeleton you could use). But I would suggest trying CosineDistanceMeasure instead of Euclidean for text. If you have a sma

Re: Help using mahout for k-means clustering on existing vectors

2012-01-09 Thread Jeff Eastman
The Synthetic Control examples use a similar (but space delimited) input format and there is an InputDriver in integration/ which can convert those files into Mahout Vector sequence files. You could easily modify the InputMapper to be comma delimited or modify your own file formats to use space

Re: Help using mahout for k-means clustering on existing vectors

2012-01-09 Thread Jeff Eastman
Even better, you might figure out how to pass the desired delimiter into the InputDriver as an argument and submit a patch to make that a permanent Mahout feature. It should be straightforward and it would start you down the path to become a committer. On 1/9/12 2:52 PM, Jeff Eastman wrote
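The idea of a configurable delimiter is simple enough to sketch in Python. The real change would be in the Java InputMapper, so `parse_vector` and its `delimiter` parameter are hypothetical names used only for illustration.

```python
def parse_vector(line, delimiter=None):
    """Turn one line of delimited numbers into a list of floats.

    delimiter=None splits on any run of whitespace, which covers the
    space-delimited case; pass "," for comma-delimited input.
    """
    return [float(token) for token in line.split(delimiter)]

print(parse_vector("28.78 34.46 31.33"))
print(parse_vector("28.78,34.46,31.33", delimiter=","))
```

A non-numeric token raises ValueError here, which is the same class of failure the InputMapper reports when the input is not what it expects.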

Re: Synthetic control dataset clustering { Doubt Reg Cardinality }

2012-01-11 Thread Jeff Eastman
It kinda looks to me like you have some inconsistent data in your input data set. Here's the sequence of operations that occur in Synthetic Control: 1. The InputDriver reads your input text files and produces sequence files of VectorWritable. Each vector is produced by the InputMapper after pr

Re: Clustering user profiles

2012-01-12 Thread Jeff Eastman
What you have read is correct. Mahout clustering (unsupervised classification) can only deal with continuous, homogeneous vector representations of the input data, where each vector element is weighted the same as the other elements. Mahout (supervised) classification can deal with continuous,

Re: Clustering user profiles

2012-01-13 Thread Jeff Eastman
Just remember that Longitude is a spherical coordinate and +179 is closer to -179 than their numeric difference. Latitude is spherical too but +89 is indeed quite far from -89. On 1/13/12 4:36 AM, StreetCat wrote: The raw data had location expressed as strings such as "Paris, France" and I tr
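The wrap-around can be handled with a small helper. This is a plain Python sketch working in degrees; the function names are made up for illustration and it is not a full great-circle distance.

```python
def longitude_diff(a, b):
    """Smallest angular difference between two longitudes, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def latitude_diff(a, b):
    """Latitudes do not wrap, so the plain difference is correct."""
    return abs(a - b)

print(longitude_diff(179.0, -179.0))
print(latitude_diff(89.0, -89.0))
```

A distance measure that subtracts raw longitudes would place +179 and -179 about 358 degrees apart instead of 2.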

Re: Running K-Means in memory

2012-01-23 Thread Jeff Eastman
That's probably because you are not performing the clustering (vector classification) step. The clusterer has a method (emitPointToNearestCluster) which supports that to files, but you will have to write your own method to do it all in memory. Suggest you look at the driver's sequential cluster

Re: Help regarding ClusterOutputPostProcessor

2012-01-25 Thread Jeff Eastman
Mean Shift accumulates the pointIds of every point assigned to a cluster, so I would expect n= to be correct in the cluster dumper output. It is most likely the postprocessor is misbehaving. Please create a JIRA and attach your dataset and we will take a look at it. It would also be useful for

Re: Apache Mahout 0.6 Released

2012-02-07 Thread Jeff Eastman
+1 Congratulations to Shannon for a job well done. We now have a 0.6 release and can begin to concentrate on the plan and issues for a 0.7 release. On 2/6/12 2:19 PM, Shannon Quinn wrote: Apache Mahout has reached version 0.6. All developers are encouraged to begin using version 0.6, as much h

Fwd: Re: Goals for Mahout 0.7

2012-02-12 Thread Jeff Eastman
2012 at 8:01 PM, Jeff Eastman wrote: Now that 0.6 is in the box, it seems a good time to start thinking about 0.7, from a high level goal perspective at least. Here are a couple that come to mind: Target code freeze date August 1, 2012 Get Jenkins working for us again Complete clustering

Re: Goals for Mahout 0.7

2012-02-12 Thread Jeff Eastman
dencies on Mahout and on an analyzer to test things out. Another thing would be adding or improving the integration tools. For example adding a mysql2seq to cluster text from a SQL database. On Sat, Feb 11, 2012 at 8:01 PM, Jeff Eastman wrote: Now that 0.6 is in the box, it seems a good time to sta

Re: Goals for Mahout 0.7

2012-02-12 Thread Jeff Eastman
We have a couple JIRAs that relate here: We want to factor all the (-cl) classification steps out of all of the driver classes (MAHOUT-930) and into a separate job to remove duplicated code; MAHOUT-931 is to add a pluggable outlier removal capability to this job; and MAHOUT-933 is aimed at fact

Re: Goals for Mahout 0.7

2012-02-14 Thread Jeff Eastman
+users@ Just to be clear, I'm not advocating replacing the JIRA process with a new set of green-field goals. Rather, IMHO, having a small number of overarching goals for a release *could* help us focus our efforts (triage our feature JIRAs) and *might* suggest some missing JIRAs that would gi

Re: Mahout 0.5 java.lang.IllegalStateException: No clusters found. Check your -c path.

2012-02-16 Thread Jeff Eastman
This is correct, and is actually what the documentation says though it may not be completely clear. The reason for this is somewhat historical: K-Means originally did not have a -k argument and required the user to provide the prior cluster centers in the -c argument, using canopy for example. Then

Re: How to use clusterpp?

2012-02-17 Thread Jeff Eastman
For human-readable output, yes. On 2/17/12 6:09 AM, Tharindu Mathew wrote: Or I can just use the cluster dump tool right...? On Fri, Feb 17, 2012 at 5:55 PM, Paritosh Ranjan wrote: Try logging in and updating. Thanks... On 17-02-2012 17:54, Tharindu Mathew wrote: OffTopic: How would I con

Re: Goals for Mahout 0.7

2012-02-22 Thread Jeff Eastman
of primitives for designing new serial and distributed machine learning algorithms. And I think it has a high utility for integration into highly visible commercial projects. But its high level public API really is a barrier to entry when trying to design commercial applications. On Sun, Feb 12,

Re: 0.7 Priorities

2012-02-22 Thread Jeff Eastman
Sure, just look at Hadoop. But I'm not hung up on the port/starboard numbering scheme either. Either 0.6.1 or 0.7 works for me. On 2/22/12 11:46 AM, Jake Mannix wrote: On Wed, Feb 22, 2012 at 10:40 AM, Dmitriy Lyubimov wrote: I guess i still prefer 0.6.1 for maintenance releases (esp. given the

Re: Hey, new committers!

2012-02-22 Thread Jeff Eastman
Grin. I broke it myself and I'm on it. Jeff On 2/22/12 5:45 PM, Lance Norskog wrote: Please fix the Jenkins build.

Re: Un-observing in a Canopy

2012-03-04 Thread Jeff Eastman
You can cause a canopy to un-observe a vector by observing it with a -1 weight. This will have the effect of subtracting all influence of the vector on observations of that canopy. But you can only do that before you computeParameters, as all observation semantics is reset by that operation and
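A minimal sketch of the idea above, in plain Python rather than Mahout's Java API. The class name RunningStats and its fields s0 and s1 are hypothetical names chosen for illustration; the assumption is only that a cluster keeps weighted running sums, so observing a vector again with weight -1 cancels its earlier contribution.

```python
class RunningStats:
    """Weighted running sums (hypothetical helper, not a Mahout class).

    s0 is the sum of weights, s1 is the weighted sum of the vectors.
    """

    def __init__(self, dim):
        self.s0 = 0.0
        self.s1 = [0.0] * dim

    def observe(self, vector, weight=1.0):
        # Add the vector's weighted contribution to the running sums.
        self.s0 += weight
        for i, x in enumerate(vector):
            self.s1[i] += weight * x

    def centroid(self):
        # Mean of everything observed so far.
        return [x / self.s0 for x in self.s1]


stats = RunningStats(2)
stats.observe([1.0, 2.0])
stats.observe([3.0, 4.0])
# "Un-observe" the second vector by observing it again with weight -1.
stats.observe([3.0, 4.0], weight=-1.0)
print(stats.s0, stats.centroid())
```

After the -1 observation the sums are the same as if only the first vector had ever been seen. This only works while the raw sums are still available, which matches the caveat above about doing it before the parameters are computed.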

Re: Query regarding meanshift clustering

2012-03-08 Thread Jeff Eastman
+1 exactly correct Paritosh. When I wrote that I was using a physical mental model of, perhaps, accretion of mass in a swarm of asteroids or in deep space where stars accrete from irregularities in the interstellar dust cloud. I've never been 100% sure this is truly MeanShift, but it seems to

Re: Not all Mapper/Reducer slots are taken when running K-Means cluster

2012-03-10 Thread Jeff Eastman
What's your Hadoop config in terms of the maximum number of reducers? It's a function of your available RAM on each node and numbers of nodes. On 3/10/12 8:55 PM, WangRamon wrote: > Hi Paritosh, I did the tests with 1 job and 5 jobs, they all have the same > problem, the job I'm running is the

Re: canopy cluster size

2012-03-13 Thread Jeff Eastman
EuclideanDistance is not a great choice for document clustering, especially with a lot of terms. Suggest you try CosineDistance which will give you all distances between 0 and 1. If you still end up with only one canopy it is because T2 is too large. T1 has no effect upon the number of canopies
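To make the T2 point concrete, here is a small sketch in plain Python, not Mahout's CanopyDriver. It assumes non-negative term vectors (so cosine distance stays between 0 and 1) and a simplified canopy rule in which a point starts a new canopy only when it is not within T2 of an existing canopy center. The function names and the toy data are made up for illustration; T1 is left out because it does not affect the count.

```python
import math


def cosine_distance(a, b):
    # 1 - cos(angle); lies in [0, 1] when all components are non-negative.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def count_canopies(points, t2):
    # Simplified rule: a point within T2 of some center creates no new canopy.
    centers = []
    for p in points:
        if not any(cosine_distance(p, c) < t2 for c in centers):
            centers.append(p)
    return len(centers)


# Toy "documents" over three terms (hypothetical data).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]

for t2 in (0.001, 0.5, 1.1):
    print(t2, count_canopies(docs, t2))
```

With these five vectors a very small T2 gives one canopy per point, a moderate T2 groups the near-duplicates, and a T2 above the maximum possible distance collapses everything into a single canopy.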

Re: canopy cluster size

2012-03-13 Thread Jeff Eastman
008. In every case, the reducer quickly passed 67%, then progressed very, very slowly; for example, it takes several minutes to finish 1% more. Is something wrong in my data? Best Baoqiang On Tue, Mar 13, 2012 at 3:08 PM, Jeff Eastman wrote: EuclideanDistance is not a great choice for document

Re: canopy cluster size

2012-03-14 Thread Jeff Eastman
is the inevitable solution for my problem. Ironically, I went to canopy in hope of getting better results out of kmeans. Thanks again. Baoqiang On Tue, Mar 13, 2012 at 5:01 PM, Jeff Eastman wrote: No, Canopy only uses a single reducer, so what's happening is many mappers are munching your

Re: Canopy Job failed processing, Error: Java heap space

2012-03-14 Thread Jeff Eastman
With Canopy this is a symptom of T2 being too small. This causes an explosion of clusters - in the limit, one per input vector - and if vector dimension is large too there is no amount of memory which can hold them all for large datasets. Increase T2 until you get a tractable number of clusters, then

Re: Why there is "Infinity" values for the vector of a K-Means cluster center point?

2012-03-16 Thread Jeff Eastman
Good question. The only way I can think of an infinity in a Kluster center is if there were some infinity values in the vectors it observed. The center (centroid) is calculated in each iteration after all points have been observed by dividing S1 by S0. If, for some reason, S0 was zero this would ca
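A small sketch of the first cause mentioned above, written in plain Python rather than against Mahout's classes. It assumes unweighted observations, so S0 is just the number of points and S1 the component-wise sum; the function name centroid is hypothetical. A single infinite component in any observed vector carries straight through to the center.

```python
import math


def centroid(vectors):
    # S0: number of observed points; S1: component-wise sum of the points.
    s0 = float(len(vectors))
    s1 = [sum(column) for column in zip(*vectors)]
    # The center is S1 divided by S0, component by component.
    return [x / s0 for x in s1]


center = centroid([[1.0, 2.0], [3.0, math.inf]])
print(center)
```

The first component comes out as an ordinary mean, while the second is infinite because one input was. The zero-S0 case is not shown here: Python raises an error on division by a zero float, whereas Java's double division returns Infinity or NaN instead.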

Re: empty vector out of clusterdump

2012-03-20 Thread Jeff Eastman
Empty? Note that printouts of Mahout vectors show only the non-zero elements. It looks like you may have had many such zero vectors and they were clustered into VL-1705919 which has zero for center and radius. If your other clusters look different, then I think this is probably correct.
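As a rough illustration of why an all-zero vector looks "empty" in such output, here is a plain Python sketch. The function format_sparse and its index:value layout are assumptions made for this example and are not claimed to match the exact text that Mahout's cluster dump produces.

```python
def format_sparse(vector):
    # Emit only the non-zero entries as index:value pairs (hypothetical format).
    entries = [f"{i}:{v}" for i, v in enumerate(vector) if v != 0]
    return "{" + ", ".join(entries) + "}"


print(format_sparse([0.0, 1.5, 0.0, 2.0]))
print(format_sparse([0.0, 0.0, 0.0]))
```

The second call prints an empty pair of braces even though the vector has three components, which is the situation described above for a cluster whose center is all zeros.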

Re: is hadoop necessary for clustering in mahout?

2012-03-22 Thread Jeff Eastman
Most of the Mahout clustering algorithms have an -xm sequential CLI option that runs locally in-memory from/to Hadoop-style sequence files. And, as below, you can also call the Java driver methods directly from your program. On 3/22/12 9:22 AM, Ahmed Abdeen Hamed wrote: Hi, I think I can ans

Re: Ask

2012-04-16 Thread Jeff Eastman
Hi Oscar, It would help a lot if you could provide a bit more information on the data that you wish to cluster, particularly the dimensionality of each record and the number of records. Also please note that Mahout's k-means implementation runs in a batch mode on Hadoop so integrating this wit

Re: Ask

2012-04-18 Thread Jeff Eastman
And, finally, if you want to embed Mahout clustering in your web application and the data all fits into memory or you can devise an Iterator to feed data into it, the ClusterIterator.iterate(...) method will do k-means (also fuzzyK and Dirichlet) clustering in memory. On 4/16/12 11:34 AM, Manu
