RE: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread hiroshi leon
Thank you Wei and Suneel. By the way, does anybody know whether Mahout's parallel K-Means uses Canopy clustering at the beginning to generate the initial K in the K-Means driver class? Best regards, Hiroshi

Command line vector to sequence file

2014-03-18 Thread Margusja
Hi, I am looking for a simple way to convert vectors to a sequence file from the command line. For example, I have a data.txt file that contains vectors: 1,1 2,1 1,2 2,2 3,3 8,8 8,9 9,8 9,9 Is there a command-line way to convert that into a sequence file? I tried mahout seqdirectory but after it hdfs

Re: Command line vector to sequence file

2014-03-18 Thread Margusja
Thank you, I am going to try it. Best regards, Margus (Margusja) Roo

Re: Command line vector to sequence file

2014-03-18 Thread Kevin Moulart
You're welcome! Here's the repository if need be: https://github.com/kmoulart/hadoop_mahout_utils Kévin Moulart

Re: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread Suneel Marthi
Canopy and KMeans run independently and do not call each other. For KMeans, the K value has to be specified when invoking KMeans. Typically you run Canopy first and then invoke KMeans with the appropriate K value as inferred from Canopy.
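The workflow described above (run Canopy first, then hand the inferred K to KMeans) can be illustrated with a small self-contained sketch. This is plain Python and not Mahout's Canopy or KMeans driver code; the function name `canopies`, the thresholds `t1` and `t2`, and the sample points are all assumptions made for illustration.

```python
import math

def canopies(points, t1, t2):
    """Simplified canopy pass (illustrative only, not Mahout's implementation).

    t1 is the loose threshold and t2 the tight one, with 0 <= t2 < t1.
    Returns a list of (center, members) pairs.
    """
    if not 0 <= t2 < t1:
        raise ValueError("expected 0 <= t2 < t1")
    remaining = list(points)
    result = []
    while remaining:
        # Assumption: take the first remaining point as the next center,
        # which keeps the sketch deterministic.
        center = remaining[0]
        members = [p for p in remaining if math.dist(center, p) <= t1]
        result.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return result

# Hypothetical sample data with two visually obvious groups.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3),
          (8, 8), (8, 9), (9, 8), (9, 9)]
found = canopies(points, t1=6.0, t2=4.0)
k = len(found)  # this is the value one would then pass to KMeans
print(f"inferred k = {k}")
```

The number of canopies depends heavily on the two thresholds, so in practice the inferred K is only as good as the chosen distances.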

RE: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread hiroshi leon
Thanks Suneel. Can someone please explain a little about the ClusteringPolicy and the clusterClassifier, and what the benefits are of using them with parallel K-Means? Thank you so much. Best regards.

Re: Naive Bayes classification

2014-03-18 Thread Frank Scholten
Hi Tharindu, If I understand correctly, seqdirectory creates labels based on the file name, but this is not what you want. What do you want the labels to be? Cheers, Frank

Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-18 Thread Simon Chan
Hi, After a year of work, I would like to present the PredictionIO project (https://github.com/PredictionIO) to this community. When a few of us were doing our PhD studies, Mahout was the de facto Java package that we used in much of our research work. It is a very powerful algorithm library, yet we see that

Text clustering with hashing vector encoders

2014-03-18 Thread Frank Scholten
Hi all, Would it be possible to use hashing vector encoders for text clustering, just as when classifying? Currently we vectorize using a dictionary where we map each token to a fixed position in the dictionary. After the clustering we have to retrieve the dictionary to determine the

Re: Text clustering with hashing vector encoders

2014-03-18 Thread Ted Dunning
Yes. Hashing vector encoders will preserve distances when used with multiple probes. Interpretation becomes somewhat difficult, but there is code available to reverse engineer labels on hashed vectors. IDF weighting is slightly tricky, but quite doable if you keep a dictionary of, say, the most
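As a rough illustration of what hashing with multiple probes means, the sketch below maps each token to several positions in a fixed-size vector instead of looking it up in a dictionary. It is not Mahout's encoder code; the function name `hashed_vector`, the vector size, the probe count, and the use of MD5 as a reproducible hash are all assumptions.

```python
import hashlib

def hashed_vector(tokens, size=32, probes=2):
    """Encode tokens into a fixed-size vector without a dictionary.

    Each token is hashed `probes` times (with a different suffix per probe)
    and adds weight 1.0 at every resulting index. Illustrative only.
    """
    vec = [0.0] * size
    for token in tokens:
        for probe in range(probes):
            key = f"{token}#{probe}".encode("utf-8")
            digest = hashlib.md5(key).digest()
            index = int.from_bytes(digest[:8], "big") % size
            vec[index] += 1.0
    return vec

doc_vector = hashed_vector("text clustering with hashing".split())
print(doc_vector)
```

With more than one probe, a collision at one position is unlikely to be repeated at the others, which is the intuition behind distances surviving the hashing step.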

Re: reduce is too slow in StreamingKmeans

2014-03-18 Thread Suneel Marthi
When dealing with Streaming KMeans, it would be helpful for troubleshooting purposes if you could provide the values for k (# of clusters), km (= k log n) and n (# of datapoints). Try setting -Xmx to a higher heap size and run the sequential version again. I had seen OOM errors happen during
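For reference, the km = k log n relation mentioned above can be computed as follows. The message does not state the base of the logarithm, so the natural log used here is an assumption, as are the function name `estimate_km` and the rounding up.

```python
import math

def estimate_km(k, n):
    """Return k * log(n), rounded up.

    Assumption: natural logarithm; the original message does not give the base.
    """
    if k < 1 or n < 2:
        raise ValueError("expected k >= 1 and n >= 2")
    return math.ceil(k * math.log(n))

# Hypothetical values: 10 clusters over one million datapoints.
print(estimate_km(10, 1_000_000))
```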

Re: Naive Bayes classification

2014-03-18 Thread Suneel Marthi
Tharindu, If I understand what you are trying to do: a) You have a trained Bayes model. b) You would like to classify new documents using this trained model. c) You were trying to use TestNaiveBayesDriver to classify the documents in (b). Option 1: You could write a custom MapReduce

clusterdump samplePoints parameter

2014-03-18 Thread Terry Blankers
Hi all, Can someone please answer a quick question about the --samplePoints parameter in the clusterdump utility? I understand it specifies the number of points returned per cluster. But are the points per cluster ordered or ranked in any way before this truncation occurs? Thanks, Terry

Re: clusterdump samplePoints parameter

2014-03-18 Thread Suneel Marthi
It's the maximum number of points to include from each cluster in the clusterdump. If not specified, all points are included.

Re: Text clustering with hashing vector encoders

2014-03-18 Thread Suneel Marthi
+1 to this. We could then use Hamming distance to compute the distances between hashed vectors. We have the code for HashedVector.java based on Moses Charikar's SimHash paper.
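A minimal sketch of the idea, assuming a 64-bit SimHash-style fingerprint compared by Hamming distance. This is not the HashedVector.java code mentioned above; the function names and the choice of MD5 as the per-token hash are assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width (an assumption for this sketch)

def simhash(tokens):
    """Build a SimHash-style fingerprint: each token votes +1 or -1 per bit."""
    counts = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc_a = simhash("hashing vector encoders for text clustering".split())
doc_b = simhash("hashing vector encoders for text classification".split())
print(hamming_distance(doc_a, doc_b))
```

Documents that share most of their tokens tend to agree on most fingerprint bits, so a small Hamming distance suggests similar content.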

Re: Text clustering with hashing vector encoders

2014-03-18 Thread Andrew Musselman
How does using multiple probes affect distance preservation, and how would IDF weighting get tricky just by hashing strings? Would we be computing distance between hashed strings, or distance between vectors based on counts of hashed strings?

Re: Naive Bayes classification

2014-03-18 Thread Tharindu Rusira
Hi, first of all I'm sorry that my previous mail was vague and poorly formulated. Yes, Suneel got exactly what I was asking. Both options will address my requirement. Thanks a lot. -Tharindu

Multiple errors and messages

2014-03-18 Thread Mahmood Naderan
Hello, When I run the following command on Mahout-0.9 and Hadoop-1.2.1, I get multiple errors and I cannot figure out what the problem is. Sorry for the long post. [hadoop@solaris ~]$ mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c ~/categories.txt Running on hadoop,