Thank you Wei and Suneel,
By the way, does anybody know whether Mahout's parallel K-Means uses
Canopy clustering at the beginning to generate the initial K in the K-Means
driver class?
Best regards,
Hiroshi
Date: Mon, 17 Mar 2014 13:05:01 -0700
Subject: Re: Mahout parallel K-Means -
Hi
I am looking for a simple way to convert vectors to a
sequence file from the command line.
For example, I have a data.txt file that contains vectors:
1,1
2,1
1,2
2,2
3,3
8,8
8,9
9,8
9,9
So is there a command-line way to convert that into a sequence file?
I tried mahout seqdirectory but after it hdfs
Thank you, I am going to try it.
Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
You're welcome!
Here's the repository if need be:
https://github.com/kmoulart/hadoop_mahout_utils
Kévin Moulart
2014-03-18 10:00 GMT+01:00 Margusja mar...@roo.ee:
Thank you, I am going to try it.
Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
Canopy and KMeans run independently and do not call each other.
For KMeans, the K value has to be specified when invoking KMeans.
Typically you run Canopy first and then invoke KMeans with the appropriate
K value as inferred from Canopy.
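A minimal sketch of that two-step workflow in plain Python (not Mahout code): a single-pass canopy step estimates K and provides seed centroids, and a basic Lloyd-style K-Means then refines them. The sample points are the ones from the data.txt example earlier in the thread; the thresholds T1=4.0 and T2=3.0, the Euclidean distance, and all function names are assumptions chosen for illustration.

```python
import math

# Sample vectors from the data.txt example in this thread.
POINTS = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3),
          (8, 8), (8, 9), (9, 8), (9, 9)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def canopy_centers(points, t1, t2):
    """Single-pass canopy clustering; returns one center per canopy.

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points this close to a canopy seed cannot seed another canopy).
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed] + [p for p in remaining if euclidean(seed, p) < t1]
        centers.append(mean(members))
        remaining = [p for p in remaining if euclidean(seed, p) >= t2]
    return centers


def kmeans(points, initial_centroids, max_iter=10):
    """Basic Lloyd iteration starting from the given centroids."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Step 1: canopy estimates K and supplies the seeds (assumed thresholds).
seeds = canopy_centers(POINTS, t1=4.0, t2=3.0)
k = len(seeds)
# Step 2: K-Means is invoked separately with that K and those seeds.
centroids, clusters = kmeans(POINTS, seeds)
```

With these assumed thresholds the canopy step yields two canopies on the sample data, so K-Means is started with K = 2.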
On Tuesday, March 18, 2014 4:33 AM, hiroshi leon
Thanks Suneel,
Can someone please explain a little bit about the ClusteringPolicy and the
clusterClassifier?
And what are the benefits of using them with parallel K-Means?
Thank you so much,
Best regards.
Date: Tue, 18 Mar 2014 04:35:14 -0700
From: suneel_mar...@yahoo.com
Subject: Re:
Hi Tharindu,
If I understand correctly seqdirectory creates labels based on the file
name but this is not what you want. What do you want the labels to be?
Cheers,
Frank
On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
tharindurus...@gmail.com wrote:
Hi everyone,
I'm developing an
Hi,
After a year of work, I would like to present PredictionIO project (
https://github.com/PredictionIO) to this community.
When a few of us were doing our PhD studies, Mahout was the de facto Java package
that we used in much of our research work. It is a very powerful algorithm
library, yet we see that
Hi all,
Would it be possible to use hashing vector encoders for text clustering
just like when classifying?
Currently we vectorize using a dictionary where we map each token to a
fixed position in the dictionary. After the clustering we have to
retrieve the dictionary to determine the
Yes. Hashing vector encoders will preserve distances when used with
multiple probes.
Interpretation becomes somewhat difficult, but there is code available to
reverse engineer labels on hashed vectors.
IDF weighting is slightly tricky, but quite doable if you keep a dictionary
of, say, the most
When dealing with Streaming KMeans, it would be helpful for troubleshooting
purposes if you could provide the values for k (# of clusters), km (= k log n)
and n (# of data points).
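For reference, the km estimate mentioned above is simple arithmetic. A small Python sketch, assuming the natural logarithm and rounding up (the message states neither the log base nor the rounding, and the function name is hypothetical):

```python
import math


def estimated_km(k, n):
    """Return k * log(n), rounded up to a whole number of clusters.

    Assumes the natural logarithm; the thread does not state the base.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return math.ceil(k * math.log(n))
```

Under these assumptions, k = 10 and n = 1000 give km = 70.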
Try setting -Xmx to a higher heap size and run the sequential version again.
I had seen OOM errors happen during
Tharindu,
If I understand what you are trying to do:
a) You have a trained Bayes model.
b) You would like to classify new documents using this trained model.
c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).
Option 1:
---
You could write a custom MapReduce
Hi all,
Can someone please answer a quick question about the --samplePoints
parameter in the clusterdump utility? I understand it specifies the
number of points returned per cluster. But are the points per cluster
ordered or ranked in any way before this truncation occurs?
Thanks,
Terry
It's the maximum number of points to include from each cluster in the clusterdump. If
not specified, all points are included.
On Tuesday, March 18, 2014 11:25 PM, Terry Blankers te...@amritanet.com wrote:
Hi all,
Can someone please answer a quick question about the --samplePoints
parameter
+1 to this. We could then use Hamming Distance to compute the distances between
Hashed Vectors.
We have the code for HashedVector.java based on Moses Charikar's SimHash paper.
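A rough sketch of the idea in plain Python, not the HashedVector.java mentioned above: a SimHash-style fingerprint is built from a bag of tokens, and two fingerprints are compared by counting the bits in which they differ. The MD5-based token hash, the 64-bit width, the unweighted tokens, and the function names are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def token_hash(token):
    """Deterministic 64-bit hash of a token (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens):
    """SimHash-style fingerprint: per-bit majority vote over token hashes."""
    votes = [0] * BITS
    for token in tokens:
        h = token_hash(token)
        for bit in range(BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents that share most of their tokens tend to agree on most fingerprint bits, so a small Hamming distance suggests similar content.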
On Tuesday, March 18, 2014 7:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Yes. Hashing vector encoders
How does using multiple probes affect distance preservation, and how would
IDF weighting get tricky just by hashing strings?
Would we be computing distance between hashed strings, or distance between
vectors based on counts of hashed strings?
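One way to picture the second alternative (distance between vectors built from counts of hashed strings) is the following plain-Python sketch. It is not Mahout's encoder: the MD5-based hash, the vector size of 32, adding 1.0 at each probe location, and the function names are all assumptions.

```python
import hashlib
import math


def probe_index(token, probe, size):
    """Slot for a given token and probe number (hypothetical hash scheme)."""
    data = f"{probe}:{token}".encode("utf-8")
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest[:8], "big") % size


def encode(tokens, size=32, probes=2):
    """Fixed-size count vector; each token is added at `probes` slots."""
    vector = [0.0] * size
    for token in tokens:
        for probe in range(probes):
            vector[probe_index(token, probe, size)] += 1.0
    return vector


def euclidean(a, b):
    """Euclidean distance between two hashed vectors of equal size."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With more than one probe, a collision in a single slot distorts only part of a token's contribution, which is the usual motivation for probing several locations.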
On Tue, Mar 18, 2014 at 8:50 PM, Suneel Marthi
Hi, first of all I'm sorry that my previous mail was vague and poorly
formulated.
Yes, Suneel got exactly what I was asking. Both options will address my
requirement.
Thanks a lot.
-Tharindu
On Mar 19, 2014 8:51 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Tharindu,
If I understand what you
Hello
When I run the following command on Mahout-0.9 and Hadoop-1.2.1, I get multiple
errors and I cannot figure out what the problem is. Sorry for the long post.
[hadoop@solaris ~]$ mahout wikipediaDataSetCreator -i wikipedia/chunks -o
wikipediainput -c ~/categories.txt
Running on hadoop,