how to work with cluster data generated from the reuters example script

2013-09-20 Thread Jens Bonerz
Hello all, I have trouble figuring out how to access the cluster data that are generated in the reuters example script. I am specifically interested in a plaintext export of the clustered data in a CSV-like format for further processing (key:value, distances etc.). The script already creates the "c

What are the best settings for my clustering problem?

2013-09-30 Thread Jens Bonerz
Hello all, I am currently trying to create clusters from a group of 50,000 strings that contain product descriptions (around 70-100 characters each). That group of 50,000 consists of roughly 5,000 individual products and ten varying product descriptions per product. The product descriptions a

Re: What are the best settings for my clustering task

2013-10-02 Thread Jens Bonerz
Isn't the streaming k-means just a different approach to crunch through the data? In other words, the result of streaming k-means should be comparable to using k-means in multiple chained map reduce cycles? I just read a paper about the k-means clustering and its underlying algorithm. According t

Re: What are the best settings for my clustering task

2013-10-02 Thread Jens Bonerz
> ordan/papers/kulis-jordan-icml12.pdf
>
> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz wrote:
>
> > Isn't the streaming k-means just a different approach to crunch through the
> > data? In other words, the result of streaming k-means should be

Re: What are the best settings for my clustering task

2013-10-02 Thread Jens Bonerz
> have to process these centroids to produce
> the desired 5,000 clusters. Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz wrote:
>
> > thx for your el

Re: What are the best settings for my clustering task

2013-10-03 Thread Jens Bonerz
> stering will have to process these centroids to produce
> the desired 5,000 clusters. Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz wrote:
>
> > th

Re: What are the best settings for my clustering task

2013-10-06 Thread Jens Bonerz
> I don't have command line specifics handy, but you seem to have done very
> well already at figuring out the details.
>
> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>
> > I created a series of scripts to try out streamingkmeans in mahout an
> > incre

Re: What are the best settings for my clustering task

2013-10-06 Thread Jens Bonerz
> eans.
>
> You up for trying to make a patch?
>
> On Oct 6, 2013, at 12:37, Jens Bonerz wrote:
>
> > Hmmm.. has ballkmeans made it already into the 0.8 release? can't find it
> > in the list of available programs when calling the mahout bin

Re: Naive bayes and character n-grams

2013-10-09 Thread Jens Bonerz
Hi Dean, I might be wrong, but try googling for "shingling"... could be something to start with. Cheers, Jens
2013/10/9 Ted Dunning:
> Yes. Should work to use character n-grams. There are oddities in the
> stats because the different n-grams are not independent, but Naive Bayes
> methods are
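
For readers unfamiliar with the term, "shingling" usually means breaking a string into overlapping character n-grams. The following is only a minimal sketch of that idea in plain Python, not Mahout code and not anything posted in the thread; the function names `char_ngrams` and `ngram_counts`, the default `n=3`, and the lowercasing and whitespace normalization are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams (shingles) of text.

    Assumption for this sketch: the text is lowercased and runs of
    whitespace are collapsed to a single space before shingling.
    """
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def ngram_counts(text, n=3):
    """Count how often each character n-gram occurs in text."""
    return Counter(char_ngrams(text, n))


if __name__ == "__main__":
    print(char_ngrams("Mahout", n=3))  # ['mah', 'aho', 'hou', 'out']
```

The resulting counts could then serve as features for a classifier such as Naive Bayes. As the quoted reply notes, overlapping n-grams are not independent of each other, so the usual independence assumption only holds approximately.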

Re: Clustering of text data on external categories

2013-10-11 Thread Jens Bonerz
What a nice idea :-) I really like that approach.
2013/10/11 Ted Dunning:
> You don't need Mahout for this.
>
> A very easy way to do this is to gather all the words for each category
> into a document. Thus:
>
> CatA:selling buying sales payment
> CatB:gathering collecting
> CatC:information data