Re: How to improve clustering?

2013-03-28 Thread Dan Filimon
Sebastian, if you're interested I'd be glad to walk you through the main ideas, point you to the code and tell you how to run it. Testing it on more data would be very helpful to the project. But it makes hard cluster assignments. On Mar 28, 2013, at 2:23, Ted Dunning ted.dunn...@gmail.com

Re: How to improve clustering?

2013-03-28 Thread Ted Dunning
It makes hard cluster assignments, but that would be helpful in two ways: a) it will help you diagnose data issues, and b) it can produce good starting points for fuzzy k-means. On Thu, Mar 28, 2013 at 7:19 AM, Dan Filimon dangeorge.fili...@gmail.com wrote: Sebastian, if you're interested I'd be glad
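To make point b) concrete, here is a minimal sketch of how hard assignments could seed fuzzy k-means: average each hard cluster into a centroid, then compute soft memberships with the standard fuzzy c-means formula. This is not Mahout code; the function names, the fuzziness parameter `m=2.0` and the toy points are all assumptions made for illustration.

```python
import math

def centroids_from_hard_assignments(points, labels, k):
    """Average the points of each hard cluster to get one centroid per cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, cluster in zip(points, labels):
        counts[cluster] += 1
        for j, value in enumerate(point):
            sums[cluster][j] += value
    # Assumes every cluster received at least one point.
    return [[s / counts[c] for s in sums[c]] for c in range(k)]

def fuzzy_memberships(point, centroids, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster (m > 1)."""
    dists = [math.dist(point, c) for c in centroids]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The point sits exactly on one or more centroids: share membership among them.
        n = sum(zeros)
        return [1.0 / n if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Toy data: two well-separated hard clusters (hypothetical values).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = centroids_from_hard_assignments(points, labels, k=2)
memberships = fuzzy_memberships((2.0, 1.0), centroids)
```

The memberships of a point always sum to 1, and a point nearer to one centroid gets most of its weight there, which is why centroids from a hard clustering are a reasonable initial state for the fuzzy iteration.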

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

2013-03-28 Thread Ted Dunning
I will have to think on this a bit. It should be possible to dump the sketches coming from each mapper and look at them for compatibility. Are the mappers seeing only docs from a single news group? That might produce some interesting and odd results. What happens with the sequential version

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Briesemeister
Thank you. Splitting the files leads to multiple MR-tasks! Only changing the MR settings of Hadoop did not help. In the future it would be nice if the drivers would scale themselves and split the data according to the dataset size and the number of available MR-slots. Cheers Sebastian On

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sean Owen
This is really a Hadoop-level thing. I am not sure I have ever successfully induced M/R to run multiple mappers on less than one block of data, even with a low max split size. Reducers you can control. On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister
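A rough way to see why splitting the input into several files changed the task count: under the simplifying assumption that every non-empty file yields at least one map task and larger files yield about one task per block, the estimate looks like the sketch below. The function name and the 64 MB block size are assumptions for illustration; real split computation depends on the input format and its settings.

```python
import math

def estimate_map_tasks(file_sizes_bytes, block_size_bytes=64 * 1024 * 1024):
    """Rough map-task count: one per non-empty file, about one per block for big files."""
    return sum(
        max(1, math.ceil(size / block_size_bytes))
        for size in file_sizes_bytes
        if size > 0
    )

MB = 1024 * 1024
one_file = estimate_map_tasks([40 * MB])        # a single file smaller than one block
four_files = estimate_map_tasks([10 * MB] * 4)  # the same data split into four files
```

With these numbers the single file is estimated at one task and the four small files at four, which matches the behaviour described in this thread.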

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Ted Dunning
This is a longstanding Hadoop issue. Your suggestion is interesting, but only a few cases would benefit. The problem is that splitting involves reading from a very small number of nodes and thus is not much better than just running the program with few mappers. If the data is large enough to

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Schelter
It would also be very hard to do automatically, as clusters are shared and a framework cannot know how much of the shared resources (available map slots) it can take. On 28.03.2013 10:07, Sean Owen wrote: This is really a Hadoop-level thing. I am not sure I have ever successfully induced M/R to

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Schelter
Sebastian, For CPU-bound problems like matrix factorization with ALS, we have recently seen good results with multithreaded mappers, where we had the users specify the number of cores to use per mapper. On 28.03.2013 10:20, Ted Dunning wrote: This is a longstanding Hadoop issue. Your

mahout_structure and FPGrowth

2013-03-28 Thread vsaxena
Hello, I am new to Mahout and would like information about the Mahout project structure (what each directory contains and how I can use it). Basically, I am interested in frequent item mining. Besides this, I have executed the command #mahout fpg -i accidents.dat -o patterns -k

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Briesemeister
In my case, each map process requires a lot of memory and I would like to distribute this consumption across multiple nodes. However, I still get out-of-memory exceptions even if I split the input file into several very small input files. I thought the mapper would consider only one file at a time

RE: classifier for non-linear relationships

2013-03-28 Thread Michael Michael
Thanks Ted. I contacted them to find out about pricing, but I am sure it will be expensive. It seems that, since there are no open source solutions for this, my best option is either MATLAB or to purchase something from a company like neurosolutions or skytree (there are a few others that fit

Regarding ItemBased Recommendation Results

2013-03-28 Thread ch raju
Hi all, I am working on mahout-0.7 recommendations and ran the following command from the command line: ./bin/mahout recommenditembased --input UserData.csv --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION --numRecommendations 10 I got the recommendations for every user. I deployed

Re: Regarding ItemBased Recommendation Results

2013-03-28 Thread Sebastian Schelter
The Hadoop-based implementation samples down users with more than 1000 interactions by default; that could be the reason for the differences that you are seeing. On 28.03.2013 15:09, ch raju wrote: Hi all, I am working on mahout-0.7 recommendations and ran the following command from the command
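As an illustration of what "sampling down" means here, the sketch below keeps at most a fixed number of interactions per user by drawing a random subset. It is not Mahout's implementation: the function name, the data layout and the use of uniform random sampling are assumptions; only the threshold of 1000 comes from the message above.

```python
import random

def sample_down(interactions_by_user, max_per_user=1000, seed=None):
    """Keep at most max_per_user interactions for each user, chosen at random."""
    rng = random.Random(seed)
    sampled = {}
    for user, items in interactions_by_user.items():
        items = list(items)
        if len(items) > max_per_user:
            # Heavy users lose some interactions, so results can differ
            # from a run that uses all of the data.
            sampled[user] = rng.sample(items, max_per_user)
        else:
            sampled[user] = items
    return sampled

# Hypothetical data: one heavy user and one light user.
data = {"heavy": list(range(1500)), "light": [1, 2, 3]}
result = sample_down(data, max_per_user=1000, seed=42)
```

Because a heavy user's similarity computations then see only part of that user's history, two implementations run on the same input can return different recommendations.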

Re: Regarding ItemBased Recommendation Results

2013-03-28 Thread Koobas
Are the suggestions completely different, or somewhat different? What about the neighborhoods? On Thu, Mar 28, 2013 at 10:09 AM, ch raju ch.raju...@gmail.com wrote: Hi all, I am working on mahout-0.7 recommendations and ran the following command from the command line: ./bin/mahout

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Dan Filimon
From what I've seen, even if the mapper does throw an out-of-memory exception, Hadoop will restart it, increasing the memory. There are ways to configure the mapper/reducer JVMs to use more memory by default through the Configuration, although I don't recall the exact options. It's probably

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Briesemeister
I tried to increase the heap space, but it wasn't enough. It seems the problem is not the number of mappers. I will start another thread for this problem with some more details. Cheers Sebastian On 28.03.2013 16:41, Dan Filimon wrote: From what I've seen, even if the mapper does throw an

Fuzzy Clustering accumulates lots of memory

2013-03-28 Thread Sebastian Briesemeister
Dear all, I have a large dataset consisting of ~50,000 documents with a dimensionality of 90,000. I split the created input vectors into smaller files to run a single mapper task on each of the files. However, even with very small files containing only 50 documents, I run into heap space problems. I
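One quantity that does not shrink with the input file size is whatever each mapper holds per cluster. The back-of-the-envelope sketch below estimates the memory for cluster centroids, assuming they are stored as dense arrays of 8-byte doubles. Both that storage assumption and the cluster count of 1000 are hypothetical; the message does not say how centroids are stored or how many clusters were requested, so this is only one possible factor.

```python
def dense_centroid_bytes(num_clusters, dimensions, bytes_per_value=8):
    """Bytes needed for num_clusters dense centroids of the given dimensionality."""
    return num_clusters * dimensions * bytes_per_value

DIMENSIONS = 90_000  # dimensionality stated in the message
per_centroid = dense_centroid_bytes(1, DIMENSIONS)         # 720,000 bytes per centroid
for_1000_clusters = dense_centroid_bytes(1000, DIMENSIONS)  # hypothetical cluster count
for_1000_clusters_mib = for_1000_clusters / (1024 * 1024)
```

Under these assumptions the centroids alone cost the same amount in every mapper, regardless of whether the mapper reads 50 documents or 50,000.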

Re: Fuzzy Clustering accumulates lots of memory

2013-03-28 Thread Chris Harrington
Don't know if this will help with your heap issues (or if you've already tried it), but increasing mapred.child.java.opts in mapred-site.xml resolved some heap issues I was having. I was clustering 67000 small text docs into ~180 clusters and was seeing mapper heap issues until I made

Re: Fuzzy Clustering accumulates lots of memory

2013-03-28 Thread Sebastian Briesemeister
I tried increasing the child heap size, but, as I mentioned, even 4 GB wasn't enough. I am also not sure whether the block size has some influence on the memory, but I assume this is not the case since such a design would be really bad. Any other ideas? On 28.03.2013 17:40, Chris

Re: Regarding ItemBased Recommendation Results

2013-03-28 Thread ch raju
Yes, the recommendations are completely different: out of 10, only one suggestion matched. Which neighborhoods are you asking about? I am new to this and didn't understand. Thanks and regards, Raju On Thu, Mar 28, 2013 at 8:25 PM, Koobas koo...@gmail.com wrote: Are the suggestions completely

Re: classifier for non-linear relationships

2013-03-28 Thread Ray
Why does Mahout not have what this person wants? Is it not really in the scope of Mahout? Used to be in Mahout, gone now? Needed someone to put it there in the first place? Ray. On 03/28/2013 07:02 AM, Michael Michael wrote: Thanks Ted. I contacted them to find out about pricing, but I