Re: Unit test failing on 0.8-SNAPSHOT

2013-06-24 Thread Suneel Marthi
No, it's not your environment. I am working on this code and might have caused this. Let me fix this tomorrow. From: Rafa Alfaro To: user@mahout.apache.org Sent: Tuesday, June 25, 2013 2:01 AM Subject: Unit test failing on 0.8-SNAPSHOT Hi, I'm getting the

Unit test failing on 0.8-SNAPSHOT

2013-06-24 Thread Rafa Alfaro
Hi, I'm getting the following error when compiling from the 0.8-SNAPSHOT trunk: SequenceFilesFromMailArchivesTest.testSequential:108->Assert.assertEquals:144->Assert.assertEquals:115 expected: but was: Tests in error: TestSequenceFilesFromDirectory.testSequenceFileFromDirectoryMapReduce:127 »

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sean Owen
On Tue, Jun 25, 2013 at 12:44 AM, Michael Kazekin wrote: > But doesn't alternation guarantee convexity? No, the problem remains non-convex. At each step, where half the parameters are fixed, yes that constrained problem is convex. But each of these is not the same as the overall global problem be
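A short worked note on the convexity point in the message above. The notation is an assumption for illustration (it follows the way the ALS-WR objective is usually written, and is not quoted from the thread): r_ij are the observed ratings over the index set Omega, u_i and m_j are the user and item factor vectors, n_{u_i} and n_{m_j} count the ratings of user i and item j, M_{I_i} holds the columns of M for the items rated by user i, and r_i is the vector of those ratings. The last line is a one-rating scalar case with lambda = 0, chosen only because it makes the non-convexity easy to check by hand.

```latex
% Regularized objective over both factor matrices
\[
f(U, M) = \sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^{\top} m_j \right)^2
  + \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)
\]
% With M fixed, each u_i solves a ridge regression, which is convex and has a closed form
\[
u_i = \left( M_{I_i} M_{I_i}^{\top} + \lambda \, n_{u_i} I \right)^{-1} M_{I_i} \, r_i
\]
% Jointly in (u, m) the problem is not convex: the midpoint of two minimizers is worse than both
\[
g(u, m) = (1 - u m)^2, \qquad g(1, 1) = g(-1, -1) = 0, \qquad
g(0, 0) = 1 > \tfrac{1}{2}\, g(1, 1) + \tfrac{1}{2}\, g(-1, -1)
\]
```

Each half-step is therefore a well-behaved least-squares solve, but which of the many equivalent or near-equivalent minima the alternation ends up near depends on the starting point.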

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Jake Mannix
Ah, thanks Sebastian, that would definitely do it! On Mon, Jun 24, 2013 at 9:23 PM, Sebastian Schelter wrote: > Hi Mark, > > I think I broke this code when I cleaned up LDA recently. Can you see > whether everything works after applying the patch attached to > https://issues.apache.org/jira/bro

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Sebastian Schelter
Hi Mark, I think I broke this code when I cleaned up LDA recently. Can you see whether everything works after applying the patch attached to https://issues.apache.org/jira/browse/MAHOUT-1268 ? Thanks, Sebastian On 24.06.2013 18:57, Mark Wicks wrote: > Thanks for the response. > > The command li

RE: How to Analyse K-mean Clustering output

2013-06-24 Thread Apurv Khare
Hey Ted, This is my Java code:
package pkg;
import java.awt.Point;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io

RE: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Michael Kazekin
Thanks for clarification, Owen! > ALS starts from a random solution and this will result in a different > solution. The overall problem is non-convex and the process will not > necessarily converge to the same solution. But doesn't alternation guarantee convexity? > Randomness is a common feature o

ForestVisualizer OutOfMemoryError?

2013-06-24 Thread Adam Baron
I'm trying to visualize a Random Forest using ForestVisualizer (with the output redirected to a file) and am getting an OutOfMemoryError: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringB

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Koobas
Well, you know, the issue is there, whether we like it or not. Maybe replication is enough, maybe not. If there is a workshop on that issue, it's on the radar. http://beamtenherrschaft.blogspot.com/2013/06/acm-recsys-2013-workshop-on.html On Mon, Jun 24, 2013 at 6:36 PM, Sean Owen wrote: > Yeah

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sean Owen
Yeah this has gone well off-road. ALS is not non-deterministic because of hardware errors or cosmic rays. It's also nothing to do with floating-point round-off, or certainly, that is not the primary source of non-determinism to several orders of magnitude. ALS starts from a random solution and th

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Ted Dunning
This is a common chestnut that gets trotted out commonly, but I doubt that the effects that the OP was worried about were on the same scale. Non-commutativity of FP arithmetic on doubles rarely has a very large effect. On Mon, Jun 24, 2013 at 11:17 PM, Michael Kazekin wrote: > Any algorithm is

RE: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Michael Kazekin
Any algorithm is non-deterministic because of the non-deterministic behavior of the underlying hardware, of course :) But that's off-topic. I'm talking about a specific implementation of a specific algorithm, and in general I'd like to know that at least some very general properties of the algorithm impleme

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Koobas
On Mon, Jun 24, 2013 at 5:43 PM, Dmitriy Lyubimov wrote: > The point of non-determinism of parallel processing is well known. It was a > joke to remind to be careful with absolute statements like "never exists", > as they are very hard to prove. Bringing more positive examples still does > not pr

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Dmitriy Lyubimov
The non-determinism of parallel processing is well known. It was a joke, meant as a reminder to be careful with absolute statements like "never exists", as they are very hard to prove. Bringing more positive examples still does not prove an absolute statement, or make it any stronger from the ma

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Koobas
On Mon, Jun 24, 2013 at 5:07 PM, Dmitriy Lyubimov wrote: > On Mon, Jun 24, 2013 at 1:35 PM, Michael Kazekin >wrote: > > > I agree with you, I should have mentioned earlier that it would be good > to > > separate "noise from data" and deal with only what is separable. Of > course > > there is no

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Dmitriy Lyubimov
On Mon, Jun 24, 2013 at 1:35 PM, Michael Kazekin wrote: > I agree with you, I should have mentioned earlier that it would be good to > separate "noise from data" and deal with only what is separable. Of course > there is no truly deterministic implementation of any algorithm, I am pretty sure "2

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Dmitriy Lyubimov
On Mon, Jun 24, 2013 at 1:35 PM, Michael Kazekin wrote: > I agree with you, I should have mentioned earlier that it would be good to > separate "noise from data" and deal with only what is separable. Of course > there is no truly deterministic implementation of any algorithm, but I > would expect

RE: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Michael Kazekin
I agree with you, I should have mentioned earlier that it would be good to separate "noise from data" and deal with only what is separable. Of course there is no truly deterministic implementation of any algorithm, but I would expect to see "credible" results on a macro-level (in our case it wou

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Dmitriy Lyubimov
On Mon, Jun 24, 2013 at 1:07 PM, Michael Kazekin wrote: > Thank you, Ted! > Any feedback on the usefulness of such functionality? Could it increase > the 'playability' of the recommender? > Almost all methods -- even deterministic ones -- will have a "credible interval" of prediction simply becau

RE: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Michael Kazekin
Thank you, Ted! Any feedback on the usefulness of such functionality? Could it increase the 'playability' of the recommender? > From: ted.dunn...@gmail.com > Date: Mon, 24 Jun 2013 20:46:43 +0100 > Subject: Re: Consistent repeatable results for distributed ALS-WR recommender > To: user@mahout.ap

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Koobas
I am guessing (comments welcome) that it is going to be difficult to guarantee reproducibility under parallel execution conditions. MapReduce has reduction in its name. Reduction operations are the main cause of irreproducibility in parallel codes, because changing the order of summations changes t
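The summation-order effect described above can be seen in a few lines. This is a minimal sketch in plain Java, not code from Mahout or from the thread; the class name `SummationOrder` and the sample values are made up for illustration. It adds the same three doubles in two different orders, which is what can happen when a reduction combines partial results in a different sequence from one run to the next.

```java
class SummationOrder {
    // Adds the values strictly left to right, in the order given.
    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Same three numbers, two different orders (hypothetical sample data).
        double[] forward = {1e16, 1.0, -1e16};
        double[] reordered = {1e16, -1e16, 1.0};

        // In the first order the 1.0 is absorbed by rounding when added to 1e16.
        System.out.println("forward   = " + sum(forward));
        System.out.println("reordered = " + sum(reordered));
    }
}
```

The values here are deliberately extreme so that the difference is visible; with typical data the discrepancy usually sits in the last few bits, which fits the remark elsewhere in this thread that the effect is rarely large.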

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Ted Dunning
See org.apache.mahout.common.RandomUtils#useTestSeed It provides the ability to freeze the initial seed. Normally this is only used during testing, but you could use it. On Mon, Jun 24, 2013 at 8:44 PM, Michael Kazekin wrote: > Thanks a lot! > Do you know by any chance what are the underlying

RE: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Michael Kazekin
Thanks a lot! Do you know by any chance what the underlying reasons are for including such mandatory random seed initialization? Do you see any sense in providing another option, such as filling them with zeroes in order to ensure the consistency and repeatability? (for example we might want to

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Jake Mannix
Ok, I'll have to look into this, because the last time I modified CVB0Driver etc, I specifically tested it against cluster_reuters.sh, and it did exactly what was expected (after having been broken for a while, at least the doc-topic output part). On Mon, Jun 24, 2013 at 10:05 AM, Suneel Marthi w

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Suneel Marthi
Here is an example of what's happening: https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/522/console This was working fine. From: Suneel Marthi To: "user@mahout.apache.org" Sent: Monday, June 24, 2013 1:05 PM Subject: Re: Interpretating doc-

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sebastian Schelter
The matrices of the factorization are initialized randomly. If you fix the random seed (would require modification of the code) you should get exactly the same results. On 24.06.2013 13:49, "Michael Kazekin" wrote: > Hi! > Should I assume that under same dataset and same parameters for factorizer
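To illustrate what fixing the seed buys you, here is a minimal sketch using only `java.util.Random`. It is not Mahout's factorizer code; the class `SeededInit`, the method `randomFactors`, and the matrix sizes are hypothetical names chosen for this example. It shows that a factor matrix filled from a generator with a fixed seed is identical on every run, while a different seed gives a different starting point.

```java
import java.util.Arrays;
import java.util.Random;

class SeededInit {
    // Fills a rows x cols factor matrix with pseudo-random values drawn
    // from a generator created with the given seed.
    static double[][] randomFactors(int rows, int cols, long seed) {
        Random rng = new Random(seed);
        double[][] factors = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                factors[i][j] = rng.nextDouble();
            }
        }
        return factors;
    }

    public static void main(String[] args) {
        double[][] first = randomFactors(3, 2, 42L);
        double[][] second = randomFactors(3, 2, 42L);
        double[][] other = randomFactors(3, 2, 43L);

        // Same seed: the two starting points match element for element.
        System.out.println("same seed equal:      " + Arrays.deepEquals(first, second));
        // Different seed: the starting point changes.
        System.out.println("different seed equal: " + Arrays.deepEquals(first, other));
    }
}
```

In a real job the seed would have to reach the initialization code itself, which is the code modification mentioned in the message above.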

Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Michael Kazekin
Hi! Should I assume that with the same dataset and the same parameters for the factorizer and recommender I will get the same results for any specific user? My current understanding is that theoretically the ALS-WR algorithm could guarantee this, but I was wondering whether there could be any numeric method issues and/or

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Suneel Marthi
I have been seeing a similar error (running against trunk) when running cluster_reuters.sh (with CVB clustering). It complains that "Output directory: /reuters-lda already exists". This was working fine a few weeks ago, not sure which of the intermediate code changes against CVB0Driver, or C

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Mark Wicks
Thanks for the response. The command line I used is mahout cvb -ow -dict sparse/dictionary.file-0 -i matrix/matrix -o cvb/topics -dt cvb/classifications -block 2 -x 2 -cd 1e-10 -k2 -seed 6956 -tf 0.25 This completes with no errors in Mahout 0.7. With Mahout/cvb from trunk I get: 13/06/24 12:

Re: database support for clustering

2013-06-24 Thread Ted Dunning
Better would be to build a Hive UDF that vectorizes your data directly from the Hive table and produces a sequence file with vectors ready to cluster. Then use the streaming k-means stuff. On Mon, Jun 24, 2013 at 4:43 PM, Chirag Lakhani wrote: > What data base interfaces are there for Mahout?

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Jake Mannix
What do you get out, and what exactly is your commandline invocation? On Mon, Jun 24, 2013 at 6:58 AM, Mark Wicks wrote: > As a slight correction to my earlier post on running cvb from the > trunk, the Nan values were my mistake. However, I still haven't had > any success getting it to write d

database support for clustering

2013-06-24 Thread Chirag Lakhani
What database interfaces are there for Mahout? The website mentions MongoDB and Cassandra. I get the feeling these are for recommender systems only. Are there any databases that Mahout can interface with directly in order to perform clustering? I am thinking of an example where I have a large table

Re: Interpretating doc-topic output of cvb

2013-06-24 Thread Mark Wicks
As a slight correction to my earlier post on running cvb from the trunk, the NaN values were my mistake. However, I still haven't had any success getting it to write document/topic inferences. On Sat, Jun 22, 2013 at 7:21 AM, Mark Wicks wrote: > I tried with cvb from trunk and ran into several p

Re: How to Analyse K-mean Clustering output

2013-06-24 Thread Ted Dunning
What code? On Mon, Jun 24, 2013 at 8:00 AM, Apurv Khare wrote: > Hi, > > I am using clustering for one of my POC. > > My data looks like : > > Id > > Gender > > Education > > Occupation > > Income > > Age > > State > > Marital Status

How to Analyse K-mean Clustering output

2013-06-24 Thread Apurv Khare
Hi, I am using clustering for one of my POC. My data looks like :

Id | Gender | Education | Occupation | Income | Age | State | Marital Status | children | Duration of Relationship
1  | 1      | 19        | 3          | 1      | 20  | 1     | 3              | 1        | 2
2  | 1      | 16        | 15         | 1      | 40  | 7     | 2              | 3        | 2

But for the Clustering I'm excluding the ID field, as it