No, it's not your environment. I am working on this code and might have caused
this.
Let me fix this tomorrow.
From: Rafa Alfaro
To: user@mahout.apache.org
Sent: Tuesday, June 25, 2013 2:01 AM
Subject: Unit test failing on 0.8-SNAPSHOT
Hi,
I'm getting the following error when compiling from the 0.8-SNAPSHOT trunk:
SequenceFilesFromMailArchivesTest.testSequential:108->Assert.assertEquals:144->Assert.assertEquals:115
expected: but
was:
Tests in error:
TestSequenceFilesFromDirectory.testSequenceFileFromDirectoryMapReduce:127
»
On Tue, Jun 25, 2013 at 12:44 AM, Michael Kazekin wrote:
> But doesn't alternation guarantee convexity?
No, the problem remains non-convex. At each step, where half the
parameters are fixed, yes that constrained problem is convex. But each
of these is not the same as the overall global problem be
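
A toy illustration of that point, offered as a sketch and not as Mahout's ALS code: take the 1x1 "factorization" loss (x*y - 1)^2. With y fixed it is a convex quadratic in x (and vice versa), yet alternating exact solves from two different starting values end at two different zero-loss solutions, and the midpoint between them has a strictly larger loss, which could not happen if the joint problem were convex. The class name, the loss and the starting values are all assumptions chosen for the example.

```java
// Toy illustration only: scalar "factorization" x * y ~ 1, not Mahout's ALS implementation.
final class AlternatingToy {
    // Squared reconstruction error of the scalar factorization.
    static double loss(double x, double y) {
        double r = x * y - 1.0;
        return r * r;
    }

    // Alternating minimization: each half-step is an exact 1-D least-squares solve,
    // i.e. a convex sub-problem. Assumes the starting y is non-zero.
    static double[] alternate(double y0, int sweeps) {
        double x = 0.0;
        double y = y0;
        for (int i = 0; i < sweeps; i++) {
            x = 1.0 / y; // best x with y held fixed
            y = 1.0 / x; // best y with x held fixed
        }
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        double[] a = alternate(2.0, 3);  // one starting point
        double[] b = alternate(-1.0, 3); // a different starting point
        System.out.println("a = (" + a[0] + ", " + a[1] + "), loss = " + loss(a[0], a[1]));
        System.out.println("b = (" + b[0] + ", " + b[1] + "), loss = " + loss(b[0], b[1]));

        // Both a and b have zero loss, but their midpoint does not:
        // a convex function could not behave this way.
        double mx = (a[0] + b[0]) / 2.0;
        double my = (a[1] + b[1]) / 2.0;
        System.out.println("midpoint loss = " + loss(mx, my));
    }
}
```

Here the two runs stop at (0.5, 2.0) and (-1.0, -1.0): equally good fits, but different factors, which is the kind of run-to-run difference being discussed.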
Ah, thanks Sebastian, that would definitely do it!
On Mon, Jun 24, 2013 at 9:23 PM, Sebastian Schelter wrote:
> Hi Mark,
>
> I think I broke this code when I cleaned up LDA recently. Can you see
> whether everything works after applying the patch attached to
> https://issues.apache.org/jira/bro
Hi Mark,
I think I broke this code when I cleaned up LDA recently. Can you see
whether everything works after applying the patch attached to
https://issues.apache.org/jira/browse/MAHOUT-1268 ?
Thanks,
Sebastian
On 24.06.2013 18:57, Mark Wicks wrote:
> Thanks for the response.
>
> The command li
Hey Ted,
This is my java code
package pkg;
import java.awt.Point;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io
Thanks for clarification, Owen!
> ALS starts from a random solution and this will result in a different
> solution. The overall problem is non-convex and the process will not
> necessarily converge to the same solution.
But doesn't alternation guarantee convexity?
> Randomness is a common feature o
I'm trying to visualize a Random Forest using ForestVisualizer (with the
output redirected to a file) and am getting an OutOfMemoryError:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringB
Well, you know, the issue is there, whether we like it or not.
Maybe replication is enough, maybe not.
If there is a workshop on that issue, it's on the radar.
http://beamtenherrschaft.blogspot.com/2013/06/acm-recsys-2013-workshop-on.html
On Mon, Jun 24, 2013 at 6:36 PM, Sean Owen wrote:
> Yeah
Yeah this has gone well off-road.
ALS is not non-deterministic because of hardware errors or cosmic
rays. It also has nothing to do with floating-point round-off, or
at least that is not the primary source of non-determinism, by
several orders of magnitude.
ALS starts from a random solution and th
This is an old chestnut that gets trotted out often, but I doubt that
the effects the OP was worried about were on the same scale.
Non-associativity of FP arithmetic on doubles rarely has a very large
effect.
On Mon, Jun 24, 2013 at 11:17 PM, Michael Kazekin wrote:
> Any algorithm is
Any algorithm is non-deterministic because of the non-deterministic behavior of
the underlying hardware, of course :) But that's off-topic. I'm talking about
a specific implementation of a specific algorithm, and in general I'd like to know
that at least some very general properties of the algorithm impleme
On Mon, Jun 24, 2013 at 5:43 PM, Dmitriy Lyubimov wrote:
> The non-determinism of parallel processing is well known. It was a
> joke, a reminder to be careful with absolute statements like "never exists",
> as they are very hard to prove. Bringing more positive examples still does
> not pr
The non-determinism of parallel processing is well known. It was a
joke, a reminder to be careful with absolute statements like "never exists",
as they are very hard to prove. Bringing more positive examples still does
not prove an absolute statement made, or make it any stronger from the ma
On Mon, Jun 24, 2013 at 5:07 PM, Dmitriy Lyubimov wrote:
> On Mon, Jun 24, 2013 at 1:35 PM, Michael Kazekin wrote:
>
> > I agree with you, I should have mentioned earlier that it would be good to
> > separate "noise from data" and deal with only what is separable. Of course
> > there is no
On Mon, Jun 24, 2013 at 1:35 PM, Michael Kazekin wrote:
> I agree with you, I should have mentioned earlier that it would be good to
> separate "noise from data" and deal with only what is separable. Of course
> there is no truly deterministic implementation of any algorithm,
I am pretty sure "2
On Mon, Jun 24, 2013 at 1:35 PM, Michael Kazekin wrote:
> I agree with you, I should have mentioned earlier that it would be good to
> separate "noise from data" and deal with only what is separable. Of course
> there is no truly deterministic implementation of any algorithm, but I
> would expect
I agree with you, I should have mentioned earlier that it would be good to
separate "noise from data" and deal with only what is separable. Of course
there is no truly deterministic implementation of any algorithm, but I would
expect to see "credible" results on a macro-level (in our case it wou
On Mon, Jun 24, 2013 at 1:07 PM, Michael Kazekin wrote:
> Thank you, Ted!
> Any feedback on the usefulness of such functionality? Could it increase
> the 'playability' of the recommender?
>
Almost all methods -- even deterministic ones -- will have a "credible
interval" of prediction simply becau
Thank you, Ted!
Any feedback on the usefulness of such functionality? Could it increase the
'playability' of the recommender?
> From: ted.dunn...@gmail.com
> Date: Mon, 24 Jun 2013 20:46:43 +0100
> Subject: Re: Consistent repeatable results for distributed ALS-WR recommender
> To: user@mahout.ap
I am guessing (comments welcome) that it is going to be difficult
to guarantee reproducibility under parallel execution conditions.
MapReduce has reduction in its name.
Reduction operations are the main cause of irreproducibility in parallel
codes, because changing the order of summations changes t
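
A small sketch of the summation-order effect, assuming plain left-to-right addition of doubles (this is an illustration, not code from Mahout or Hadoop, and the values are picked to make the effect visible): the same three numbers summed in two different orders give two different totals, because double addition is not associative.

```java
// Illustration only: the class name and the input values are made up for this example.
final class SumOrderDemo {
    // Plain left-to-right summation, as a sequential reduction would do it.
    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] oneOrder = {1e16, -1e16, 1.0};
        double[] otherOrder = {1.0, -1e16, 1e16};

        // The large terms cancel first, so the 1.0 survives.
        System.out.println(sum(oneOrder));   // prints 1.0
        // The 1.0 is absorbed into -1e16 (doubles are spaced 2 apart at that magnitude).
        System.out.println(sum(otherOrder)); // prints 0.0
    }
}
```

With realistic data the differences are usually far smaller than in this contrived case, but they are enough to make bitwise-identical results hard to guarantee when the reduction order is not fixed.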
See org.apache.mahout.common.RandomUtils#useTestSeed
It provides the ability to freeze the initial seed. Normally this is only
used during testing, but you could use it.
On Mon, Jun 24, 2013 at 8:44 PM, Michael Kazekin wrote:
> Thanks a lot!
> Do you know by any chance what are the underlying
Thanks a lot!
Do you know by any chance what the underlying reasons are for making such
random seed initialization mandatory?
Do you see any sense in providing another option, such as filling them with
zeros, in order to ensure consistency and repeatability? (For example, we
might want to
Ok, I'll have to look into this, because the last time I modified
CVB0Driver etc, I specifically tested it against cluster_reuters.sh, and it
did exactly what was expected (after having been broken for a while, at
least the doc-topic output part).
On Mon, Jun 24, 2013 at 10:05 AM, Suneel Marthi w
Here is an example of what's happening:
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/522/console
This was working fine.
From: Suneel Marthi
To: "user@mahout.apache.org"
Sent: Monday, June 24, 2013 1:05 PM
Subject: Re: Interpretating doc-
The matrices of the factorization are initialized randomly. If you fix the
random seed (would require modification of the code) you should get exactly
the same results.
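
As a rough sketch of why a fixed seed gives a repeatable starting point, using only java.util.Random and not Mahout's own classes (the class name, method name, matrix shape and seed value below are assumptions for illustration): two generators created with the same seed produce the same sequence, so the randomly initialized factor matrix is identical on every run.

```java
import java.util.Arrays;
import java.util.Random;

// Illustration only: plain JDK code, not the factorizer used by Mahout.
final class SeededInitDemo {
    // Fills a rows x rank matrix from a generator created with an explicit seed.
    static double[][] randomFactors(long seed, int rows, int rank) {
        Random rnd = new Random(seed);
        double[][] factors = new double[rows][rank];
        for (double[] row : factors) {
            for (int j = 0; j < rank; j++) {
                row[j] = rnd.nextDouble();
            }
        }
        return factors;
    }

    public static void main(String[] args) {
        double[][] first = randomFactors(42L, 3, 2);
        double[][] second = randomFactors(42L, 3, 2);
        // Same seed, same sequence, same initial matrix.
        System.out.println(Arrays.deepEquals(first, second)); // prints true
    }
}
```

A fixed seed only pins down the starting point; any order-dependent arithmetic later in a parallel run is a separate source of variation.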
On 24.06.2013 13:49, "Michael Kazekin" wrote:
> Hi!
> Should I assume that under same dataset and same parameters for factorizer
Hi!
Should I assume that, given the same dataset and the same parameters for the
factorizer and recommender, I will get the same results for any specific user?
My current understanding is that the ALS-WR algorithm could theoretically
guarantee this, but I was wondering whether there could be any numerical-method issues and/or
I have been seeing a similar error (running against trunk) when running
cluster_reuters.sh (with CVB clustering). It complains that "Output
directory: /reuters-lda already exists".
This was working fine a few weeks ago, not sure which of the intermediate code
changes against CVB0Driver, or C
Thanks for the response.
The command line I used is
mahout cvb -ow -dict sparse/dictionary.file-0 -i matrix/matrix -o
cvb/topics -dt cvb/classifications -block 2 -x 2 -cd 1e-10 -k2
-seed 6956 -tf 0.25
This completes with no errors in Mahout 0.7. With Mahout/cvb from trunk I get:
13/06/24 12:
Better would be to build a Hive UDF that vectorizes your data directly from
the Hive table and produces a sequence file with vectors ready to cluster.
Then use the streaming k-means stuff.
On Mon, Jun 24, 2013 at 4:43 PM, Chirag Lakhani wrote:
> What data base interfaces are there for Mahout?
What do you get out, and what exactly is your commandline invocation?
On Mon, Jun 24, 2013 at 6:58 AM, Mark Wicks wrote:
> As a slight correction to my earlier post on running cvb from the
> trunk, the NaN values were my mistake. However, I still haven't had
> any success getting it to write d
What database interfaces are there for Mahout? The website mentions
MongoDB and Cassandra. I get the feeling these are for recommender systems
only. Are there any databases that Mahout can interface with directly in order
to perform clustering?
I am thinking of an example where I have a large table
As a slight correction to my earlier post on running cvb from the
trunk, the NaN values were my mistake. However, I still haven't had
any success getting it to write document/topic inferences.
On Sat, Jun 22, 2013 at 7:21 AM, Mark Wicks wrote:
> I tried with cvb from trunk and ran into several p
What code?
On Mon, Jun 24, 2013 at 8:00 AM, Apurv Khare wrote:
> Hi,
>
> I am using clustering for one of my POC.
>
> My data looks like :
>
> Id  Gender  Education  Occupation  Income  Age  State  Marital Status
Hi,
I am using clustering for one of my POC.
My data looks like :
Id  Gender  Education  Occupation  Income  Age  State  Marital Status  children  Duration of Relationship
1   1       19         3           1       20   1      3               1         2
2   1       16         15          1       40   7      2               3         2
But for the Clustering I'm excluding the ID field, as it