Re: Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Jake Mannix
(1, 0, 0) and (10, 0, 0) have a very large distance in R^3, but 0 when projected onto a patch near the north pole of S^4, while other pairs of vectors may have (nearly) unchanged distances. Am I misunderstanding what the question was? On Thu, Jul 21, 2011 at 9:43 PM, Ted Dunning wrote: > Embe

Re: Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Ted Dunning
Embed onto a very small part of S^4 On Thu, Jul 21, 2011 at 9:14 PM, Jake Mannix wrote: > Think about it in 3-dimensions, how can this work? >

Re: Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Jake Mannix
Wait, this is impossible, not underspecified: if you have 4 vectors, x, y of length N, and z, w of length 1, and six pairwise distances: d_xy, d_yz, d_xz, d_xw, d_yw, d_zw. You want (d_xy / d_zw), (d_xz / d_yw), and (d_xy / d_xz) to all remain fixed after transformation? The first will stay fixed

Re: Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Ted Dunning
This is underspecified. Simply adding an additional large valued coordinate and normalizing back to the sphere will do what you want. This works because small regions of S^{n+1} are very close to R^n in terms of the Euclidean metric. This is rarely that useful, however, if your interest is c
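A minimal sketch of the trick described above, for illustration only: append one large constant coordinate to each vector, then scale to unit length. The function names and the constant `big` are hypothetical, not from the thread or from Mahout. The distance ratios are only approximately preserved, and the approximation improves as `big` grows relative to the vector lengths.

```python
import math

def embed_on_sphere(v, big=1000.0):
    """Append a large constant coordinate (assumed value), then normalize to unit length."""
    extended = list(v) + [big]
    norm = math.sqrt(sum(x * x for x in extended))
    return [x / norm for x in extended]

def dist(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three example vectors in R^3 with different lengths.
points = [(1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
embedded = [embed_on_sphere(p) for p in points]

# Compare one distance ratio before and after the embedding.
ratio_before = dist(points[0], points[1]) / dist(points[0], points[2])
ratio_after = dist(embedded[0], embedded[1]) / dist(embedded[0], embedded[2])
print(ratio_before, ratio_after)
```

All embedded points land in a tiny patch of the sphere, so absolute distances shrink by roughly a factor of `big` while their ratios stay close to the originals.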

Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Lance Norskog
I have vectors of different lengths and I would like to normalize them to a unit (hyper)sphere. However, I would like the pairwise distance ratios to be maintained. What transform does this? The use case for this is to make a vector set that uses cosine distances. -- Lance Norskog goks...@gmail.

Re: Treating User Demographics as (Pseudo) Items?

2011-07-21 Thread Lance Norskog
How do 'stacked' recommenders (like the Netflix winners) work? On Wed, Jul 20, 2011 at 9:22 PM, Jamey Wood wrote: > Great.  Thanks, Ted! > > --Jamey > > On Wed, Jul 20, 2011 at 9:57 PM, Ted Dunning wrote: > >> Oh... you do have to be careful with this a bit because some of these side >> factors

Re: Wald's Test / parameter significance tests (Logistic Regression)

2011-07-21 Thread Ted Dunning
Doing variable selection using a chi^2 statistic like Wald's or the log likelihood ratio is a very dangerous thing in high dimensional spaces that are the target of the SGD framework in Mahout. The problem is that the variable selection itself can over-fit. To address this problem, I suggest tha

FW: meanshift reduce task problem

2011-07-21 Thread Jeff Eastman
+dev +user r1149369 implements the previous MAHOUT-749 patch that introduces support for multiple reducers (specified by -Dmapred.reduce.tasks=N) for improved scalability beyond the default of 1. The heuristic sends the clusters produced by each mapper to 1 of N reducers in a round-robin fashio
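A rough sketch of what a round-robin assignment of clusters to N reducers can look like, for illustration only. This is not the MAHOUT-749 code; the function name and the use of list position as the rotation counter are assumptions.

```python
def assign_round_robin(cluster_ids, num_reducers):
    """Distribute cluster ids over num_reducers buckets in rotating order."""
    buckets = [[] for _ in range(num_reducers)]
    for position, cluster_id in enumerate(cluster_ids):
        # The i-th cluster goes to reducer i mod N.
        buckets[position % num_reducers].append(cluster_id)
    return buckets

# Seven clusters from one mapper spread over three reducers.
assignment = assign_round_robin(list(range(7)), 3)
print(assignment)
```

Rotation keeps the per-reducer load within one cluster of even, without looking at the cluster contents.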

Wald's Test / parameter significance tests (Logistic Regression)

2011-07-21 Thread Svetlomir Kasabov
Hello, I plan to use Mahout's OnlineLogisticRegression for probability estimation. I have extracted many parameters for my classification situation and I want to test how each of them affects the target variable. Can I use Mahout to check this significance (for example using Wald's test or Logl

Re: Evaluating boolean preference data sets

2011-07-21 Thread Marko Ciric
Also the evaluation could be done per user, and thus manually running multiple times per each user. Or simple defining a matrix with relevant items per each user.. On Jul 21, 2011 4:18 PM, "Marko Ciric" wrote: > Yes, there should exist an evaluation that allows you to pass which items > are releva

Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Jeffrey
Hi Jeff, lol, this is probably my last reply before I fall asleep (GMT+8 here). First things first, data file is here: http://coolsilon.com/image-tag.mvc Q: What is the cardinality of your vector data? about 1000+ rows (resources) * 14 000+ columns (tags) Q: Is it sparse or dense? sparse (assumin

RE: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Jeff Eastman
Excellent, so this appears to be localized to fuzzyk. Unfortunately, the Apache mail server strips off attachments so you'd need another mechanism (a JIRA?) to upload your data if it is not too large. Some more questions in the interim: - What is the cardinality of your vector data? - Is it spar

Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Jeffrey
Hi Jeff, Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)? A: Yes :) Q: Did you add the -cl argument? A: Yes :) $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite

Re: Evaluating boolean preference data sets

2011-07-21 Thread Marko Ciric
Yes, there should exist an evaluation that allows you to pass which items are relevant. On the other hand, generally speaking, I am also trying to evaluate with having relevant items all chosen randomly. Maybe both implementations should exist. On 21 July 2011 15:59, Sean Owen wrote: > You mean,

Re: ItemSimilarity pre-processing

2011-07-21 Thread Abmar Barros
Thanks a lot Sean! I'll try this here. Regards, Abmar On Thu, Jul 14, 2011 at 12:51 PM, Sean Owen wrote: > yes that would probably be just fine for you too. > > On Thu, Jul 14, 2011 at 4:14 PM, Abmar Barros wrote: > > > Thanks for the reply Sean, > > > > Another doubt: Does the ReloadFromJDBCD

Re: Evaluating boolean preference data sets

2011-07-21 Thread Sean Owen
You mean, have the user specify all items that are considered relevant? yes that could be useful. Do you have a patch in mind? Your analysis is correct, and I would not call it a bug. It's a symptom of how little information the evaluation has to work with here without ratings. It has to pick rand

RE: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Jeff Eastman
You are correct, the wiki for fkmeans did not mention the -cl argument. I've added that just now. I think this is what Frank means in his comment but you do *not* have to write any custom code to get the cluster dumper to do what you want, just use the -cl argument and specify clusteredPoints as

Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Frank Scholten
Hi Jeffrey, Fuzzy kmeans outputs a [Cluster ID, WeightedVectorWritable] file under clusters/clusteredPoints and a [Cluster ID, SoftCluster] file under clusters/clusters-*, you don't need to write code for that. However if you want to display your clusters in an application, along with nice labels

Evaluating boolean preference data sets

2011-07-21 Thread Marko Ciric
Hi guys, I wonder if Mahout should have a "precision and recall" evaluator that calculates the relevant items data set without looking at the relevance threshold. This would be suitable for boolean preference data sets. In addition, the relevant items can be removed from the training d
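For illustration only: a minimal sketch of precision and recall at N when the relevant items are passed in explicitly instead of being derived from a rating threshold. This is not Mahout's evaluator; the function and parameter names are hypothetical.

```python
def precision_recall_at_n(recommended, relevant, n):
    """Compute precision and recall of the top-n recommendations against a given relevant set."""
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# One user: four ranked recommendations, three held-out relevant items.
precision, recall = precision_recall_at_n(["a", "b", "c", "d"], {"b", "d", "e"}, 3)
print(precision, recall)
```

With boolean data every held-out item counts as relevant, so the result depends only on which items are held out, not on any preference value.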

Re: Connection Pooling

2011-07-21 Thread Marko Ciric
Actually, the GenericDataModel class works very well as a superclass of your desired data model. This way everything is cached in memory, which boosts performance a lot. Reloading is easy to implement with the refresh mechanism (Taste objects implement Refreshable interface). You c

Re: Problem with method Plus in the Vector class

2011-07-21 Thread marco turchi
Dear Sean, thanks a lot, I'll update the jar Thanks Marco On Thu, Jul 21, 2011 at 1:06 PM, Sean Owen wrote: > Going waaay back to the original question -- I can't reproduce this in the > latest code. > > The result ought to be "{}" since RandomAccessSparseVector will only print > entries that

Re: Problem with method Plus in the Vector class

2011-07-21 Thread Sean Owen
Going waaay back to the original question -- I can't reproduce this in the latest code. The result ought to be "{}" since RandomAccessSparseVector will only print entries that have a value set and are not defaulted to 0.0. And it's smart enough in this case to remove the entry you have set since i

Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Jeffrey
Hi again, Let me update on what's working and what's not working. Works: fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip fkmeans clustering (5 clusters) clusterdump (5 clusters) - so points are not included in the clusterdump and I need to write a program for it? Not Working: fk

Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Jeffrey
Hi Jeff, Thanks for the help :) Oh, I didn't know there is this --cl argument (because the documentation that I rely on https://cwiki.apache.org/confluence/display/MAHOUT/fuzzy-k-means-commandline doesn't list it). I will try again later. I was told that the CLI fkmeans utility doesn't attach poi