Re: HMM investigations

2011-07-24 Thread Ted Dunning
On Sun, Jul 24, 2011 at 7:52 AM, Dhruv dhru...@gmail.com wrote: ... If you look into the *definition* of HMM, the hidden sequence is drawn from only one set. The hidden sequence's transitions can be expressed as a joint probability p(s0, s1). Similarly the observed sequence has a joint
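
For reference, the factorization the thread is pointing at is the standard HMM joint probability (notation mine, not quoted from the message): the hidden states form a Markov chain over a single state set, and each observation depends only on the hidden state at the same step.

    p(s_0, ..., s_T, o_0, ..., o_T) = p(s_0) p(o_0 | s_0) \prod_{t=1}^{T} p(s_t | s_{t-1}) p(o_t | s_t)

so any adjacent pair of hidden states has the joint probability p(s_{t-1}, s_t) = p(s_{t-1}) p(s_t | s_{t-1}).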

Re: HMM investigations

2011-07-24 Thread Ted Dunning
, emittedState) method to compute the output probability for a particular hidden state. I believe this is not what the user wanted? Dhruv On Sun, Jul 24, 2011 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Sun, Jul 24, 2011 at 7:52 AM, Dhruv dhru...@gmail.com wrote

Re: HMM investigations

2011-07-24 Thread Ted Dunning
is not that bad an idea? I've read that we can do it with PCA (Principal Components Analysis). Is there Mahout code for this somewhere? Thanks a lot once again, Svetlomir. On 24.07.2011 20:46, Ted Dunning wrote: My impression (and Svetlomir should correct me) is that the intent was to use two

Re: HMM investigations

2011-07-24 Thread Ted Dunning
24.07.2011 21:15, Ted Dunning wrote: I remember this problem. Is it possible for you to post some sample data? On Sun, Jul 24, 2011 at 12:08 PM, Svetlomir Kasabov skasa...@smail.inf.fh-brs.de wrote: Hello again and thanks for the replies of both of you, I really appreciate them

Re: Preserving pairwise distances while normalizing vectors

2011-07-22 Thread Ted Dunning
onto a patch near the north pole of S^4, while other pairs of vectors may have (nearly) unchanged distances. Am I misunderstanding what the question was? On Thu, Jul 21, 2011 at 9:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: Embed onto a very small part of S^4 On Thu

Re: Broken links

2011-07-22 Thread Ted Dunning
It is a family relationship for the most part. Mahout came from the Lucene community. Mahout still uses Lucene. Some Lucene users use Mahout, but Lucene and Solr themselves do not depend on Mahout. On Fri, Jul 22, 2011 at 2:57 PM, Joanne Sun joanneh...@gmail.com wrote: Hi I have a humble

Re: Wald's Test / parameter significance tests (Logistic Regression)

2011-07-21 Thread Ted Dunning
Doing variable selection using a chi^2 statistic like Wald's or the log likelihood ratio is a very dangerous thing in high dimensional spaces that are the target of the SGD framework in Mahout. The problem is that the variable selection itself can over-fit. To address this problem, I suggest

Re: Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Ted Dunning
This is underspecified. Simply adding an additional large valued coordinate and normalizing back to the sphere will do what you want. This works because small regions of S^{n+1} are very close to R^n in terms of the Euclidean metric. This is rarely that useful, however, if your interest is
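
A minimal Java sketch of the construction (class, constants and names are mine, not from the thread): append a large constant coordinate c to each vector and renormalize onto the unit sphere; the points land in a small cap where pairwise Euclidean distances come out as roughly the original distances scaled by 1/c.

    import java.util.Arrays;

    public class SphereEmbedding {
      // Append a large constant coordinate and normalize onto the unit sphere.
      static double[] embed(double[] x, double c) {
        double[] y = Arrays.copyOf(x, x.length + 1);
        y[x.length] = c;
        double norm = 0;
        for (double v : y) { norm += v * v; }
        norm = Math.sqrt(norm);
        for (int i = 0; i < y.length; i++) { y[i] /= norm; }
        return y;
      }

      static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { s += (a[i] - b[i]) * (a[i] - b[i]); }
        return Math.sqrt(s);
      }

      public static void main(String[] args) {
        double c = 1000;                       // large compared to the data
        double[] p = {1.0, 2.0, 3.0};
        double[] q = {1.5, 1.0, 2.5};
        // The rescaled spherical distance is very close to the original distance.
        System.out.println(dist(p, q));
        System.out.println(c * dist(embed(p, c), embed(q, c)));
      }
    }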

Re: Preserving pairwise distances while normalizing vectors

2011-07-21 Thread Ted Dunning
Embed onto a very small part of S^4 On Thu, Jul 21, 2011 at 9:14 PM, Jake Mannix jake.man...@gmail.com wrote: Think about it in 3-dimensions, how can this work?

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
You constructed the first vector with a dimension of 1. It looks like you constructed the second one with a larger dimension of 2. When you offset a sparse vector, all of the zeros become non-zero and the vector becomes dense. This results in a bunch of cells being created. On Wed, Jul 20,
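
A sketch of the behavior being described, written against the Mahout math API as I recall it (treat the exact classes and method names as assumptions rather than gospel): adding vectors with different cardinalities fails, and offsetting a sparse vector by a scalar produces an effectively dense result because every implicit zero becomes non-zero.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class PlusExample {
      public static void main(String[] args) {
        Vector small = new RandomAccessSparseVector(1);   // cardinality 1
        Vector big = new RandomAccessSparseVector(2);     // cardinality 2
        // small.plus(big) fails because the two cardinalities differ.

        Vector sparse = new RandomAccessSparseVector(100000);
        sparse.set(42, 3.0);
        System.out.println(sparse.getNumNondefaultElements()); // 1
        Vector shifted = sparse.plus(1.0);                      // scalar offset
        // Every zero entry is now 1.0, so the result has ~100000 stored cells.
        System.out.println(shifted.getNumNondefaultElements());
      }
    }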

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
of entries in the final vector. Thanks a lot for your help Marco On 20 Jul 2011, at 17:42, Ted Dunning wrote: You constructed the first vector with a dimension of 1. It looks like you constructed the second one with a larger dimension of 2. When you offset a sparse vector, all

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
Nah... just the kind of blindness that keeps me from seeing the blueberries on the second shelf. Happens all the time in my world. On Wed, Jul 20, 2011 at 10:25 AM, Benson Margulies bimargul...@gmail.com wrote: My strong expectation is that this is a case of refrigerator blindness. Small

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
approach, because it becomes very time- and computationally expensive. Is there any implementation of an approximate way to compute it in Mahout? I have had a look in the library, but I do not find it. thanks for your help Marco On 20 Jul 2011, at 19:15, Ted Dunning wrote: Well

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
Just use a frequency weighted cosine distance and index words and anomalously common cooccurrences. That gives you pretty much all you are asking for. Also, your progressive increase approach sounds a lot like k-means. You might take a look to see if that could help. On Wed, Jul 20, 2011 at

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
cooccurrences, but I'll investigate. Thanks a lot Marco On 20 Jul 2011, at 20:36, Ted Dunning wrote: frequency weighted cosine distance

Re: Problem with method Plus in the Vector class

2011-07-20 Thread Ted Dunning
useful suggestions Marco On 20 Jul 2011, at 23:38, Ted Dunning wrote: Actually, I would suggest weighting words by something like tf-idf weighting. http://en.wikipedia.org/wiki/Tf%E2%80%93idf log or sqrt(tf) is often good instead of linear tf
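
Spelled out, the weighting being suggested is ordinary tf-idf with a sublinear term frequency (standard formulas, nothing Mahout-specific):

    weight(t, d) = tf'(t, d) * log(N / df(t))

where N is the number of documents, df(t) is the number of documents containing term t, and tf'(t, d) is 1 + log(tf(t, d)) or sqrt(tf(t, d)) rather than the raw count tf(t, d).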

Re: Treating User Demographics as (Pseudo) Items?

2011-07-20 Thread Ted Dunning
Yes. This can work. And it can go both ways since you might do something like combine recommendations for a specific book with more general recommendations for a specific author or genre. You can also have recommendations for, say, an author or genre based on demographic quantities such as

Re: Treating User Demographics as (Pseudo) Items?

2011-07-20 Thread Ted Dunning
of prevalence can seriously impact your algorithm run-time (adversely). You can compensate for this by sampling or just recognizing that such pervasive features inherently cannot be very useful since too many things would be recommended. On Wed, Jul 20, 2011 at 8:51 PM, Ted Dunning ted.dunn

Re: Including Unrecommendable Items

2011-07-18 Thread Ted Dunning
I usually just post process the recommendations using a variety of business logic rules. Sent from my iPhone On Jul 18, 2011, at 14:26, Jamey Wood jamey.w...@gmail.com wrote: Is there any best practice for including user preferences for certain items as a Recommender input, but ensuring

Re: Including Unrecommendable Items

2011-07-18 Thread Ted Dunning
Yes... I always forget about that. You must have mentioned this half a dozen times. On Mon, Jul 18, 2011 at 3:10 PM, Sean Owen sro...@gmail.com wrote: (PS that's exactly Rescorer's role... just a hook for whatever biz logic you want to filter by) On Mon, Jul 18, 2011 at 10:52 PM, Ted

Re: what file format is required by naive bayes classfier?

2011-07-17 Thread Ted Dunning
You have the source code. You can make it do anything you like! On Sun, Jul 17, 2011 at 7:28 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: Hi , Can we use CSV without header or something else? regards, Xiaobo Gu

Re: Clustering demographic data

2011-07-15 Thread Ted Dunning
A typical work-flow for this is to define a disjoint set of demographic groups and then train a classifier that has access to user actions and free geo-demographic data such as IP, geo-IP, time of day and email domain. If you have meta-data from the actions, then you can augment these variables

Re: similarity metrics?

2011-07-13 Thread Ted Dunning
You would have to encode the distributions as vectors. For discrete distributions, I think that this is relatively trivial since you could interpret each vector entry as the probability for an element i of the domain of the distribution. I think that would result in the Hellinger distance [1]
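
The standard definition being referenced (not spelled out in the snippet): for discrete distributions P and Q with probabilities p_i and q_i,

    H(P, Q)^2 = (1/2) \sum_i (\sqrt{p_i} - \sqrt{q_i})^2

i.e. H(P, Q) = (1/\sqrt{2}) ||\sqrt{p} - \sqrt{q}||_2, the Euclidean distance between the square-rooted probability vectors, up to a constant.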

Re: similarity metrics?

2011-07-13 Thread Ted Dunning
If you need this distance, please go for it! The procedure for publishing the results (or the first attempts) is to file a JIRA (see issues.apache.org/jira/browse/MAHOUT ) and attach patches to the JIRA for review or comment. On Wed, Jul 13, 2011 at 2:55 PM, Ian Upright ian-pub...@upright.net

Re: What's the accuracy of random forests in Mahout?

2011-07-12 Thread Ted Dunning
I don't believe that Mahout's random forests have been used in production. I have heard that some people got pretty good results in testing. On Tue, Jul 12, 2011 at 6:03 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: Hi, When the training data set can be loaded into memory, or each split can

Re: Logistic Regression: number of positives and negatives

2011-07-11 Thread Ted Dunning
Downsampling negatives should make little difference to accuracy. It can substantially affect training time however. Sent from my iPhone On Jul 11, 2011, at 6:56, Svetlomir Kasabov skasa...@smail.inf.fh-brs.de wrote: Hello, I plan using logistic regression for predicting the probability

Re: combination of features worsen the performance

2011-07-11 Thread Ted Dunning
. Feature A is about the Advertisement itself; Feature B is about the user's behaviors; Currently I'm only using features A and B. Total training data is 250 for each class; thanks.. From: Ted Dunning [ted.dunn...@gmail.com] Sent: Monday, July 11, 2011 2:15 PM

Re: combination of features worsen the performance

2011-07-11 Thread Ted Dunning
an ad, or not; so 3 classes. Feature A is about the Advertisement itself; Feature B is about the user's behaviors; Currently I'm only using features A and B. Total training data is 250 for each class; thanks.. From: Ted Dunning [ted.dunn

Re: Plagiarism - document similarity

2011-07-11 Thread Ted Dunning
Easier to simply index all, say, three word phrases and use a TF-IDF score. This will give you a good proxy for sequence similarity. Documents should either be chopped on paragraph boundaries to have a roughly constant length or the score should not be normalized by document length. Log
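
A minimal Java sketch of the indexing step (the tokenization and names are mine): slide a three-word window over the text and treat each phrase as a term, then score documents with ordinary tf-idf.

    import java.util.ArrayList;
    import java.util.List;

    public class Shingles {
      // Turn a document into overlapping three-word phrases ("shingles").
      static List<String> threeWordShingles(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 2 < words.length; i++) {
          shingles.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        return shingles;
      }

      public static void main(String[] args) {
        System.out.println(threeWordShingles("the quick brown fox jumps over the lazy dog"));
      }
    }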

Re: Clustering with id

2011-07-11 Thread Ted Dunning
Can you give specific examples? The process should be relatively straightforward and the implication that rows have row labels that are defined by the left operand of a product and columns have column labels that are defined by the right operand should be sufficient. Sums should have the same

Re: Singular vectors of a recommendation Item-Item space

2011-07-10 Thread Ted Dunning
Also, item-item similarity is often (nearly) the result of a matrix product. If yours is, then you can decompose the user x item matrix and the desired eigenvalues are the singular values squared and the eigenvectors are the right singular vectors for the decomposition. On Sun, Jul 10, 2011 at
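
Written out, the relationship being used: if the item-item matrix is the cross product S = A' A of the user x item matrix A, and A = U \Sigma V' is the SVD of A, then

    S = A' A = V \Sigma U' U \Sigma V' = V \Sigma^2 V'

so the eigenvectors of S are the right singular vectors V of A, with eigenvalues equal to the squared singular values.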

Re: Dimensional Reduction via Random Projection: investigations

2011-07-09 Thread Ted Dunning
? Are there specific ways to translate these numbers into probabilistic estimates? Is it just way too hairy? Lance On Thu, Jul 7, 2011 at 10:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: This means that the rank 2 reconstruction of your matrix is close to your original in the sense that the Frobenius

Re: Dimensional Reduction via Random Projection: investigations

2011-07-09 Thread Ted Dunning
, with 0.1 percent of the singular vectors missing. What is my confidence in the output data? On Sat, Jul 9, 2011 at 11:46 AM, Ted Dunning ted.dunn...@gmail.com wrote: I don't understand the question. A rotation leaves the Frobenius norm unchanged. Period. Any rank-limited optimal least

Re: Dimensional Reduction via Random Projection: investigations

2011-07-08 Thread Ted Dunning
://en.wikipedia.org/wiki/Singular_value_decomposition#Low-rank_matrix_approximation On Fri, Jul 8, 2011 at 1:53 AM, Lance Norskog goks...@gmail.com wrote: Thanks! Very illuminating. On Thu, Jul 7, 2011 at 10:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: This means that the rank 2

Re: Logistic Regression: poor results on small data set

2011-07-08 Thread Ted Dunning
On Thu, Jul 7, 2011 at 2:20 PM, hakeem t...@indeed.com wrote: Because I have so few documents, I run the set of documents through train() in epochs -- up to 1000 times, shuffling the order of the documents on each epoch. Fair. My questions: 1) Are these results surprising to you? Or,

Re: Logistic Regression: poor results on small data set

2011-07-08 Thread Ted Dunning
If you keep the probes at 2, you should have better results with sparse features and a large dimensionality reduction. On Thu, Jul 7, 2011 at 5:58 PM, hakeem t...@indeed.com wrote: I increased the vector size substantially and reduced the number of probes to 1. With the collisions eliminated,

Re: What's the difference between classic decision tree and Mahout Decision forest algorithm?

2011-07-07 Thread Ted Dunning
The summary of the reason is that this was a summer project and parallelizing the random forest algorithm at all was a big enough project. Writing a single pass on-line algorithm was considered a bit much for the project size. Figuring out how to make multiple passes through an input split was

Re: Dimensional Reduction via Random Projection: investigations

2011-07-07 Thread Ted Dunning
Random Projection, a lame random number generator (java.lang.Random) will generate a higher standard deviation than a high-quality one like MurmurHash. On Fri, Jul 1, 2011 at 5:25 PM, Ted Dunning ted.dunn...@gmail.com wrote: Here is R code that demonstrates what I mean by stunning (aka

Re: Available datasets for recommendations

2011-07-07 Thread Ted Dunning
Those are both reasonably large, but not commercial in scale. At Veoh, we had about 10 non-zero elements in our raw data. I think Netflix has 100 million. On Thu, Jul 7, 2011 at 8:05 PM, Lance Norskog goks...@gmail.com wrote: What recommendation datasets, that are available, are considered

Re: File format question when write map-reduce applications

2011-07-06 Thread Ted Dunning
Of course, this is only true of the TextInputFormat. You can write a CsvInputFormat in which every mapper reads the first line as well as their assigned split. This would cause some delay at the beginning as all of the first round of mappers whacked against the beginning of the file, but that

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-05 Thread Ted Dunning
pick a female or male from a height, weight and shoe size. Thanks again for taking the time to answer me. -V On Tue, Jul 5, 2011 at 4:30 AM, Ted Dunning ted.dunn...@gmail.com wrote: The wikipedia page recommends binning if you have a large amount of data and a supervised variable

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-05 Thread Ted Dunning
the model. But this isn't working out for me. Thanks for taking a look. Cheers, V On Tue, Jul 5, 2011 at 6:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: How many training examples do you have? Sounds like you have very few. That is definitely not the sweet spot for on-linear

Re: Tranforming data for k-means analysis

2011-07-05 Thread Ted Dunning
Glad we could help. On Tue, Jul 5, 2011 at 7:09 AM, Radek Maciaszek ra...@maciaszek.co.uk wrote: Hello, I worked in the past on MSc project which involved quite a lot of Mahout calculation. I finished it a while ago but only recently got my head around posting it somewhere online. It would

Re: How could I use bayse model with my C++ online classifier

2011-07-05 Thread Ted Dunning
Well, PMML is the (complicated) standard solution. Otherwise, a Naive Bayes model would probably fit as CSV data. But seriously, it isn't that hard to read a sequence file. Re-implementing our serialization in C++ would be generally useful as well. On Tue, Jul 5, 2011 at 7:38 PM, Lance Norskog

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-04 Thread Ted Dunning
The Mahout implementation of Naive Bayes does not use continuous variables well. The best bet is to discretize these variables either individually or together using k-means. Then use the discrete version for the classifier. The random forest implementation and the SGD implementation are both

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-04 Thread Ted Dunning
The wikipedia page recommends binning if you have a large amount of data and a supervised variable extraction method if not. These are both ways of preprocessing to discretize continuous variables. On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning ted.dunn...@gmail.com wrote: The mahout

Re: Introducing randomness into my results

2011-07-03 Thread Ted Dunning
On Sat, Jul 2, 2011 at 11:34 AM, Sean Owen sro...@gmail.com wrote: Yes that's well put. My only objection is that this sounds like you're saying that there is a systematic problem with the ordering, so it will usually help to pick any different ordering than the one you thought was optimal.

Re: Introducing randomness into my results

2011-07-03 Thread Ted Dunning
That is the point of the exponential in the example that I gave you. The top few recommendations are nearly stable. It is the lower ranks that are really churned up. This has the property that you state. On Sat, Jul 2, 2011 at 12:45 PM, Salil Apte sa...@offlinelabs.com wrote: I really like

Re: Dimensional Reduction via Random Projection: investigations

2011-07-03 Thread Ted Dunning
I would be very surprised if java.lang.Random exhibited this behavior. It isn't *that* bad. On Sat, Jul 2, 2011 at 6:49 PM, Lance Norskog goks...@gmail.com wrote: For full Random Projection, a lame random number generator (java.lang.Random) will generate a higher standard deviation than a

Re: Dimensional Reduction via Random Projection: investigations

2011-07-03 Thread Ted Dunning
into spinning chains is very educational about entropy. For full Random Projection, a lame random number generator (java.lang.Random) will generate a higher standard deviation than a high-quality one like MurmurHash. On Fri, Jul 1, 2011 at 5:25 PM, Ted Dunning ted.dunn...@gmail.com wrote

Re: Introducing randomness into my results

2011-07-03 Thread Ted Dunning
On Sun, Jul 3, 2011 at 1:08 PM, Sean Owen sro...@gmail.com wrote: I don't see why one would believe that the randomly selected items farther down the list are more likely to engage a user. If anything, the recommender says they are less likely to be engaging. There are two issues with this

Re: Introducing randomness into my results

2011-07-03 Thread Ted Dunning
Roughly. But remember, a single recommendation isn't the end of the game. If this is the last recommendation to ever be made, dithering doesn't help at all. On Sun, Jul 3, 2011 at 1:02 PM, Konstantin Shmakov kshma...@gmail.com wrote: It seems that as long as recommenders are dealing with the

Re: Similarity between users' groups

2011-07-02 Thread Ted Dunning
, Radek On 18 February 2011 18:04, Sebastian Schelter s...@apache.org wrote: This shouldn't be too difficult and would maybe make a good newcomer or student project. --sebastian On 18.02.2011 18:19, Ted Dunning wrote: A better way to sample is to find groups with a very large number

Re: Similarity between users' groups

2011-07-02 Thread Ted Dunning
It is pretty easy to set up a reservoir sampler as a combiner and as the front end to a reducer. Sent from my iPhone On Jul 2, 2011, at 14:22, Lance Norskog goks...@gmail.com wrote: How to do this in an efficient way? No idea.
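
A minimal Java sketch of the kind of reservoir sampler that could back such a combiner (class and method names are mine, not Mahout's): it keeps a fixed-size uniform sample of a stream of unknown length, so the same logic can run once in the combiner and again at the front of the reducer.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class Reservoir<T> {
      private final int capacity;
      private final List<T> sample = new ArrayList<>();
      private final Random random = new Random();
      private long seen = 0;

      public Reservoir(int capacity) { this.capacity = capacity; }

      // After n items, each of them is in the sample with probability capacity / n.
      public void add(T item) {
        seen++;
        if (sample.size() < capacity) {
          sample.add(item);
        } else {
          long j = (long) (random.nextDouble() * seen);
          if (j < capacity) {
            sample.set((int) j, item);
          }
        }
      }

      public List<T> sample() { return sample; }
    }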

Re: Hadoop version compatibility.

2011-07-01 Thread Ted Dunning
You have to watch out, however, because Hadoop wire format changes pretty often. On Fri, Jul 1, 2011 at 7:21 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: I mean mahout using the higher version of Hadoop libraries connecting to a lower version of Hadoop cluster. On Fri, Jul 1, 2011 at 8:42 PM,

Re: Dimensional Reduction via Random Projection: investigations

2011-07-01 Thread Ted Dunning
Lance, You would get better results from the random projection if you did the first part of the stochastic SVD. Basically, you do the random projection: Y = A \Omega where A is your original data, \Omega is the random matrix and Y is the result. Y will be tall and skinny. Then, find an
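
For reference, the first steps of the stochastic SVD recipe being sketched (the standard Halko/Martinsson/Tropp construction, stated in the snippet's notation):

    Y = A \Omega        (tall and skinny: n x k for an n x m matrix A)
    Y = Q R             (thin QR; Q has orthonormal columns)
    B = Q' A            (small: k x m)
    B = U_B \Sigma V'   so that A is approximately (Q U_B) \Sigma V'

The Q from the QR step is the orthonormal basis of the projected data Y.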

Re: Dimensional Reduction via Random Projection: investigations

2011-07-01 Thread Ted Dunning
. The standard deviation of the ratios gives a rough-and-ready measure of the fidelity of the reduction. The standard deviation of simple RP should be highest, then this RP + orthogonalization, then MDS. On Fri, Jul 1, 2011 at 11:03 AM, Ted Dunning ted.dunn...@gmail.com wrote: Lance, You would get

Re: Dimensional Reduction via Random Projection: investigations

2011-07-01 Thread Ted Dunning
, that's close. let's try a hundred dot products dot1 = rep(0,100);dot2 = rep(0,100) for (i in 1:100) {dot1[i] = sum(a[1,] * a[i,]); dot2[i] = sum(aa[1,]* aa[i,])} # how close to the same are those? max(abs(dot1-dot2)) # VERY [1] 3.45608e-11 On Fri, Jul 1, 2011 at 4:54 PM, Ted Dunning ted.dunn

Re: Incorrect calculation of pdf

2011-06-28 Thread Ted Dunning
- exponentiate the result. This will not change the function's expected result On Mon, Jun 27, 2011 at 9:03 PM, Ted Dunning ted.dunn...@gmail.com wrote: Actually, pdf() should always be a pdf(), not a logPdf(). Many algorithms want one or the other. Some don't much care because log is monotonic

Re: Incorrect calculation of pdf

2011-06-28 Thread Ted Dunning
not know if it will work OK: Do all calculations on logarithmic level and just before return - exponentiate the result. This will not change the function's expected result On Mon, Jun 27, 2011 at 9:03 PM, Ted Dunning ted.dunn...@gmail.com wrote: Actually, pdf() should always be a pdf

Re: Incorrect calculation of pdf

2011-06-27 Thread Ted Dunning
There should not be a change to an existing method. It would be fine to add another method, perhaps called logPdf, that does what you suggest. This loss of precision is common with the normal distribution in high dimensions. On Mon, Jun 27, 2011 at 1:49 AM, Vasil Vasilev vavasi...@gmail.com

Re: Hybrid RecSys — ways to do it

2011-06-27 Thread Ted Dunning
, 2011 at 12:57 AM, Marko Ciric ciric.ma...@gmail.com wrote: Thanks Ted. Do you think weights (that depend on mentioned features) can be learned with simple linear regression once the outputs of Mahout recommenders are known? On 10 June 2011 08:02, Ted Dunning ted.dunn...@gmail.com wrote: When

Re: Incorrect calculation of pdf

2011-06-27 Thread Ted Dunning
be to create a new Model and ModelDistribution that uses log arithmetic of your choosing. The initial models are very simple minded and are likely not adequate for real applications. -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Monday, June 27, 2011 7:51 AM To: user

Re: Yahoo's LDA code

2011-06-27 Thread Ted Dunning
Indeed. On Mon, Jun 27, 2011 at 5:27 PM, Hector Yee hector@gmail.com wrote: So I tried Yahoo LDA on 52 M documents with 1000 topics. Yahoo LDA with a dictionary of 100k terms does 1 iteration every 30 minutes on a single machine using 4 cores. Mahout LDA using 20 nodes and a

Re: An inmemory sparse matrix multiplier

2011-06-26 Thread Ted Dunning
Regarding speed: How many non-zero elements? What is the size of your input matrices? How long does it take to read the matrices without doing any multiplication? Your test matrices seem small for big sparse matrices. This sort of thing could be very useful. On Sun, Jun 26, 2011 at 1:47 PM,

Re: An inmemory sparse matrix multiplier

2011-06-26 Thread Ted Dunning
And going down the columns in a sparse matrix could do this to you. On Sun, Jun 26, 2011 at 6:40 PM, Jake Mannix jake.man...@gmail.com wrote: On Sun, Jun 26, 2011 at 1:47 PM, Vincent Xue xue@gmail.com wrote: Hi. I was wondering how useful an in memory sparse matrix multiplier would

Re: Adding dimensions to an existing TF-IDF vector

2011-06-25 Thread Ted Dunning
that cover how (and also why) it works, check out http://hunch.net/~jl/projects/hash_reps/index.html On Sat, Jun 25, 2011 at 1:51 AM, Ted Dunning ted.dunn...@gmail.com wrote: Look at the class FeatureValueEncoder. The test cases show most of the ways that is used. Also the class

Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.

2011-06-25 Thread Ted Dunning
overhead running on a single machine or there are other implications to run big jobs on a single machine? - edwin On Jun 24, 2011, at 7:11 PM, Ted Dunning wrote: I have done this with VM's but I would not generally recommend it. Without VM's you will have a pretty ugly configuration issue

Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.

2011-06-25 Thread Ted Dunning
I have had best results with somewhat beefier machines because you pay less VM overhead. Typical Hadoop configuration advice lately is 4GB per core and 1 disk spindle per two cores. For higher performance systems like MapR, the number of spindles can go up. On Sat, Jun 25, 2011 at 2:21 AM, Sean

Re: Adding dimensions to an existing TF-IDF vector

2011-06-24 Thread Ted Dunning
It is quite possible. If the new columns represent a relatively small contribution rather than a wholesale change in the statistics of the corpus (which is almost always true) then you can just add these columns and compute IDF weights for the new terms based on the updated corpus statistics.

Re: Should threadcount and poolsize of AdaptiveLogisticRegression be the same?

2011-06-24 Thread Ted Dunning
Shouldn't matter. On Fri, Jun 24, 2011 at 3:04 AM, XiaoboGu guxiaobo1...@gmail.com wrote: And should we call setPoolsize first, then call setThreadCount after that? Regards, Xiaobo Gu

Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.

2011-06-24 Thread Ted Dunning
Big iron is fine for some of the classifier stuff, but throughput per $ can be higher for other algorithms with a cluster of smaller machines. How big a machine are you talking about? Even relatively small machines are pretty massive any more. 8 core = 16 hyper-thread machines with 48GB seem to

Re: Adding dimensions to an existing TF-IDF vector

2011-06-24 Thread Ted Dunning
Look at the class FeatureValueEncoder. The test cases show most of the ways that is used. Also the class TrainNewsGroups in examples. See chapters 14 and 16 of Mahout in Action. The sample server for chapter 16 does encoding like you need. On Fri, Jun 24, 2011 at 5:04 PM, Mark

Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.

2011-06-24 Thread Ted Dunning
Pretty big. Should scream for local classifier learning. Local Hadoop should run pretty fast as well. On Fri, Jun 24, 2011 at 5:54 PM, XiaoboGu guxiaobo1...@gmail.com wrote: 32Core, 256G RAM -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Saturday

Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster.

2011-06-24 Thread Ted Dunning
. -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Saturday, June 25, 2011 9:26 AM To: user@mahout.apache.org Cc: d...@mahout.apache.org Subject: Re: Can all the algorithms in Mahout be run locally without a Hadoop cluster. Pretty big. SHould scream

Re: Mahout and Kolt

2011-06-23 Thread Ted Dunning
We changed lots of names as we pulled them over. We also added test cases. The changes at this point are pretty substantial. At the lower level, we changed the way things worked and added new kinds of collections. At the math layer, we pretty massively changed things by adding the ability to

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
Try the QR trick. It is amazingly effective. 2011/6/23 tr...@cs.drexel.edu Alright, thanks guys. The cases where Lanczos or the stochastic projection helps are cases where you have *many* columns but where the data are sparse. If you have a very tall dense matrix, the QR method is to

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
The cases where Lanczos or the stochastic projection helps are cases where you have *many* columns but where the data are sparse. If you have a very tall dense matrix, the QR method is to be muchly preferred. 2011/6/23 tr...@cs.drexel.edu Ok, then what would you think to be the minimum number

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
This method isn't usually as numerically stable as, for instance, using a QR decomposition. If your original data matrix is n x 2, then Q is n x 2 and R is 2 x 2. R is trivial to decompose into U S V' and since Q is a unit matrix, the singular values and right singular vectors of R are your
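
Spelled out, for a tall thin A (n x 2) as in the snippet:

    A = Q R          (Q is n x 2 with orthonormal columns, R is 2 x 2)
    R = U_R S V'     (a trivial 2 x 2 SVD)
    A = (Q U_R) S V'

so S contains the singular values of A, V its right singular vectors, and Q U_R its left singular vectors.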

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
Btw.. the JIRA involved was https://issues.apache.org/jira/browse/MAHOUT-376 On Thu, Jun 23, 2011 at 11:44 AM, Ted Dunning ted.dunn...@gmail.com wrote: If you don't need all 5000 singular values, then you can directly use the stochastic decomposition algorithms in Mahout. If you do want all

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
If you don't need all 5000 singular values, then you can directly use the stochastic decomposition algorithms in Mahout. If you do want all 5000 singular values, then you can probably use all but the first and last few steps of the stochastic decomposition algorithm to get what you need. If you

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
I think that you can do the covariance using Jake's old outer product trick. Of course you need to do something clever to deal with mean subtraction. 2011/6/23 tr...@cs.drexel.edu Yes but a M/R job to create the covariance matrix would be required. With millions of rows that is, unless I am

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
For billions of rows, you can do block-wise QR and get the SVD pretty easily. Also, the distributed matrix times will get you there with slightly less numerical stability. On Thu, Jun 23, 2011 at 12:53 PM, Jake Mannix jake.man...@gmail.com wrote: Well I'm going to pretend for a second that

Re: LanczosSVD and Eigenvalues

2011-06-23 Thread Ted Dunning
Doh. Of course. I have been worried about sparsity so long that mean subtraction causes an autonomic twitch. On Thu, Jun 23, 2011 at 1:25 PM, Jake Mannix jake.man...@gmail.com wrote: with 2 dense matrices being multiplied will it? And it is conceivable that we will have billions of rows

Re: Which is more effective?

2011-06-21 Thread Ted Dunning
I have used the SGD classifiers for content based recommendation. It works out reasonably but the interaction variables can get kind of expensive. Doing it again, I think I would use latent factor log linear models to do the interaction features. See

Re: Which is more effective?

2011-06-21 Thread Ted Dunning
Actually, I should mention that I have done user-feature recommendations and then (mis) used text retrieval to pull back items that have features as text. This works reasonably well and is pretty easy to do. You will have to watch out for very common features. On Wed, Jun 22, 2011 at 12:50 AM,

Re: Which is more effective?

2011-06-21 Thread Ted Dunning
only have one feature vector per item. On Jun 21, 2011, at 3:49 PM, Ted Dunning wrote: I have used the SGD classifiers for content based recommendation. It works out reasonably but the interaction variables can get kind of expensive. Doing it again, I think I would use latent factor log

Re: Mahout on Github

2011-06-20 Thread Ted Dunning
Also, github mirrors all apache projects (and apache also provides git mirrors) I have some mahout stuff on github myself. I like to put work in progress there. What project did you see that was deficient? I see all of the live version at https://github.com/apache/mahout

Re: Running Iterative Recursive Least Squares

2011-06-20 Thread Ted Dunning
Sounds like you should invert your loops. These sparse matrices are probably very reasonable for solving each one on a single machine in memory. As such, take a look at the LSMR implementation which is a good implementation of a conjugate gradient-like algorithm that plays nice with sparse data.

Re: Running Iterative Recursive Least Squares

2011-06-20 Thread Ted Dunning
20, 2011 at 9:21 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sounds like you should invert your loops. These sparse matrices are probably very reasonable for solving each one on a single machine in memory. As such, take a look at the LSMR implementation which is a good implementation

Re: Trending patterns

2011-06-19 Thread Ted Dunning
Two things will help in addition to what Josh suggested: a) when looking for items that are trending hot, use the difference in the log rank as a score. For most internetly things, rank is proportional to 1/rate so log rank is -log rate. Refining this slightly to -log (epsilon + 1/rank) makes
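
In symbols (my rendering of the snippet, with rank_prev and rank_now denoting an item's rank in the earlier and the current window):

    score = log(rank_prev) - log(rank_now)

Since rank is roughly proportional to 1/rate, this is approximately log(rate_now) - log(rate_prev); with the refinement, -log(rate) is replaced by -log(epsilon + 1/rank), giving

    score = log(epsilon + 1/rank_now) - log(epsilon + 1/rank_prev)

so an item that jumps from a deep rank to a shallow one gets a large positive score, while items that barely move score near zero.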

Re: LDA

2011-06-18 Thread Ted Dunning
I should add that this would be a cool thing to have if it can be made general enough! On Sat, Jun 18, 2011 at 7:35 PM, Ted Dunning ted.dunn...@gmail.com wrote: I don't think that the current LDA could be misused this way, but I wouldn't be surprised if the current variational code could

Re: Classification beginner questions

2011-06-16 Thread Ted Dunning
randomly permute things. On Wed, Jun 15, 2011 at 2:50 PM, Ted Dunning ted.dunn...@gmail.com wrote: It is already in Mahout, I think. On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog goks...@gmail.com wrote: Coding a permutation like this in Map/Reduce is a good beginner exercise

Re: Probabilities in Bayesian classifier

2011-06-16 Thread Ted Dunning
estimations using time series and that's why I would like to know if they are critical for me. Many thanks and best regards, Svetlomir. On 15.06.2011 20:44, Ted Dunning wrote: This is why the term Naive is used in the name. The scores for this kind of algorithm are 0 to 1 or are logarithms

Re: Probabilities in Bayesian classifier

2011-06-16 Thread Ted Dunning
I should add that the regularization will also make the logistic regression classifier a little bit conservative about estimating probabilities near 0 or near 1. On Fri, Jun 17, 2011 at 12:19 AM, Ted Dunning ted.dunn...@gmail.com wrote: The problem is that logistic regression makes some

Re: tf-idf + svd + cosine similarity

2011-06-15 Thread Ted Dunning
The normal terminology is to name U and V in SVD as singular vectors as opposed to eigenvectors. The term eigenvectors is normally reserved for the symmetric case of U S U' (more generally, the Hermitian case, but we only support real values). On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov

Re: tf-idf + svd + cosine similarity

2011-06-15 Thread Ted Dunning
something in mahout about this. Best, Fernando. 2011/6/15 Ted Dunning ted.dunn...@gmail.com The normal terminology is to name U and V in SVD as singular vectors as opposed to eigenvectors. The term eigenvectors is normally reserved

Re: Probabilities in Bayesian classifier

2011-06-15 Thread Ted Dunning
This is why the term Naive is used in the name. The scores for this kind of algorithm are 0 to 1 or are logarithms of such a number, but are not at all calibrated probabilities. And, frankly, it is rare in practice for the output of logistic regression to be calibrated either. Those outputs

Re: Classification beginner questions

2011-06-15 Thread Ted Dunning
It is already in Mahout, I think. On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog goks...@gmail.com wrote: Coding a permutation like this in Map/Reduce is a good beginner exercise. On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: But the key is that you have

Re: a modified booleanrecommendation strategy with 'likes'

2011-06-15 Thread Ted Dunning
On Wed, Jun 15, 2011 at 9:27 PM, aaron barnes aa...@stasis.org wrote: I'm thinking this still most closely resembles a 'boolean' model, because it's not a matter of the user assigning a rating to every purchase, so we're not looking primarily for users who have given similar ratings to similar
