Re: one vector or many vectors?

2012-11-01 Thread Ted Dunning
Your mileage will vary. It is often helpful to classify small parts of large articles and then somehow deal with these multiple classifications at the full document level. Sometimes it is not helpful, especially if the small parts get too small. Try it both ways. My tendency is to prefer to cla

Re: SGD: Logistic regression package in Mahout

2012-11-01 Thread Ted Dunning
rward for your reply. > >> > > >> > Thanks > >> > Rajesh > >> > > >> > On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam >> > >wrote: > >> > > >> > > Hi Ted, > >> > > > >> > > Thank

Re: one vector or many vectors?

2012-11-01 Thread Ted Dunning
run the extractor on a paragraph is very handy. > > Please correct me if I am wrong. > You are right. It does make things harder. It can also make them better. > > On Thu, Nov 1, 2012 at 9:39 PM, Ted Dunning wrote: > > > Your mileage will vary. > >

Re: Mix of Content Based and Collaborative Filtering

2012-11-01 Thread Ted Dunning
Speaking with no principles in hand at all, I find that it is possible to encode multiple item similarity matrices together in a SolR instance and then do very nice coordinated recommendations from multiple sources of information. Abusing a text retrieval engine this way has only vague basis in th

Re: Mix of Content Based and Collaborative Filtering

2012-11-05 Thread Ted Dunning
between > the > > signals. Only question is how much you can achieve "by hand". Probably > you > > want to somehow learn which weights on the signals perform best. Maybe > this > > blog article by Netflix is a good start > > > > > >

Re: Mix of Content Based and Collaborative Filtering

2012-11-05 Thread Ted Dunning
On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte < johannes.schu...@gmail.com> wrote: > > do you really mean payloads? Because i consider them part of the index as > they are stored per position and can be accessed during scoring. > I had the impression that they were not indexed. They are defi

Re: Mix of Content Based and Collaborative Filtering

2012-11-06 Thread Ted Dunning
On Mon, Nov 5, 2012 at 9:16 PM, Johannes Schulte wrote: > > is it possible you are mixing up payloads and stored fields? The latter > ones are not indexed and can only be used for the top n results. Maybe > we're talking about different things.. > I think I did mix these up. I haven't been acti

Re: Mix of Content Based and Collaborative Filtering

2012-11-06 Thread Ted Dunning
rote: > Maybe I'll try it out to throw the scores away we fought so hard for. > You're right, mixing vector space model score and LLR is questionable > without more sophisticated methods. > Thanks for the answers! > > > > > > On Tue, Nov 6, 2012 at 5:44 PM,

Re: need help on mahout

2012-11-09 Thread Ted Dunning
There is additional confusion typically because supervised and unsupervised methods are commonly used together. For instance, clustering (unsupervised) can be used to generate cluster proximity features that are then used as features for classification (supervised). Another example might be where

Re: Mahout dependency problem with asm-1.3

2012-11-10 Thread Ted Dunning
Why do you have maven.glassfish.org in your repo path? On Fri, Nov 9, 2012 at 7:17 PM, Lance Norskog wrote: > I'm getting this from the current git checkout. There are 301 > (redirections) but there is nothing at the target either. > > Downloading: > https://repository.apache.org/content/reposit

Re: Jobs Hadoop-Mahout: Full Capacity

2012-11-10 Thread Ted Dunning
If you want k-means speed see the new k-means code: https://github.com/tdunning/knn Can you describe your data a bit? On Sat, Nov 10, 2012 at 11:22 AM, pricila rr wrote: > I am running kmeans algorithm. > Increasing the number of tasktrackers and datanodes, increase the speed? > > Thank you > >

Re: MultiNormal distribution radius

2012-11-14 Thread Ted Dunning
Yes. Evil naming. I will patch shortly. On Wed, Nov 14, 2012 at 2:33 AM, Sean Owen wrote: > Ted I am also confused by the naming in this class. What I'd imagine is the > vector of means is called "offset". The variances come in to the picture > via a matrix called "mean". (That's not the covar

Re: MultiNormal distribution radius

2012-11-14 Thread Ted Dunning
On Wed, Nov 14, 2012 at 3:22 AM, Sean Owen wrote: > > Right, I probably want a modified version in my case where I normalize > > the distances somehow. > > > > You can divide the result by any scalar you want and it will still have > non-zero probability of being farther than any given distance d

Re: MultiNormal distribution radius

2012-11-14 Thread Ted Dunning
On Wed, Nov 14, 2012 at 9:48 AM, Sean Owen wrote: > I'm talking about the case here where covariances are 0. The marginals in > each dimension are independent and are normally distributed. Right? > Yes. With no covariance, all of the axes are independent. > What is that matrix connecting the

Re: How to interpret recommendation strength

2012-11-15 Thread Ted Dunning
My own preference (pun intended) is to use log-likelihood score for determining which similarities are non-zero and then use simple frequency weighting such as IDF for weighting the similarities. This doesn't make direct use of cooccurrence frequencies, but it works really well. One reason that
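
The approach described in this message (log-likelihood to decide which similarities are non-zero, then a simple frequency weight such as IDF) can be sketched as follows. This is a minimal illustration in plain Python, not Mahout code: the function names, the `threshold` value, and the exact IDF form are assumptions made for the example.

```python
import math

def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy of a set of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 cooccurrence table:
    # k11 = both items, k12 = A only, k21 = B only, k22 = neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def idf(n_users, n_users_with_item):
    # Hypothetical smoothed IDF; other variants would serve equally well
    return math.log((n_users + 1) / (n_users_with_item + 1))

def similarity_weight(k11, k12, k21, k22, n_users, item_count, threshold=10.0):
    # LLR only decides whether the link exists; IDF supplies the weight.
    # threshold=10.0 is an arbitrary illustrative cutoff.
    if llr(k11, k12, k21, k22) < threshold:
        return 0.0
    return idf(n_users, item_count)
```

Note that the cooccurrence counts enter only through the LLR test, which matches the remark that this scheme does not use cooccurrence frequencies directly as weights.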

Re: How to interpret recommendation strength

2012-11-15 Thread Ted Dunning
> themselves. > > On Nov 15, 2012, at 10:50 AM, Sean Owen wrote: > > That's kind of what it does now... though it weights everything as "1". Not > so smart, but for sparse-ish data is not far off from a smarter answer. > > > On Thu, Nov 15, 2012 at 6:47 PM, Te

Re: Conversion of point numbers to key strings

2012-11-19 Thread Ted Dunning
This looks like it may be an artifact of switching to Lucene 4.0. Grant? On Mon, Nov 19, 2012 at 9:12 AM, Christopher Laux wrote: > Caused by: java.lang.NoSuchFieldError: LUCENE_36 > at > > org.apache.mahout.vectorizer.DefaultAnalyzer.(DefaultAnalyzer.java:34) > ... 11 more > > Any idea

Re: Mahout svd command question

2012-11-22 Thread Ted Dunning
That implementation is deprecated. The SSVD implementation should be used instead. On Thu, Nov 22, 2012 at 9:58 AM, Abramov Pavel wrote: > Hi, > > Here is step by step manual for Lanczos implementation: > > https://cwiki.apache.org/MAHOUT/dimensional-reduction.html > > Pavel > ___

Re: getting started with mahout and kmeans

2012-11-26 Thread Ted Dunning
How many data points are you clustering? How many dimensions? On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal wrote: > Hi, > I'm doing a MSc at Northeastern and I'm working on analyzing some US > election polls with kmeans. > I'm a beginner with both Mahout and Hadoop. I've been reading the docs

Re: getting started with mahout and kmeans

2012-11-27 Thread Ted Dunning
and, in > that case, bash would be too slow, wouldn't it? > > On Tue, Nov 27, 2012 at 12:54 AM, Ted Dunning > wrote: > > How many data points are you clustering? How many dimensions? > > > > On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal < > eduard.g

Re: Mahout SGD - is it really descent?

2012-11-28 Thread Ted Dunning
Robert's analysis is correct. This would be worthy of a comment at the least. On Wed, Nov 28, 2012 at 11:53 AM, Lancaster, Robert (Orbitz) < robert.lancas...@orbitz.com> wrote: > graidentBase is coming from: > double gradientBase = gradient.get(i); > > Prior to that: > Vector gradient = this.gra

Re: Mahout SGD - is it really descent?

2012-11-28 Thread Ted Dunning
+1 On Wed, Nov 28, 2012 at 12:56 PM, Jake Mannix wrote: > or maybe call the variable negativeGradient, instead? >

Re: Visualizing class hierarchy of mahout

2012-11-29 Thread Ted Dunning
IntelliJ does this nicely as well. On Wed, Nov 28, 2012 at 9:21 PM, tuxdna wrote: > Please to be using UMLGraph [1]. It works very nicely. > > /tuxdna > > [1] http://www.umlgraph.org/ > > On Wed, Nov 28, 2012 at 8:12 PM, Ahmet Ylmaz > wrote: > > Hi, > > I'm trying to learn the internals of Mah

Re: How to concatenate Vectors?

2012-11-29 Thread Ted Dunning
The most efficient way is probably to just add them. You can also use assign with a max function. Or you can write a special function if you want the left vector or the right one to have preference. Vector a, b; // method 1 a.plus(b); // method 2 a.assign(b, Functio
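
The three options named in this message (add the vectors, combine them with a max function, or combine them with a function that prefers one side) can be illustrated on sparse vectors stored as plain Python dicts. This is only a sketch of the idea under that assumed representation; `plus` and `merge` here are hypothetical helpers, not the Mahout `Vector` methods.

```python
def plus(a, b):
    # Element-wise sum of two sparse vectors (index -> value dicts)
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out

def merge(a, b, combine=max):
    # Element-wise combination; missing entries count as 0.0
    keys = a.keys() | b.keys()
    return {i: combine(a.get(i, 0.0), b.get(i, 0.0)) for i in keys}

def prefer_left(left, right):
    # Keep the left value unless it is zero
    return left if left != 0.0 else right
```

When the two vectors occupy disjoint index ranges, all three variants give the same result, which is why plain addition can stand in for concatenation in that case.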

Re: How to concatenate Vectors?

2012-11-29 Thread Ted Dunning
the same cardinality, so vector1.plus(vector2) does > not work. > Is there a way to resize a given vector? Sorry I am a complete > Mahout-noob. > > > > -Original Message- > From: Ted Dunning > To: user > Sent: Thu, 29 Nov 2012 11:45 pm >

Re: How to concatenate Vectors?

2012-11-29 Thread Ted Dunning
plus(vector2) > | does not work. > | Is there a way to resize a given vector? Sorry I am a complete > | Mahout-noob. > | > | > | > | -Original Message- > | From: Ted Dunning > | To: user > | Sent: Thu, 29 Nov 2012 11:45 pm > | Subject: Re: How to

Re: Mahout Amazon EMR usage cost

2012-12-03 Thread Ted Dunning
On Mon, Dec 3, 2012 at 3:06 AM, Koobas wrote: > Thank you very much. > The pointer to Myrrix is a very useful piece of information. > Myrrix, however, relies on an iterative sparse matrix factorization to do > PCA. > I want to produce Amazon-like recommendations. > I.e., "70% of users who bough t

Re: Recommender Evaluator

2012-12-03 Thread Ted Dunning
Also, don't make algorithm choices based on small data samples. Bigger data will change the ordering of which algorithms work well. On Mon, Dec 3, 2012 at 10:04 PM, Sean Owen wrote: > You may do better with a latent feature approach -- working in lower > dimensional space won't have the problem

Re: Clustering algorithms

2012-12-04 Thread Ted Dunning
The minhash algorithm itself should work as well with non-English text. It is likely that the input phases where the text is analyzed would not work correctly, however. On Tue, Dec 4, 2012 at 6:05 PM, Varun Thacker wrote: > I'd tried out the MinHash algorithm in mahout using the Reuters data set

Re: Very high average absolute difference score

2012-12-04 Thread Ted Dunning
Bernát I am guessing from the fact that you have accents in your name that you may be in Europe. If so, it is possible that there is a confusion about the decimal point that Mahout uses and the one that you use. Is it possible that you have decimal numbers like 3,1 instead of 3.1? On Tue, Dec 4
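
The decimal-separator mismatch suspected here is easy to check before the data reaches a recommender. The sketch below is plain Python and assumes the ratings contain no thousands separators, so a comma can only be a decimal mark; `parse_rating` is a hypothetical helper name.

```python
def parse_rating(text):
    # Accept both "3.1" and "3,1"; assumes no thousands separators
    cleaned = text.strip().replace(",", ".")
    return float(cleaned)
```

Running the input file through such a normalizer, or simply searching it for commas inside numeric fields, would confirm or rule out this explanation for an unexpectedly high error score.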

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-04 Thread Ted Dunning
What Kate says is good advice. You can have considerable amounts of bias, but you may be telling the model something about the relative cost of errors and that can result in things happening that you don't like. As you noted, your model could have gotten 95% correct by simply saying DON'T CARE to

Re: Mahout Amazon EMR usage cost

2012-12-04 Thread Ted Dunning
Also, you have to separate UI considerations from algorithm considerations. What algorithm populates the recommendations is the recommender algorithm. It has two responsibilities... first, find items that the users will like and second, pick out a variety of less certain items to learn about. It

Re: Mahout Amazon EMR usage cost

2012-12-04 Thread Ted Dunning
On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas wrote: > On 05/12/12 00:53, Ted Dunning wrote: > >> Also, you have to separate UI considerations from algorithm >> considerations. >> What algorithm populates the recommendations is the recommender >> algorithm.

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Ted Dunning
Try the cascaded model. Train the downstream model on data without the don't-care docs or train it on documents that actually get through the upstream model. On Wed, Dec 5, 2012 at 4:50 PM, Raman Srinivasan wrote: > I can exclude the "don't care" cases from the training set. However, the > real

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
How many clusters are you talking about? If you pick a modest number then streaming k-means should work well if it has several times more surrogate points than there are clusters. Also, typically a hyper-cube test works with very small cluster radius. Try 0.1 or 0.01. Otherwise, your clusters o

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Ted Dunning
eful cases may be thrown out at the first level even before I can > sub-class them. What's usually a good approach when less than 5% of the > data is meaningful. > > > On Wed, Dec 5, 2012 at 10:26 AM, Ted Dunning > wrote: > > > Try the cascaded model. Train th

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
> [1] https://gist.github.com/4220406 > [2] > https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104 > [3] https://github.com/dfilimon/knn/issues/1 > > On Thu, Dec 6, 2012 at 1:03 AM, Ted Dun

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
Ahh... this may also be a problem. You should get better results with a Brute searcher here, but a ProjectionSearcher with lots of projections may work well. On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon wrote: > So, yes, it's probably a bug of some kind since I end up with anywhere > between 400

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
: > But the weight referred to is the distance between a centroid and the > mean of a distribution (a cube vertex). > This should still be very small (also BallKMeans gets it right). > > On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning wrote: > > In order to succeed here, SKM will nee

Re: Mahout Amazon EMR usage cost

2012-12-05 Thread Ted Dunning
On Wed, Dec 5, 2012 at 5:29 PM, Koobas wrote: > ... > Now yet another naive question. > Ted is probably going to go ballistic ;) > I hope not. > Assuming that simple overlap methods suck, > is there still a metric that works better than others > (i.e. Tanimoto vs. Jaccard vs something else)? >

Re: Clustering points in a unit hypercube

2012-12-06 Thread Ted Dunning
nk [1]. > > [1] https://github.com/dfilimon/knn/wiki/skm-visualization > > On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning wrote: > > Still not that odd if several clusters are getting squashed. This can > > happen if the threshold increases too high or if the searcher is unable

Re: Remove unused recommenders?

2012-12-06 Thread Ted Dunning
Deprecating is a nice first step to let people know where things are headed. On Thu, Dec 6, 2012 at 4:21 PM, Sebastian Schelter wrote: > The other three recommenders seem to be used almost never, so I'd like > to remove them, however I wouldn't have a problem with keeping them for > any reason.

Re: Cluster: find medoid & its n nearest elements

2012-12-07 Thread Ted Dunning
There isn't a clever way to find the medoid in Mahout. Finding the n nearest elements can be done using a Searcher. The Brute implementation should suffice. On Thu, Dec 6, 2012 at 10:16 AM, Stefan Kreuzer wrote: > Hello, > > when inspecting a cluster of sparse vectors, what is the right way to
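
Since the message says there is no clever medoid method and that a brute-force search suffices for the nearest elements, both steps can be sketched as exhaustive scans. The code below is plain Python over dense tuples with Euclidean distance, which is an assumption for the example; it does not use the Mahout Searcher classes.

```python
import math

def nearest(points, query, n):
    # Brute-force search: sort every point by distance to the query
    return sorted(points, key=lambda p: math.dist(p, query))[:n]

def medoid(points):
    # The medoid is the member with the smallest total distance
    # to all other members; this scan is O(n^2) in cluster size
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))
```

For example, `nearest(cluster, medoid(cluster), 5)` would return the medoid itself followed by its four closest neighbours. The quadratic cost is usually acceptable for a single cluster but not for a whole data set.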

Re: Decision Forest - Partial implementation

2012-12-08 Thread Ted Dunning
There are several approaches that might help: 1) use shared memory via mmap to store the forest. This allows multiple mapper threads to access the same forest. The current Mahout in-memory structure for this is not suitable for shared memory, however. 2) split the forests across many mappers (a

Re: Decision Forest - Partial implementation

2012-12-09 Thread Ted Dunning
for mapping. If they are larger, then you can use MapR's NFS capabilities to present anything in the cluster as a normal file which can then be mapped. > On 12/08/2012 03:43 AM, Ted Dunning wrote: > >> There are several approaches that might help: >> >> 1) use share

Re: Decision Forest - Partial implementation

2012-12-09 Thread Ted Dunning
Yeah... right now you have the full cross product, but one side only has one element so the product is trivial. It isn't that much worse if that side has a few elements. On Sat, Dec 8, 2012 at 9:49 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > #2 Might be a nice general approach

Re: Decision Forest - Partial implementation

2012-12-10 Thread Ted Dunning
Yep. On Sun, Dec 9, 2012 at 11:33 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > Because it uses Java pointers instead of offsets. The mmap'ed structure >> could be mapped into memory at any address and thus must be position >> independent. >> > Okay, I think I get the point here
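
The point quoted in this message, that a structure shared through mmap must use offsets rather than pointers so it works at any address, can be illustrated with a decision tree packed into a flat byte buffer. This is a minimal sketch in Python; the node layout and the names `NODE`, `pack_tree` and `predict` are assumptions for the example and do not describe Mahout's forest format.

```python
import struct

# One node: feature index, threshold or leaf value, left index, right index.
# "<" fixes the byte order and disables padding, so every node is 20 bytes.
NODE = struct.Struct("<idii")
LEAF = -1  # feature index used to mark a leaf

def pack_tree(nodes):
    # Children are referred to by node index, never by memory address
    return b"".join(NODE.pack(*n) for n in nodes)

def predict(buf, x):
    # Walk the tree directly in the buffer; buf may be bytes or an mmap
    i = 0
    while True:
        feature, value, left, right = NODE.unpack_from(buf, i * NODE.size)
        if feature == LEAF:
            return value
        i = left if x[feature] <= value else right
```

Because every reference is relative to the start of the buffer, several processes can map the same file and traverse it without any relocation step.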

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
You are trying to run this job as a single step in an EMR flow. Mahout's command line programs assume that you are running against a live cluster that will hang around (since many mahout steps involve more than one map-reduce). It would probably be best to separate the creation of the cluster (w

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
t that mean that the keep alive is set? > > > > ____ > From: Ted Dunning > To: user@mahout.apache.org; hellen maziku > Sent: Wednesday, December 12, 2012 9:48 AM > Subject: Re: Creating vectors from lucene index on EMR via the CLI > > You are trying to ru

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
> /elastic-mapreduce --create --alive--log-uri > s3n://mahout-output/logs/ --name dict_vectorize > > > doesn't that mean that the keep alive is set? > > > > ____ > From: Ted Dunning > To: user@mahout.apache.org; hellen maziku

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
r on ec2 and perform my tasks? > > > > > ________ > From: Ted Dunning > To: user@mahout.apache.org; hellen maziku > Sent: Wednesday, December 12, 2012 10:56 AM > Subject: Re: Creating vectors from lucene index on EMR via the CLI > > I would still recommend that you switch

Re: Streaming KMeans Text Clustering Concurrency and Advice

2012-12-13 Thread Ted Dunning
On Thu, Dec 13, 2012 at 2:29 PM, Brandon Root wrote: > This is a question regarding the new KNN library that Ted Dunning and Dan > Filimon are working on (as I understand it'll be in Mahout 0.8) so I hope > this is the appropriate list for this question instead of mahout-dev.

Re: Streaming KMeans Text Clustering Concurrency and Advice

2012-12-13 Thread Ted Dunning
What Dan says here is correct. The lack of dependence on k in the current code is definitely a problem. The work-around is to set the maxClusters to the point that the log factor should have grown to. That sucks so we should fix the heuristic sizing along the lines that Dan says. There should s

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-13 Thread Ted Dunning
If your input files are in S3 then the map-reduce steps that mahout spawns can access them without problems. In order to run Mahout programs, you will need to install mahout. There are command line programs in $MAHOUT_HOME/bin that will do what you need. On Thu, Dec 13, 2012 at 10:58 AM, hellen

Re: Using SVD evaluator

2012-12-14 Thread Ted Dunning
If you understand the domain, you can inspect things by hand. That is only of limited utility. You can also test click through if your data comes from a live system and you have set up to do testing. (always set up to do testing!) On Fri, Dec 14, 2012 at 2:40 AM, Sean Owen wrote: > It's still

Re: How large should windowSize should be when setting parameters for AdaptiveLogisticRegression?

2012-12-14 Thread Ted Dunning
I would recommend testing with OnlineLogisticRegression first. The AdaptiveLogisticRegression has a tendency to freeze on sub-optimal parameter values sooner than it should. In any case, the averaging window for ALR should be set fairly long and should be at least 10% of your data set. If your d

Re: How large should windowSize should be when setting parameters for AdaptiveLogisticRegression?

2012-12-14 Thread Ted Dunning
setAveragingWindow(int windowSize) . Would you mind telling more about > > that? Thanks! > > > > > > On Sat, Dec 15, 2012 at 2:44 AM, Ted Dunning >wrote: > > > >> I would recommend testing with OnlineLogisticRegression first. > >> > >>

Re: Replacing Mahout's default Hadoop dependency with my customized Hadoop distribution

2012-12-15 Thread Ted Dunning
Change the pom to refer to your jar as a system dependency and insert the path where your jars are explicitly. http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#System_Dependencies On Sat, Dec 15, 2012 at 12:12 AM, Yunming Zhang wrote: > Hi, > > I have impleme

Re: Run Mahout remotely

2012-12-18 Thread Ted Dunning
One method for dealing with this is to always submit jobs from a machine near or in the cluster. This is a pain because you wind up having to compile twice or transfer the jar occasionally to this machine. The (slightly) good news is that rsync is often quite clever about moving jars incrementall

Re: Is there anyway you could easily make a deep copy of Vector Writable class with hadoop's ReflectionUtils?

2012-12-18 Thread Ted Dunning
You can always just call clone() on the vector inside the VectorWritable. On Tue, Dec 18, 2012 at 5:49 PM, Yunming Zhang wrote: > Hi, > > I have been trying to find a way to make a deep copy of key, value pairs > inside SequenceFileRecordReader as I am implementing a getCurrentKeyCopy() > and get

Re: Replacing Mahout's default Hadoop dependency with my customized Hadoop distribution

2012-12-19 Thread Ted Dunning
>> >> >> But it still doesn't seem to work. when compiling CIMapper, it still >> couldn't find my customized Mapper class that was in >> modified-hadoop-core-1.0.3.jar file, >> >> I am not sure what is the cause of this? any suggest

Re: Mahout for item-item tables

2012-12-21 Thread Ted Dunning
The basic reason that it is common to binarize the relationships is that putting weights on these relationships makes it really easy to over-fit, thus giving you very goofy results. One method for putting weights on these elements is to simply use weight(i,j) = log ((N_rows +1)/(rowSum_i + 1)) lo

Re: Build error

2012-12-23 Thread Ted Dunning
To amplify others' comments, make sure that when you open the Mahout project, you actually just open the pom.xml file at the top level. It is a bad idea to try to do anything like import existing sources because a significant amount of the code is generated, which would make a naively configured. com

Re: Mahout for item-item tables

2012-12-23 Thread Ted Dunning
On Sat, Dec 22, 2012 at 4:33 AM, Kai R. Larsen wrote: > ... > I'm not quite sure that your answer is directly responsive to the > question That would definitely not be the first time that I have missed the point. > ... > 1. Goal is to examine relationship between 250 web pages, so we extract >

Re: Build error

2012-12-23 Thread Ted Dunning
NetBeans works really well > for maven projects. > > On 12/23/2012 04:42 PM, Ted Dunning wrote: > >> To amplify others comments, make sure that when you open the Mahout >> project, you actually just open the pom.xml file at the top-level. It is >> a >> b

Re: Document Classification - Recommended Algorithms?

2012-12-26 Thread Ted Dunning
Do you have thousands of labeled documents for each category? Are the categories groupable into very similar clusters? Do categories come and go? What is high accuracy to you? My first recommendation for text classification always is L_1 regularized logistic regression. Since your training dat

Re: LSA in Mahout

2012-12-26 Thread Ted Dunning
LDA is vastly slower than LSA because LSA can use large scale SVD algorithms. LDA may be better for some applications, but even the fastest implementations tend to be much slower than large scale SVD. The LDA implementations in Mahout are not particularly fast. On Wed, Dec 26, 2012 at 5:01 PM, V

Re: Vectorizing 20 newsgroups

2012-12-27 Thread Ted Dunning
Random low dimensional projections tend to look like normal distributions. This is the law of large numbers at work. I think it is hard to diagnose anything from this. On the other hand, projections against the principal components tend to show more structure. On Thu, Dec 27, 2012 at 11:53 AM,

Re: Click probability prediction using Mahout. From model output to probability

2012-12-27 Thread Ted Dunning
This paper is probably of interest for this problem: http://research.microsoft.com/apps/pubs/default.aspx?id=122779 On Thu, Dec 27, 2012 at 6:14 AM, Johannes Schulte < johannes.schu...@gmail.com> wrote: > Oops, hit enter too early... > > Just wanted to say that those are the two ways I'm thinki

Re: time based price predictions

2012-12-27 Thread Ted Dunning
You have a sort of a regression problem here. Add features of each item if you can. Then add day-of-week, weekend or holiday features. Fit your regression. Can you say the size of your data? On Thu, Dec 27, 2012 at 7:26 AM, Matt Mitchell wrote: > I'm looking for a way to predict prices based
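
The calendar features suggested here (day of week, weekend, holiday) can be derived from a date before fitting any regression. The sketch below is plain Python; the feature names and the caller-supplied `holidays` set are assumptions for the example.

```python
from datetime import date

def calendar_features(day, holidays=frozenset()):
    # One indicator for the day of week (Monday = 0), plus two flags
    features = {f"dow={day.weekday()}": 1.0}
    features["weekend"] = 1.0 if day.weekday() >= 5 else 0.0
    features["holiday"] = 1.0 if day in holidays else 0.0
    return features
```

These would be combined with whatever per-item features are available and passed to the regression as a single feature map.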

Re: Vectorizing 20 newsgroups

2012-12-27 Thread Ted Dunning
I have fixed the vectorizer in knn. Available from [0] as org.apache.mahout.knn.Vectorize20NewsGroups Typical invocation would have these command line options: lic subject false 1000 ../20news-bydate-train These are: - term weighting code. lic = log(tf) * IDF with cosine normalization - w

Re: Vectorizing 20 newsgroups

2012-12-28 Thread Ted Dunning
On Fri, Dec 28, 2012 at 12:35 AM, Dan Filimon wrote: > I have a couple of questions: > - how did you pick 1000 as the dimension of the vectors? > Out of nowhere. Partly motivated by a desire to be able to pull the data into R. > - what is spoking behavior? is it that there seem to be some lin

Re: Document Classification - Recommended Algorithms?

2012-12-28 Thread Ted Dunning
Glad this worked for you. There is a random forest implementation in Mahout as well. That might be helpful at some point. On Fri, Dec 28, 2012 at 5:23 AM, Magesh Sarma wrote: > On Wed, Dec 26, 2012 at 2:20 PM, Ted Dunning > wrote: > > > As an interesting tree-based alternativ

Re: Parallel MapReduce Classification Examples?

2012-12-28 Thread Ted Dunning
On Fri, Dec 28, 2012 at 5:30 PM, Adam Baron wrote: > I'm trying to get familiar with the parallel MapReduce Classification > algorithms offered in Mahout . ... Online Passive Aggressive and Hidden > Markov Models might be > ready to explore as well. I don't think that either of these reall

Re: Maven

2012-12-31 Thread Ted Dunning
Sean has addressed why to use maven for builds. Regarding maven for execution, it is definitely not necessary. It made the support a little easier because we get fewer CLASSPATH questions. It made the users' experience a little uglier. On Mon, Dec 31, 2012 at 6:05 AM, Sloot, Hans-Peter < hans-p

Re: Parallel MapReduce Classification Examples?

2012-12-31 Thread Ted Dunning
if > there any wrapper functions out there to help enable it? Otherwise, it > would feel pretty monotonous to run clustering on every possible label > value. > > Thanks, > Adam > > On Fri, Dec 28, 2012 at 6:54 PM, Ted Dunning wrote: > >> On Fri, De

Re: Seeding k-means with canopy clustering / Filter canopies

2013-01-02 Thread Ted Dunning
Stefan, Have you looked at the k-means work that Dan Filimon and I are doing? On Wed, Jan 2, 2013 at 4:46 PM, Stefan Kreuzer wrote: > I try to seed a k-means clustering with canopy clustering. Problem: > Depending on the choice for t1 and t2, canopy clustering gives me too many > canopies or jus

Re: Seeding k-means with canopy clustering / Filter canopies

2013-01-02 Thread Ted Dunning
compatible with the current clustering API. The algorithms are being tested for quality by Dan Filimon who is also doing the scaling work. On Wed, Jan 2, 2013 at 6:00 PM, Stefan Kreuzer wrote: > Uhm no... where can I look? Sorry > > > > > -Original Message-

Re: Seeding k-means with canopy clustering / Filter canopies

2013-01-03 Thread Ted Dunning
s for the k-means seed? > > > -Original Message- > From: Ted Dunning > To: user > Sent: Thu, 3 Jan 2013 7:01 am > Subject: Re: Seeding k-means with canopy clustering / Filter canopies > > > Bitlets have come into Mahout so far, but the core is in

Re: Updated MIA samples

2013-01-03 Thread Ted Dunning
which version of Hadoop are you using? On Thu, Jan 3, 2013 at 6:26 AM, Robin Chesterman wrote: > Sorry, I didn't notice it was using Hadoop jars already. > > From what I've read, the error I'm getting is a Windows problem with a > nasty workaround: > > http://stackoverflow.com/questions/9755508/p

Re: Seeding k-means with canopy clustering / Filter canopies

2013-01-03 Thread Ted Dunning
On Thu, Jan 3, 2013 at 8:08 AM, Stefan Kreuzer wrote: > But even with a small weight (not sure how to apply that) i still have the > wrong number of centroids, i.e. the wrong k? > I didn't think so. I seem to be confused about what you want. > I imagined something like: > > 1. Do canopy cluste

Re: Updated MIA samples

2013-01-03 Thread Ted Dunning
then used mvn clean install, which I think > downloaded everything for me - in the project libraries there is a > reference to hadoop-core-1.0.4.jar > > > > > On 3 January 2013 16:00, Ted Dunning wrote: > > > which version of Hadoop are you using? > > >

Re: classifier predicting only using beginning subset of time-based feature

2013-01-03 Thread Ted Dunning
On Thu, Jan 3, 2013 at 1:28 PM, sam wu wrote: > Hi, > > Normally classifier does prediction based on the same set of feature used > in training. > What happens if we need to predict only based on some beginning subset of > time-based feature ? > > Say, we have an eCommerce web site, > user transa

Re: Mahout startup errors

2013-01-04 Thread Ted Dunning
The problems are mostly in dependencies. Try building Mahout with the hadoop-0.23 profile. http://maven.apache.org/guides/introduction/introduction-to-profiles.html On Fri, Jan 4, 2013 at 5:43 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > I'm in the middle of an upgrade to CD

Re: HMM - baum welch and hmmpredict

2013-01-06 Thread Ted Dunning
It sounds like you are getting some numerical stability issues with the training program. With HMM's, the most common problem that leads to this is numerical underflow. I haven't looked at this in detail, however, so I can't comment very knowledgeably. It is possible that the current implementat

Re: HMM - baum welch and hmmpredict

2013-01-06 Thread Ted Dunning
On Sun, Jan 6, 2013 at 1:35 PM, wrote: > Hi, > > I've been using the standalone trainer. > > I'll have a look at the log scaled trainers - thanks for the tip! > > Log scaling is absolutely required. Otherwise, you start dealing with numerical underflow amazingly quickly.
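
The underflow mentioned here is easy to reproduce: multiplying a few hundred small probabilities in double precision collapses to zero, while the same quantity in log space stays finite. The sketch below is plain Python with made-up probabilities; `log_sum_exp` is the standard helper for adding probabilities that are stored as logs, as forward and backward recursions need to do.

```python
import math

def log_sum_exp(log_values):
    # log(sum(exp(v))) computed without leaving log space
    m = max(log_values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))

probs = [1e-5] * 100          # illustrative per-step probabilities

naive = 1.0
for p in probs:
    naive *= p                # true value is 1e-500; a double cannot hold it

log_prob = sum(math.log(p) for p in probs)   # stays finite
```

With these numbers `naive` ends up as exactly zero, so any later division or normalization by it produces infinities or NaN, whereas `log_prob` remains an ordinary negative number.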

Re: HMM - baum welch and hmmpredict

2013-01-06 Thread Ted Dunning
On Sun, Jan 6, 2013 at 12:34 PM, wrote: > I think that one of the Mahout algorithms (DF) does use NaN for > "undecidable" > Yes. But I don't think the HMM codes do. > So perhaps there is a long term need to think through the output > semantics of the library? > Yes. And no. Yes, it would b

Re: HMM - baum welch and hmmpredict

2013-01-07 Thread Ted Dunning
a prediction with > > >mahout hmmpredict -o out.txt -m newmodel.mod -l 100 > > >cat out.txt > > >0 0 0 0 0 0 0 0 1 2 2 2 2 0 1 1 1 1 1 1 2 0 1 1 2 2 0 0 1 1 1 1 1 1 1 1 1 > 1 1 1 > 1 2 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 1 1 1 1 1 1 1 2 2 0 1 1 2 2 2 > 2 2 2

Re: alternating least squares

2013-01-08 Thread Ted Dunning
Can you refer to which documentation you are looking at? ALS is more like a block gradient descent SVD than like a QR. There are relationships in each step, but I don't think that they are identical. Others can comment more authoritatively. On Tue, Jan 8, 2013 at 2:55 PM, Koobas wrote: > I am

Re: Mahout RDBMS and sqoop

2013-01-08 Thread Ted Dunning
Different algorithms in Mahout require Hadoop. Or not. It depends on which algorithm you are using. Your question doesn't provide enough information to give you a better answer than that. Can you say what you are trying to do? On Tue, Jan 8, 2013 at 7:29 AM, Piero Giacomelli wrote: > Dear All

Re: alternating least squares

2013-01-08 Thread Ted Dunning
This particular part of the algorithm can be seen as similar to a least squares problem that might normally be solved by QR. I don't think that the updates are quite the same, however. On Tue, Jan 8, 2013 at 3:10 PM, Sebastian Schelter wrote: > This factorization is iteratively refined. In each

Re: alternating least squares

2013-01-08 Thread Ted Dunning
> On Tue, Jan 8, 2013 at 5:27 PM, Ted Dunning wrote: > > This particular part of the algorithm can be seen as similar to a least > > squares problem that might normally be solved by QR. I don't think that > > the updates are quite the same, however. > > >

Re: alternating least squares

2013-01-08 Thread Ted Dunning
Great. On Tue, Jan 8, 2013 at 4:25 PM, Koobas wrote: > On Tue, Jan 8, 2013 at 7:18 PM, Ted Dunning wrote: > > > But is it actually QR of Y? > > > > > Ted, > This is my understanding: > In the process of solving the least squares problem, > you end up invert

Re: machine learning algorithm giving wrong results

2013-01-09 Thread Ted Dunning
This is a regression problem. The regression algorithm available in Mahout is logistic regression. You can force it to solve this problem in two ways. First, you can scale and offset the output by a large enough factor so that the normal 0 to 1 output range is much larger than necessary and the

Re: Representing key value dataset into Mahout vector

2013-01-09 Thread Ted Dunning
Look at the last third of the book, especially chapter 14. One important thing to check is whether your integers represent codes or actually represent numbers. Codes should be encoded as key words. Hashed vector encoding should work quite well. On Wed, Jan 9, 2013 at 10:10 PM, Haddad Said wrot
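
The distinction drawn here between integers that are codes and integers that are real numbers matters for how each field is hashed into the vector. The sketch below is plain Python and only illustrates the idea; `bucket`, `encode`, the `codes` argument and the use of MD5 are assumptions for the example, not Mahout's encoder classes.

```python
import hashlib

def bucket(name, dim):
    # Stable hash of a feature name into one of dim slots
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dim

def encode(record, dim=1000, codes=()):
    # record: key -> integer value; codes: keys whose values are identifiers
    vec = [0.0] * dim
    for key, value in record.items():
        if key in codes:
            # A code is treated as a keyword: hash "key=value", add 1
            vec[bucket(f"{key}={value}", dim)] += 1.0
        else:
            # A true number keeps its magnitude: hash the key, add the value
            vec[bucket(key, dim)] += float(value)
    return vec
```

For instance, `encode({"zip": 94107, "age": 30}, codes={"zip"})` treats the postal code as a category and the age as a quantity.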

Re: machine learning algorithm giving wrong results

2013-01-10 Thread Ted Dunning
Who is the moderator for our lists? On Thu, Jan 10, 2013 at 9:04 AM, Jeff Eastman wrote: > To unsubscribe from this list, send an email to user-unsubscribe@mahout.** > apache.org > > > On 1/10/13 11:14 AM, Walshe, Maurice (RBI-UK) wrote: > >> unsubscribe >> >> -Original Message- >> From:

Re: vector encoding of text documents

2013-01-10 Thread Ted Dunning
It has to do with a few things. First, most classifiers can learn as good or better weights than TF-IDF weighting, but k-means really needs the extra help. Second, there is an issue of when the code was developed. Most of the clustering code was developed before the feature hashing stuff was ava

Re: Difficulty with math.solver.LSRM

2013-01-10 Thread Ted Dunning
Hmm... The LSMRTest code works. So it seems like there is a mismatch somewhere. In debugging that test, it seems that the loop exits via ITERATION_LIMIT which avoids the problematic if statement. This is likely due to the fact that the test is solving a Hilbert matrix which has exceedingly bad

Re: Difficulty with math.solver.LSRM

2013-01-10 Thread Ted Dunning
In the meantime, can you file a JIRA with your sample code? On Thu, Jan 10, 2013 at 5:57 PM, Ted Dunning wrote: > Hmm... > > The LSMRTest code works. So it seems like there is a mismatch somewhere. > > In debugging that test, it seems that the loop exits via ITERATION_LIMIT >

Re: Difficulty with math.solver.LSRM

2013-01-10 Thread Ted Dunning
I filed MAHOUT-1139, added a test case and committed a fix. Let me know if that solves your problem. Thanks for noticing the problem! On Thu, Jan 10, 2013 at 5:58 PM, Ted Dunning wrote: > In the meantime, can you file a JIRA with your sample code? > > > On Thu, Jan 10, 2013 at
