Random forest questions

2011-05-06 Thread Yang Zhang
I've been playing around with the RF implementation and I had a couple questions:
- Does this RF implementation support weighted examples? (If so how do I specify weights?)
- How do I get the RF score (confidence, probability, etc.) of a prediction?
Thanks!

Re: Is any more detailed documentation about the sgd logistic regression example.

2011-05-06 Thread Xiaobo Gu
On Thu, May 5, 2011 at 11:21 PM, Ted Dunning wrote:
> On Thu, May 5, 2011 at 7:48 AM, Xiaobo Gu wrote:
>> On Thu, May 5, 2011 at 10:40 PM, Stanley Xu wrote:
>>> 1. You could use the command line to add shape as category features, it
>>> will hash categoryname=value as the feature and set
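The quoted suggestion describes hashing a `categoryname=value` string into a feature. A minimal sketch of that idea follows. It is not Mahout's encoder: the function names, the vector size of 16, the `color` field, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib


def hashed_index(name: str, value: str, num_features: int) -> int:
    """Map a categorical 'name=value' pair to a slot in a fixed-size vector."""
    key = f"{name}={value}".encode("utf-8")
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    # (Assumption: any stable hash works for the purpose of this sketch.)
    return zlib.crc32(key) % num_features


def encode(record: dict, num_features: int = 16) -> list:
    """Build a hashed vector; each categorical field adds 1.0 to its slot."""
    vector = [0.0] * num_features
    for name, value in record.items():
        vector[hashed_index(name, value, num_features)] += 1.0
    return vector


# "shape" comes from the thread; "color" is a hypothetical second field.
vec = encode({"shape": "round", "color": "red"})
```

Because the vector size is fixed up front, no dictionary of category values is needed; the trade-off is that two different `name=value` strings can collide in the same slot.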

Re: Which maven command to use to put all the binaries into the distribution layout?

2011-05-06 Thread Xiaobo Gu
On Fri, May 6, 2011 at 11:34 PM, Sean Owen wrote:
> I think you'd have to set up release keys and all that to make the package.
> Does "mvn release:prepare" (without -Prelease) do what you want or am
> I crazy? That's ultimately what makes the artifacts. Here's our
> process: https://cwiki.apache.

Re: Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Ted Dunning
Yeah.. that doesn't work at all. You need different analyzers at least and some fields are numeric, some textual. The same words in different fields (usually) need to be considered separately. N-grams raise all kinds of crazy issues. For instance, what does an n-gram of tags mean? Are tags ev
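The point that the same word in different fields usually needs to be treated separately can be illustrated with a small sketch. This is not seq2sparse behavior: the `field_tokens` helper, the sample document, and the whitespace tokenizer (standing in for a real per-field analyzer) are assumptions.

```python
def field_tokens(document: dict) -> list:
    """Prefix each token with its field name so fields do not share features."""
    tokens = []
    for field, text in document.items():
        # Whitespace splitting is a placeholder for a proper analyzer.
        for word in text.lower().split():
            tokens.append(f"{field}:{word}")
    return tokens


# Hypothetical blog article with two textual fields.
doc = {"title": "Java tips", "tags": "java hadoop"}

# Naive concatenation: "java" from the title and from the tags become one term.
concatenated = " ".join(doc.values()).lower().split()

# Field-aware tokens: the two occurrences stay distinct.
separated = field_tokens(doc)
```

With concatenation the term `java` is counted twice as a single feature; with field prefixes, `title:java` and `tags:java` are separate features that can be weighted independently.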

implicit data relative ratings

2011-05-06 Thread Ted Dunning
Here is an interesting paper that claims that implicit rankings based on logging requests for directions are at least as good as explicit ratings and ten times more available. http://www.vldb.org/pvldb/vol4/p290-venetis.pdf My bias in favor of implicit ratings just got stronger.

Re: Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Frank Scholten
Hmm, seems more complex than I thought. I thought of a simple approach where you could configure your own class that concatenated the desired fields into one Text value and have the SequenceFileTokenizerMapper process that value. But this can give unexpected results? I guess it may find incorrect

Re: Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Ted Dunning
This is definitely desirable but is very different from the current tool. My guess is the big difficulty will be describing the vectorization to be done. The hashed representations would make that easier, but still not trivial. Dictionary based methods add multiple dictionary specifications and

Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Frank Scholten
Hi everyone, At the moment seq2sparse can generate vectors from sequence values of type Text. More specifically, SequenceFileTokenizerMapper handles Text values. Would it be useful if seq2sparse could be configured to vectorize value types such as a Blog article with several textual fields like t

Re: Which maven command to use to put all the binaries into the distribution layout?

2011-05-06 Thread Ted Dunning
Which is glued to the package life cycle in Mahout.
On Fri, May 6, 2011 at 9:42 AM, Patrick Angeles wrote:
> You probably want the maven assembly plugin.
>
> On Fri, May 6, 2011 at 12:07 PM, Ted Dunning wrote:
>> Isn't there a mvn package target that is better for this?
>>
>> On Fri, May

Re: Which maven command to use to put all the binaries into the distribution layout?

2011-05-06 Thread Patrick Angeles
You probably want the maven assembly plugin.
On Fri, May 6, 2011 at 12:07 PM, Ted Dunning wrote:
> Isn't there a mvn package target that is better for this?
>
> On Fri, May 6, 2011 at 8:34 AM, Sean Owen wrote:
>> I think you'd have to set up release keys and all that to make the package.

Re: Which maven command to use to put all the binaries into the distribution layout?

2011-05-06 Thread Ted Dunning
Isn't there a mvn package target that is better for this?
On Fri, May 6, 2011 at 8:34 AM, Sean Owen wrote:
> I think you'd have to set up release keys and all that to make the package.
> Does "mvn release:prepare" (without -Prelease) do what you want or am
> I crazy? That's ultimately what makes

Re: Which maven command to use to put all the binaries into the distribution layout?

2011-05-06 Thread Sean Owen
I think you'd have to set up release keys and all that to make the package. Does "mvn release:prepare" (without -Prelease) do what you want or am I crazy? That's ultimately what makes the artifacts. Here's our process: https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Release What are you

Re: Which maven command to use to put all the binaries into the distribution layout?

2011-05-06 Thread Xiaobo Gu
mvn -Prelease prompts me to enter a password ("GPG Passphrase:"), and I can't provide one. How can I build and package the release zip file without running the unit tests? Another question: why does mvn download a lot of files from the Internet while building? Regards, On Mon, Apr 11, 2011 at

Re: MapReduce Stats calculations

2011-05-06 Thread Ted Dunning
yeah... un-re-used re-usable primitives are of little help, but a Mahout big data equivalent of the R summary function would be handy to have. The fact is, we already have the re-usable bits anyway. It is common to want column-wise summaries of big matrices. Useful summaries include: a) moment bas

Re: Transposing a matrix is limited by how large a node is.

2011-05-06 Thread Ted Dunning
If you have the code and would like to contribute it, file a JIRA and attach a patch. It will be interesting to hear how the SVD proceeds. Such a large dense matrix is an unusual target for SVD. Also, it is possible to adapt the R version of random projection to never keep all of the large matri

Re: Transposing a matrix is limited by how large a node is.

2011-05-06 Thread Vincent Xue
Hi Jake, As requested the stats from the job are listed below:
Counter                    Map             Reduce  Total
Job Counters
  Launched reduce tasks    0               0       2
  Rack-local map tasks     0               0       69
  Launched map tasks       0               0       194
  Data-local map tasks     0               0       125
FileSystemCounters
  FILE_BYTES_READ          66,655,795,630  0       66,655,795,630
  HDFS_BYTES_READ          12

Re: MapReduce Stats calculations

2011-05-06 Thread Sean Owen
Hadoop has something like this: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html I find there's a very strong and unfortunate tension between reusability and performance in some cases. Having a discrete stage to compute something li

Re: Transposing a matrix is limited by how large a node is.

2011-05-06 Thread Jake Mannix
On Fri, May 6, 2011 at 6:01 AM, Vincent Xue wrote:
> Dear Mahout Users,
>
> I am using Mahout-0.5-SNAPSHOT to transpose a dense matrix of 55000 x 31000.
> My matrix is stored on the HDFS as a SequenceFile, consuming just about 13 GB.
> When I run the transpose function on my matrix, the

MapReduce Stats calculations

2011-05-06 Thread Grant Ingersoll
MAHOUT-688 has a M/R job to calculate std. deviation for document frequencies so that it can prune noisy words. I'm thinking of making it a bit more generic and adding a stats package to org.apache.mahout.math.hadoop that contains this and other basic stats calculations (mean, variance, sum of
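The statistics mentioned here (mean, variance, standard deviation, sum) can all be computed from partial summaries that are merged pairwise, which is what makes them fit a map/combine/reduce flow. The sketch below is not the MAHOUT-688 code: the `Summary` class and `summarize` helper are hypothetical, and the merge step uses the standard pairwise update for count, mean, and sum of squared deviations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Mergeable partial statistics for one column or one term."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def merge(self, other: "Summary") -> "Summary":
        """Combine two partial summaries without revisiting the raw values."""
        if other.count == 0:
            return self
        if self.count == 0:
            return other
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Summary(n, mean, m2)

    @property
    def total(self) -> float:
        return self.mean * self.count

    @property
    def variance(self) -> float:
        # Population variance; divide by (count - 1) for the sample variance.
        return self.m2 / self.count if self.count else 0.0

    @property
    def std_dev(self) -> float:
        return math.sqrt(self.variance)


def summarize(values) -> Summary:
    """Map side: fold one input split into a single Summary."""
    summary = Summary()
    for value in values:
        summary = summary.merge(Summary(1, float(value), 0.0))
    return summary


# Three hypothetical input splits; each is summarized independently.
splits = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
partials = [summarize(split) for split in splits]

# Reduce side: merge the partial summaries in any order.
result = Summary()
for partial in partials:
    result = result.merge(partial)
```

Tracking the mean and squared deviations, rather than a raw sum of squares, is a design choice that tends to be more numerically stable when values are large relative to their spread.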

Transposing a matrix is limited by how large a node is.

2011-05-06 Thread Vincent Xue
Dear Mahout Users, I am using Mahout-0.5-SNAPSHOT to transpose a dense matrix of 55000 x 31000. My matrix is stored on the HDFS as a SequenceFile, consuming just about 13 GB. When I run the transpose function on my matrix, the function falls over during the reduce phase. With closer inspection,
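For readers unfamiliar with how a row-oriented matrix is transposed in a map/reduce style, the general data flow can be sketched as below. This is a generic illustration and not Mahout's transpose job: the function names are hypothetical, the in-memory dictionary stands in for the shuffle, and the size estimate assumes 8-byte doubles with no per-entry overhead.

```python
from collections import defaultdict


def map_row(row_index, row):
    """Map: emit every cell keyed by its column index."""
    for col_index, value in enumerate(row):
        yield col_index, (row_index, value)


def reduce_column(entries, num_rows):
    """Reduce: assemble all cells of one column into one row of the transpose."""
    out = [0.0] * num_rows
    for row_index, value in entries:
        out[row_index] = value
    return out


def transpose(matrix):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for i, row in enumerate(matrix):
        for col_index, entry in map_row(i, row):
            grouped[col_index].append(entry)
    num_rows = len(matrix)
    return [reduce_column(grouped[j], num_rows) for j in sorted(grouped)]


matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
transposed = transpose(matrix)

# Rough size of the dense matrix from the post, assuming 8 bytes per cell.
dense_bytes = 55_000 * 31_000 * 8
```

In this flow every cell of the matrix passes through the shuffle, so the intermediate data is at least as large as the matrix itself. The `dense_bytes` estimate comes to about 13.6 GB, which is in line with the "just about 13 GB" reported in the post.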