Re: cvb vectordump

2013-04-17 Thread Chris Harrington
So I've got 0.8 now, but I'm running into an error:
../../workspace2/trunk/bin/mahout seqdirectory -i ./contentDataDir/output-content-segment -o ./contentDataDir/sequenced
../../workspace2/trunk/bin/mahout seq2sparse -i ./contentDataDir/sequenced -o ./contentDataDir/sparseVectors --namedVector -

Re: Feature reduction for LibLinear weights

2013-04-17 Thread Ken Krugler
Hi Ted, On Apr 13, 2013, at 8:46pm, Ted Dunning wrote: > On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler > wrote: > >> >> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote: >> >>> The first thing to try is feature hashing to reduce your feature vector >> size. >> >> Unfortunately LibLinear takes f

Re: Boosting User-Based with the user's attributes

2013-04-17 Thread Agata Filiana
Just a thought: when you say to combine the metrics by multiplying them, take for example Sim1 = 0.9 and Sim2 = 0.2. Multiplied, they give 0.18, which is very low even though they are pretty "similar" based on Sim1. How can this problem be tackled? * Agata Filiana Er

Re: Boosting User-Based with the user's attributes

2013-04-17 Thread Sean Owen
If all of your similarities are a product like this, then they're all "low". In a relative sense this is fine. But this is also why I proposed a geometric mean instead. For example the geometric mean of these is about 0.424 and this notion can be extended to include weights as well, which is what m
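
To make the numbers in this exchange concrete, here is a minimal Python sketch. The language and the function name `geometric_mean` are choices made for this illustration and do not come from the thread. It compares the plain product of Sim1 = 0.9 and Sim2 = 0.2 with their geometric mean.

```python
import math

def geometric_mean(similarities):
    """Return the n-th root of the product of n similarity scores."""
    product = math.prod(similarities)
    return product ** (1.0 / len(similarities))

sims = [0.9, 0.2]                 # Sim1 and Sim2 from the thread
product = math.prod(sims)         # plain product, about 0.18
gmean = geometric_mean(sims)      # square root of the product, about 0.424

print(f"product={product:.3f} geometric_mean={gmean:.3f}")
```

The geometric mean stays on the same scale as the individual similarities, whereas the product shrinks with every additional factor.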

Re: Boosting User-Based with the user's attributes

2013-04-17 Thread Agata Filiana
I see, it makes more sense with the geometric mean. And with weights, if I want to apply, say, 70% to Sim1 and 30% to Sim2, would it also make sense to have it like this? The result should be around 0.194. * Agata Filiana Erasmus Mundus DMKM Student 2011-2013 * On 17 April 20
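
The expression this message refers to did not survive in the archive. As a hedged illustration only, the Python sketch below shows two ways the 70%/30% weights could be applied: the standard weighted geometric mean (weights used as exponents), and the geometric mean of the weight-scaled scores, which happens to come out at about 0.194. Both function names are made up for this example, and which form the author actually used is an assumption.

```python
import math

def weighted_geometric_mean(sims, weights):
    """Standard form: weights act as exponents. Requires every sim > 0."""
    total = sum(weights)
    log_sum = sum(w * math.log(s) for s, w in zip(sims, weights))
    return math.exp(log_sum / total)

def geometric_mean_of_scaled(sims, weights):
    """Geometric mean of weight * sim (an assumed reading of the 0.194 figure)."""
    scaled = [w * s for s, w in zip(sims, weights)]
    return math.prod(scaled) ** (1.0 / len(scaled))

sims = [0.9, 0.2]       # Sim1, Sim2
weights = [0.7, 0.3]    # 70% for Sim1, 30% for Sim2

exponent_form = weighted_geometric_mean(sims, weights)   # about 0.573
scaled_form = geometric_mean_of_scaled(sims, weights)    # about 0.194

print(f"exponent_form={exponent_form:.3f} scaled_form={scaled_form:.3f}")
```

Note that the scaled form equals the unweighted geometric mean multiplied by the constant sqrt(0.7 * 0.3), so it lowers every score by the same factor and does not change how pairs rank against each other. The exponent form is the one in which the weights actually shift influence toward Sim1.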

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Ryan Compton
Any ideas where to look? Does anyone get more than 20 mappers when running the 20 news groups data? On Tue, Apr 16, 2013 at 9:04 PM, Robin Anil wrote: > Sounds like a config issue. The MR version should be able to parallelize > based on the size of the input.

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Robin Anil
You won't; it's a tiny amount of data. The number of mappers is determined by the split size and the input shards. Either shard the input into more than 10 pieces or reduce the map split size. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Apr 17, 2013 at 3:32 PM, Ryan Compton wrote: > Any ideas where to
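
A rough way to see why a tiny input yields few mappers: in a simplified model of Hadoop's file-based input splitting, each non-empty input file contributes about ceil(file size / split size) splits, and each split gets one mapper. The Python sketch below encodes only that simplified model. The function name, file sizes and split sizes are made-up illustration values, and real Hadoop applies further rules (for example, non-splittable compressed files and some slack on the last split).

```python
import math

def estimate_mappers(file_sizes_bytes, split_size_bytes):
    """Estimate mapper count: one mapper per split, splits never span files."""
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )

MB = 1024 * 1024

one_file = [20 * MB]          # a single small input file
sharded = [1 * MB] * 20       # the same data spread over 20 files

single = estimate_mappers(one_file, 64 * MB)        # one split for the whole file
more_shards = estimate_mappers(sharded, 64 * MB)    # one split per shard
smaller_split = estimate_mappers(one_file, 2 * MB)  # file cut into 2 MB splits

print(single, more_shards, smaller_split)
```

Both remedies named in the message show up here: adding shards raises the count because splits do not span files, and lowering the split size raises it because each file is cut into more pieces.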

DenseRowMatrix?

2013-04-17 Thread Gokhan Capan
Hi, Using the Mahout Matrix interface, I want to represent some data where the row vector is dense iff an instance is associated with this row index, and empty otherwise. The max possible index for rows (a.k.a. rowSize) is defined. I never query the matrix by column. I want to be able to add rows if the row

Re: DenseRowMatrix?

2013-04-17 Thread Robin Anil
SparseRowMatrix? On Apr 17, 2013 5:26 PM, "Gokhan Capan" wrote: > Hi, > > Using Mahout Matrix interface I want to represent some data where the row > vector is dense iff an instance is associated to this row index, empty > otherwise. The max possible index for rows (a.k.a. rowSize) is defined. >

Re: DenseRowMatrix?

2013-04-17 Thread Gokhan Capan
Robin, Aren't SparseRowMatrix rows sparse vectors? In my use case row vectors don't need to be sparse; they are either full or empty. On Thu, Apr 18, 2013 at 1:32 AM, Robin Anil wrote: > SparseRowMatrix? > On Apr 17, 2013 5:26 PM, "Gokhan Capan" wrote: > > > Hi, > > > > Using Mahout Matri

Re: DenseRowMatrix?

2013-04-17 Thread Robin Anil
Make one? On Apr 17, 2013 5:37 PM, "Gokhan Capan" wrote: > Robin, > > Aren't SparseRowMatrix rows are sparse vectors? In my use case row vectors > don't need to be sparse, they are either full or empty. > > > On Thu, Apr 18, 2013 at 1:32 AM, Robin Anil wrote: > > > SparseRowMatrix? > > On Apr 17

Re: DenseRowMatrix?

2013-04-17 Thread Gokhan Capan
I didn't quite get that; I'm assuming you're telling me to implement it. Thanks On Thu, Apr 18, 2013 at 1:44 AM, Robin Anil wrote: > Make one? > On Apr 17, 2013 5:37 PM, "Gokhan Capan" wrote: > > > Robin, > > > > Aren't SparseRowMatrix rows are sparse vectors? In my use case row > vectors > > don't

Re: DenseRowMatrix?

2013-04-17 Thread Robin Anil
Yes! Yes! Go for it! On Apr 17, 2013 5:52 PM, "Gokhan Capan" wrote: > I didn't quite get that, and assuming you tell me to implement it > > Thanks > > > On Thu, Apr 18, 2013 at 1:44 AM, Robin Anil wrote: > > > Make one? > > On Apr 17, 2013 5:37 PM, "Gokhan Capan" wrote: > > > > > Robin, > > >

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Ryan Compton
Got it, thanks. For some reason I had the impression that Mahout wanted one (splittable) file per label. I ran a test on the 20 news groups where I split soc.religion.christian.txt into several files with arbitrary names before training. Mahout's trainclassifier launched as many mappers as files,

Re: DenseRowMatrix?

2013-04-17 Thread Jake Mannix
SparseMatrix is implemented as a Map; you could modify that class to allow you to choose between dense or sparse rows at construction time. On Wed, Apr 17, 2013 at 4:01 PM, Robin Anil wrote: > Yes! Yes! Go for it!. > On Apr 17, 2013 5:52 PM, "Gokhan Capan" wrote: > > > I didn't quite get that,
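
The idea in this reply, a map from row index to row in which every stored row is dense, can be sketched independently of Mahout. The Python class below is purely illustrative: `DenseRowMap` and its method names are hypothetical and are not part of the Mahout `Matrix` API.

```python
class DenseRowMap:
    """Matrix whose rows are either absent (empty) or stored as dense lists."""

    def __init__(self, max_rows, num_cols):
        self.max_rows = max_rows    # upper bound on row indices (the "rowSize")
        self.num_cols = num_cols
        self._rows = {}             # row index -> dense list of floats

    def _check_index(self, index):
        if not 0 <= index < self.max_rows:
            raise IndexError(f"row {index} outside [0, {self.max_rows})")

    def assign_row(self, index, values):
        """Store a full dense row at the given index, replacing any old row."""
        self._check_index(index)
        if len(values) != self.num_cols:
            raise ValueError("row length must equal num_cols")
        self._rows[index] = [float(v) for v in values]

    def view_row(self, index):
        """Return a copy of the stored row, or a zero row if none is stored."""
        self._check_index(index)
        row = self._rows.get(index)
        return list(row) if row is not None else [0.0] * self.num_cols

    def num_stored_rows(self):
        return len(self._rows)


m = DenseRowMap(max_rows=1000, num_cols=3)
m.assign_row(42, [1, 2, 3])
print(m.view_row(42), m.view_row(7), m.num_stored_rows())
```

Only rows that have been assigned take up memory, and access is by row only, which matches the stated use case of never querying by column.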