Re: Clustering Question

2011-04-06 Thread sarath pr
I am using Netbeans IDE.
I use CanopyDriver.run to create the initial clusters and KMeansDriver.run
to cluster the news articles.
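
For reference, here is a minimal sketch of reading the clusteredPoints output back (addressing the question quoted below). It assumes the Mahout 0.4 layout, where each record is an IntWritable cluster id mapped to a WeightedVectorWritable, and that the vectors were created as NamedVectors so the article ID is recoverable; the path and class wiring are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;

public class DumpClusteredPoints {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // illustrative path; the actual part file name may differ
    Path path = new Path("output/clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
      Vector v = point.getVector();
      if (v instanceof NamedVector) {
        // the name carries the article ID when NamedVectors were used
        System.out.println(((NamedVector) v).getName() + " -> cluster " + clusterId.get());
      }
    }
    reader.close();
  }
}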

On 4/6/11, Grant Ingersoll  wrote:
> What commands are you running to do the actual clustering?
>
>
> On Apr 3, 2011, at 4:27 AM, sarath pr wrote:
>
>> SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>     new Path(inputDir, "documents.seq"), Text.class, Text.class);
>>
>> for (int i = 0; i < s.length; i++) {
>>   writer.append(new Text(s[i][0]), new Text(s[i][1]));
>> }
>> writer.close();
>>
>> Here Text(s[i][0]) is a string value, the ID of a news article, and
>> Text(s[i][1]) is the news article text. I have clustered 100+ news
>> articles like this and I get the output in clusteredPoints/part-m-0. My
>> question: is it possible to extract the article ID (i.e. Text(s[i][0]),
>> which I had appended) and the corresponding cluster ID from the
>> part-m-0 file?
>>
>> Does anyone know?
>>
>> --
>> Thank You..!!
>> Sarath Ramachandran
>> sarath.amr...@gmail.com
>> +919995024287
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
>

-- 
Sent from my mobile device

Thank You..!!
Sarath Ramachandran
sarath.amr...@gmail.com
+919995024287


Re: PFP Growth : ParallelFPGrowth reduce taking a lo--ong time

2011-04-06 Thread Stanley Xu
I think you could also modify the code a little bit to add a
ParallelFPGrowthPartitioner and use multiple reducers to mine the FPGrowth
result. The real mining work is done in the reducer, and without such a
partitioner Hadoop will only use one reducer, which would probably be slow.
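
A minimal sketch of what such a partitioner could look like is below. The class name ParallelFPGrowthPartitioner is hypothetical, and the LongWritable group-id key and generic Writable value are assumptions about the ParallelFPGrowth map output types, so check them against your Mahout version.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: spread FPGrowth group ids across the reducers.
public class ParallelFPGrowthPartitioner extends Partitioner<LongWritable, Writable> {
  @Override
  public int getPartition(LongWritable groupId, Writable value, int numPartitions) {
    return (int) ((groupId.get() & Long.MAX_VALUE) % numPartitions);
  }
}

// Wiring it in (also an assumption about where the job is configured):
// job.setPartitionerClass(ParallelFPGrowthPartitioner.class);
// job.setNumReduceTasks(8);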

Best wishes,
Stanley Xu



On Tue, Apr 5, 2011 at 4:35 AM, Robin Anil  wrote:

> Could you try with the Performance patch on the JIRA issue page 619
>
>
> On Tue, Apr 5, 2011 at 1:14 AM, Vipul Pandey  wrote:
>
> > So I have a new problem now.
> > I have about 3.5M baskets and about a 100K items/features. After
> > successfully completing parallel-counting, grouping and transaction
> sorting
> > in a reasonable amount of time - PFP Growth just gets stuck in the Reduce
> > phase of ParallelFPGrowth.
> > The reducer is running since about 6:30 PM PST Friday, 1st of April -
> (it's
> > beyond noon of 4th April here - i.e. running for more than 65 hours now)
> > and
> > it's still going on.
> >
> > Anyone faced this issue before? (Console logs pasted below)
> > Also,
> > - We don't have anything else running on our cluster and it's a
> reasonably
> > powerful box.
> > - the jobtracker UI shows this status for the reducer : "Processing
> FPTree:
> > FPGrowth Algorithm for a given feature: 49797 > reduce" and the feature
> > keeps changing every minute or two- which shows that it's still doing
> it's
> > job!
> > - no other errors/messages in the logs
> > - config :  -s 100 -k 50 -g 2  - Any recommendations on these? My usual
> > values are "-s 10 -g 10" but that doesn't seem to cut it for me either.
> >
> >
> >
> > Any clue?
> >
> > Thanks
> > Vipul
> >
> >
> > Log :
> >
> > 11/04/01 18:20:07 INFO common.HadoopUtil: Deleting
> >  /run/MR20/fim/parallelcounting
> >
> > 11/04/01 18:20:07 WARN mapred.JobClient: Use GenericOptionsParser for
> > parsing the arguments. Applications should implement Tool for the same.
> >
> > 11/04/01 18:20:08 INFO input.FileInputFormat: Total input paths to
> process
> > :
> > 20
> >
> > 11/04/01 18:20:08 INFO mapred.JobClient: Running job:
> job_201103311244_0045
> >
> > 11/04/01 18:20:09 INFO mapred.JobClient:  map 0% reduce 0%
> >
> > 11/04/01 18:20:20 INFO mapred.JobClient:  map 2% reduce 0%
> >
> > 11/04/01 18:20:21 INFO mapred.JobClient:  map 8% reduce 0%
> >
> > 11/04/01 18:20:22 INFO mapred.JobClient:  map 11% reduce 0%
> >
> > 11/04/01 18:20:23 INFO mapred.JobClient:  map 13% reduce 0%
> >
> > 11/04/01 18:20:24 INFO mapred.JobClient:  map 16% reduce 0%
> >
> > 11/04/01 18:20:25 INFO mapred.JobClient:  map 17% reduce 0%
> >
> > 11/04/01 18:20:26 INFO mapred.JobClient:  map 18% reduce 0%
> >
> > 11/04/01 18:20:27 INFO mapred.JobClient:  map 21% reduce 0%
> >
> > 11/04/01 18:20:28 INFO mapred.JobClient:  map 22% reduce 0%
> >
> > 11/04/01 18:20:29 INFO mapred.JobClient:  map 24% reduce 0%
> >
> > 11/04/01 18:20:30 INFO mapred.JobClient:  map 27% reduce 0%
> >
> > 11/04/01 18:20:31 INFO mapred.JobClient:  map 28% reduce 0%
> >
> > 11/04/01 18:20:32 INFO mapred.JobClient:  map 31% reduce 0%
> >
> > 11/04/01 18:20:34 INFO mapred.JobClient:  map 33% reduce 0%
> >
> > 11/04/01 18:20:35 INFO mapred.JobClient:  map 35% reduce 0%
> >
> > 11/04/01 18:20:36 INFO mapred.JobClient:  map 38% reduce 0%
> >
> > 11/04/01 18:20:37 INFO mapred.JobClient:  map 40% reduce 0%
> >
> > 11/04/01 18:20:38 INFO mapred.JobClient:  map 42% reduce 0%
> >
> > 11/04/01 18:20:39 INFO mapred.JobClient:  map 43% reduce 0%
> >
> > 11/04/01 18:20:40 INFO mapred.JobClient:  map 45% reduce 0%
> >
> > 11/04/01 18:20:41 INFO mapred.JobClient:  map 47% reduce 0%
> >
> > 11/04/01 18:20:42 INFO mapred.JobClient:  map 50% reduce 0%
> >
> > 11/04/01 18:20:44 INFO mapred.JobClient:  map 53% reduce 0%
> >
> > 11/04/01 18:20:45 INFO mapred.JobClient:  map 55% reduce 0%
> >
> > 11/04/01 18:20:47 INFO mapred.JobClient:  map 57% reduce 0%
> >
> > 11/04/01 18:20:48 INFO mapred.JobClient:  map 60% reduce 0%
> >
> > 11/04/01 18:20:49 INFO mapred.JobClient:  map 61% reduce 0%
> >
> > 11/04/01 18:20:50 INFO mapred.JobClient:  map 63% reduce 0%
> >
> > 11/04/01 18:20:51 INFO mapred.JobClient:  map 64% reduce 0%
> >
> > 11/04/01 18:20:52 INFO mapred.JobClient:  map 66% reduce 0%
> >
> > 11/04/01 18:20:53 INFO mapred.JobClient:  map 68% reduce 0%
> >
> > 11/04/01 18:20:54 INFO mapred.JobClient:  map 69% reduce 0%
> >
> > 11/04/01 18:20:55 INFO mapred.JobClient:  map 71% reduce 0%
> >
> > 11/04/01 18:20:56 INFO mapred.JobClient:  map 72% reduce 0%
> >
> > 11/04/01 18:20:57 INFO mapred.JobClient:  map 74% reduce 0%
> >
> > 11/04/01 18:20:58 INFO mapred.JobClient:  map 75% reduce 0%
> >
> > 11/04/01 18:20:59 INFO mapred.JobClient:  map 76% reduce 0%
> >
> > 11/04/01 18:21:00 INFO mapred.JobClient:  map 78% reduce 0%
> >
> > 11/04/01 18:21:01 INFO mapred.JobClient:  map 79% reduce 0%
> >
> > 11/04/01 18:21:03 INFO mapred.JobClient:  map 81% reduce 0%
> >
> > 11/04/01 18:21:06 INFO mapred.JobClient:  map 83% reduce 0%
> >
> > 11/04/01 18:21:07 INFO mapred.JobCli

How I could run Logistic Regression with a word predictor?

2011-04-06 Thread Stanley Xu
Dear all,

I am trying to evaluate if we could use Mahout's Logistic Regression
implementation to predict CTR for a Ad Network.

I ran the command from Mahout in Action (chapter 13) successfully, but I get
an error when I try to introduce a categorical predictor with the following
command, using the built-in example on Mahout 0.4:

mahout trainlogistic --input donut.csv --output ./model --target color
--categories 2 --predictors shape x y --types word numeric --features 20
--passes 100 --rate 50

but I get a NullPointerException:
Running on hadoop, using HADOOP_HOME=/opt/hadoop
No HADOOP_CONF_DIR set, using /opt/hadoop/conf
20
color ~ -1.945*Intercept Term + 0.532*x + 1.304*y
Exception in thread "main" java.lang.NullPointerException
at
org.apache.mahout.classifier.sgd.TrainLogistic.predictorWeight(TrainLogistic.java:138)
at
org.apache.mahout.classifier.sgd.TrainLogistic.main(TrainLogistic.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I am wondering how to fix it.

Also, does Mahout's implementation convert a word/category predictor into
multiple unigram predictors for the logistic regression, or does it use some
other approach?

Thanks.

Best wishes,
Stanley Xu


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Danny Bickson
Did you try to increase the Java heap allocation for the child processes via
mapred.child.java.opts (found in the conf/mapred-site.xml config file)?
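
For example (the heap size below is only illustrative):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>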

Do you mean 60 Million by 60 Million?




On Wed, Apr 6, 2011 at 2:13 AM, Wei Li  wrote:

> Hi Danny:
>
>  I have transformed the csv data into the DistributedRowMatrix format,
> but it still failed due to the memory problem after 2 or 3 iterations.
>
>  my matrix dimension is about 60w * 60w, it is possible to do the svd
> decomposition for this scale using Mahout?
>
> Best
> Wei
>
>
> On Sat, Mar 26, 2011 at 1:43 AM, Danny Bickson wrote:
>
>> Hi Wei,
>> You must verify you use SPARSE matrix and not dense, or else you will
>> surely get out of memory.
>> Take a look at this example:
>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>> On how to prepare the input.
>>
>> Best,
>>
>> Danny Bickson
>>
>>
>> On Fri, Mar 25, 2011 at 1:33 PM, Dmitriy Lyubimov wrote:
>>
>>> Wei,
>>>
>>> 1) i think DenseMatrix is a RAM-only representation. Naturally, you
>>> get OOM because it all has to fit in memory. If you want to run
>>> RAM-only SVD computation, you perhaps don't need Mahout. If you want
>>> to run distributed SVD computations, you need to prepare your data in
>>> what is called DistributedRowMatrix format. This is a sequence file
>>> with keys being whatever key you need to identify your rows, and
>>> values being VectorWritable wrapping either of vector implementations
>>> found in mahout (Dense, sparse sequenctial, sparse random).
>>> 2) Once you've prepared your data in DRM format, you can run either of
>>> SVD algorithms found in Mahout. It can be Lanczos solver ('mahout svd
>>> ... ") or, on the trunk you can also find a stochastic svd method
>>> ('mahout ssvd ...") which is issue MAHOUT-593 i mentioned earlier.
>>>
>>> Either way, I am not sure why you want DenseMatrix unless you want to
>>> use RAM-only Colt SVD solver -- but you certainly don't have to focus
>>> on Mahout implementation of one if you just want a RAM solver.
>>>
>>> -d
>>>
>>> On Fri, Mar 25, 2011 at 3:25 AM, Wei Li  wrote:
>>> >
>>> > Actually, I would like to perform the spectral clustering on a large
>>> scale
>>> > sparse matrix, but it failed due to the OutOfMemory error when creating
>>> the
>>> > DenseMatrix for SVD decomposition.
>>> >
>>> > Best
>>> > Wei
>>> >
>>> > On Fri, Mar 25, 2011 at 4:05 PM, Dmitriy Lyubimov 
>>> wrote:
>>> >>
>>> >> SSVD != Lanczos. if you do PCA or LSI it is perhaps what you need. it
>>> >> can take on these things. Well at least some of my branches can, if
>>> >> not the official patch.
>>> >>
>>> >> -d
>>> >>
>>> >> On Thu, Mar 24, 2011 at 11:09 PM, Wei Li  wrote:
>>> >> >
>>> >> > thanks for your reply
>>> >> >
>>> >> > my matrix is not very dense, a sparse matrix.
>>> >> >
>>> >> > I have tried the svd of Mahout, but failed due to the OutOfMemory
>>> error.
>>> >> >
>>> >> > Best
>>> >> > Wei
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Fri, Mar 25, 2011 at 2:03 PM, Dmitriy Lyubimov <
>>> dlie...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> you can certainly try to write it out into a DRM (distributed row
>>> >> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
>>> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
>>> >> >> values (but if you don't it probably just means you have so much
>>> noise
>>> >> >> that it masks the problem you are trying to solve in your data).
>>> >> >>
>>> >> >> Current committed solution is not most efficient yet, but it should
>>> be
>>> >> >> quite capable.
>>> >> >>
>>> >> >> If you do, let me know how it went.
>>> >> >>
>>> >> >> thanks.
>>> >> >> -d
>>> >> >>
>>> >> >> On Thu, Mar 24, 2011 at 10:59 PM, Dmitriy Lyubimov <
>>> dlie...@gmail.com>
>>> >> >> wrote:
>>> >> >> > Are you sure your matrix is dense?
>>> >> >> >
>>> >> >> > On Thu, Mar 24, 2011 at 9:59 PM, Wei Li 
>>> wrote:
>>> >> >> >> Hi All:
>>> >> >> >>
>>> >> >> >>is it possible to compute the SVD factorization for a 600,000
>>> *
>>> >> >> >> 600,000
>>> >> >> >> matrix using Mahout?
>>> >> >> >>
>>> >> >> >>I have got the OutOfMemory error when creating the
>>> DenseMatrix.
>>> >> >> >>
>>> >> >> >> Best
>>> >> >> >> Wei
>>> >> >> >>
>>> >> >> >
>>> >> >
>>> >> >
>>> >
>>> >
>>>
>>
>>
>


Re: About formatting patches

2011-04-06 Thread Grant Ingersoll
I believe there is one of those in Lucene as well; check the Lucene Wiki.

On Apr 4, 2011, at 10:20 AM, Ted Dunning wrote:

> I would imagine that we could wrassle up an IntelliJ style as well.
> 
> On Mon, Apr 4, 2011 at 12:54 AM, Dawid Weiss
> wrote:
> 
>> There is definitely an Eclipse formatter style inside Lucene's source
>> code that you can import and use for Mahout.
>> 
>> Dawid
>> 
>> On Mon, Apr 4, 2011 at 9:46 AM, Sean Owen  wrote:
>>> That's right, it's just standard Java/Sun convention. When in doubt
>> follow
>>> the surrounding code.
>>> I think there is an Eclipse template in here somewhere that has some of
>> the
>>> basic settings.
>>> 
>>> On Mon, Apr 4, 2011 at 8:44 AM, Sebastian Schelter 
>> wrote:
>>> 
 I always try to adhere to Lucene's conventions, which AFAIK are the same
>> as
 the standard sun code conventions with the difference that a 2-space
>> indent
 is used and lines are allowed to be 120 characters.
 
 --sebastian
 
 
 On 04.04.2011 09:41, Lance Norskog wrote:
 
> There seem to be some discrepancies between the preferences of
> various committers vs. the Eclipse formatting template. Can someone
> please describe a 'Mahout style'?
> 
> 
>>> 
>> 

--
Grant Ingersoll
Lucene Revolution -- Lucene and Solr User Conference
May 25-26 in San Francisco
www.lucenerevolution.org



Re: GSOC Application

2011-04-06 Thread Grant Ingersoll
Can you list what the Stanford dependencies are?  I seem to recall some of them
have licenses that are not compatible with the Apache license.  In other words,
I'd hate to see you put a lot of effort into something that can't be accepted
b/c of legal mumbo-jumbo.  See http://www.apache.org/legal/3party.html

-Grant

On Apr 4, 2011, at 3:53 AM, Harsh wrote:

> I proposed this project during application discussion.
> I want to build an NLP system on top of the Stanford dependencies that can
> understand the theme of a given document by using word counts of nouns and
> related pronouns in a given paragraph.
> 
> I want to know how can I make a formal proposal for GSOC 2011. I will be
> grateful if you could guide me through the proposal process.

--
Grant Ingersoll
Lucene Revolution -- Lucene and Solr User Conference
May 25-26 in San Francisco
www.lucenerevolution.org



Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Ted Dunning
If you did mean 60 million by 60 million, is that matrix sparse?

Also, how many eigenvectors did you ask for?

How large is your machine in terms of memory?

You might also experiment with the random projection version of SVD.
 Dmitriy can comment on
how to run that.

On Wed, Apr 6, 2011 at 3:55 AM, Danny Bickson wrote:

> ...
> Do you mean 60 Million by 60 Million?
>
>
>
>
> On Wed, Apr 6, 2011 at 2:13 AM, Wei Li  wrote:
>
> ...
> >
> >  I have transformed the csv data into the DistributedRowMatrix
> format,
> > but it still failed due to the memory problem after 2 or 3 iterations.
> >
> >  my matrix dimension is about 60w * 60w, it is possible to do the svd
> > decomposition for this scale using Mahout?
> >
>


Re: How I could run Logistic Regression with a word predictor?

2011-04-06 Thread Ted Dunning
Stan,

Yes.  SGD is fine for CTR prediction.

You should upgrade to trunk and read the later chapters.  The packaged
version in chapter 13 is optimized
for the trivial examples there.  You will need to encode your vectors
specifically for your application and I
would strongly recommend that you use the AdaptiveLogisticRegression instead
of the raw
OnlineLogisticRegression.
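
As a rough sketch of what that encoding could look like against the trunk SGD API (the class names AdaptiveLogisticRegression, StaticWordValueEncoder, ContinuousValueEncoder and the package layout here are from memory and worth double-checking; the feature width, prior, and field names are purely illustrative):

import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class CtrTrainingSketch {
  private static final int FEATURES = 1000;  // illustrative hashed-feature width

  public static void main(String[] args) {
    // 2 target categories (click / no click), hashed vectors of width FEATURES
    AdaptiveLogisticRegression learner =
        new AdaptiveLogisticRegression(2, FEATURES, new L1());

    FeatureVectorEncoder bias = new ConstantValueEncoder("intercept");
    FeatureVectorEncoder shapeEncoder = new StaticWordValueEncoder("shape");  // word predictor
    FeatureVectorEncoder xEncoder = new ContinuousValueEncoder("x");          // numeric predictor

    // one hypothetical training example: shape="circle", x=0.7, clicked=1
    Vector v = new RandomAccessSparseVector(FEATURES);
    bias.addToVector("", 1.0, v);
    shapeEncoder.addToVector("circle", v);
    xEncoder.addToVector((byte[]) null, 0.7, v);  // continuous value passed as the weight
    learner.train(1, v);

    // after many examples, score a vector with the best cross-validated learner
    if (learner.getBest() != null) {
      double p = learner.getBest().getPayload().getLearner().classifyScalar(v);
      System.out.println("p(click) = " + p);
    }
  }
}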

On Wed, Apr 6, 2011 at 3:15 AM, Stanley Xu  wrote:

> I run the command in Mahout In Action(chapter 13) successfully, but meet an
> error while try to introduce a categorical predictor by the following
> command with the built-in example on Mahout-0.4
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Jake Mannix
On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov wrote:

> you can certainly try to write it out into a DRM (distributed row
> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
> MAHOUT-593. This is suitable if you have a good decay of singular
> values (but if you don't it probably just means you have so much noise
> that it masks the problem you are trying to solve in your data).
>

You don't need to run it as stochastic, either.  The regular LanczosSolver
will work on this data, if it lives as a DRM.

  -jake
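
For concreteness, a minimal sketch of writing such a DRM-style sequence file (row keys plus VectorWritable-wrapped sparse rows), following Dmitriy's description above; the path, dimensions, and toy values are illustrative, and this is not any particular Mahout utility:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteDrmSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("svd-input/matrix.seq");  // illustrative path

    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, path, IntWritable.class, VectorWritable.class);
    int numCols = 600000;  // illustrative width
    for (int row = 0; row < 3; row++) {            // a few toy rows
      Vector v = new RandomAccessSparseVector(numCols);
      v.setQuick(row, 1.0);                        // toy non-zero entries
      v.setQuick((row * 7919) % numCols, 0.5);
      writer.append(new IntWritable(row), new VectorWritable(v));
    }
    writer.close();
  }
}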


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Dmitriy Lyubimov
Jake, since we are on the topic, what might the running time of Lanczos
be on a ~1GB sequence file input?

On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix  wrote:
>
>
> On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov 
> wrote:
>>
>> you can certainly try to write it out into a DRM (distributed row
>> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
>> MAHOUT-593. This is suitable if you have a good decay of singular
>> values (but if you don't it probably just means you have so much noise
>> that it masks the problem you are trying to solve in your data).
>
> You don't need to run it as stochastic, either.  The regular LanczosSolver
> will work on this data, if it lives as a DRM.
>
>   -jake


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Dmitriy Lyubimov
P.S. for 500 singular values?

On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov  wrote:
> Jake, since we are on the topic, what's the running times of Lanczos
> on a ~1G worth sequence file input might be?
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Jake Mannix
Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
values, requires k passes over the data, and each row which has d non-zero
entries will do d^2 computations in each pass.  So if there are n rows in
the
data set, it's k*n*d^2 if all rows are the same size.

I guess "how long" depends on how big the cluster is!

On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov  wrote:

> Jake, since we are on the topic, what's the running times of Lanczos
> on a ~1G worth sequence file input might be?
>
> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix 
> wrote:
> >
> >
> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov 
> > wrote:
> >>
> >> you can certainly try to write it out into a DRM (distributed row
> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
> >> MAHOUT-593. This is suitable if you have a good decay of singular
> >> values (but if you don't it probably just means you have so much noise
> >> that it masks the problem you are trying to solve in your data).
> >
> > You don't need to run it as stochastic, either.  The regular
> LanczosSolver
> > will work on this data, if it lives as a DRM.
> >
> >   -jake
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Ted Dunning
The key is the k passes.  This bounds the time from below for large values
of k since it typically takes 10's of seconds to light up a map-reduce job.
 Larger clusters can actually be worse for this computation because of that.

On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix  wrote:

> ...  Lanczos-based SVD, for k singular
> values, requires k passes over the data, and each row which has d non-zero
> entries will do d^2 computations in each pass.  ...
>
> I guess "how long" depends on how big the cluster is!
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Dmitriy Lyubimov
So, assuming 500 oversampled singular values is equivalent to perhaps 300
'good' values depending on decay... 300 singular values would then
require 300 passes over the whole input? Or only a sub-part of it?
Given it takes about 20 s just to set up an MR run and 10 s to
confirm its completion, that's about 100-150 minutes
in initialization time alone?

Also, the size of the problem must affect the sorting/shuffle i/o time
(unless all jobs are map-only, but I don't think they can be). That's
at least proportional to the size of the input, so I guess
problem size does matter, not just the number of available slots for the
mappers.


On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix  wrote:
> Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
> values, requires k passes over the data, and each row which has d non-zero
> entries will do d^2 computations in each pass.  So if there are n rows in
> the
> data set, it's k*n*d^2 if all rows are the same size.
> I guess "how long" depends on how big the cluster is!
>
> On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov  wrote:
>>
>> Jake, since we are on the topic, what's the running times of Lanczos
>> on a ~1G worth sequence file input might be?
>>
>> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix 
>> wrote:
>> >
>> >
>> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov 
>> > wrote:
>> >>
>> >> you can certainly try to write it out into a DRM (distributed row
>> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
>> >> MAHOUT-593. This is suitable if you have a good decay of singular
>> >> values (but if you don't it probably just means you have so much noise
>> >> that it masks the problem you are trying to solve in your data).
>> >
>> > You don't need to run it as stochastic, either.  The regular
>> > LanczosSolver
>> > will work on this data, if it lives as a DRM.
>> >
>> >   -jake
>
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Ted Dunning
Yes.  It would take a long time.

But, on the other side of the discussion, it is unlikely that any singular
vectors that you get past 20-50 (depends on the problem)
will be anything but elaborate encodings of noise anyway.  For lots of
problems, a very small number of real singular vectors plus
a bunch of random numbers will suffice just as well.  So I wouldn't expect
more than 50 passes would ever be needed.

Lots of people have studied the problem of how performance improves with
larger numbers of reduced dimension, but few have
studied the problem properly by looking at the trade-off singular vectors
versus random vectors.  My guess is that most systems
would work well with no more than 10 singular vectors plus 30-40 random
projections.

On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov  wrote:

> so, assuming 500 oversampled svalues is equivalent to perhaps 300
> 'good' values depending on decay... so 300 singular values would
> require 300 passes over the whole input? or only sub-part of it?
> Given it takes about 20 s just to set up a MR run and 10 sec to
> confirm it's completion, that's just what... about 100-150 minutes
> just in initialization time?
>
> Also, the size of the problem must also affect sorting i/o time
> (unless all jobs are map-only, but i don't think they can be). That's
> kind of at least proportional to the size of the input. so I guess
> problem size does matter, not just the # of available slots for the
> mappers.
>
>
> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix 
> wrote:
> > Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
> > values, requires k passes over the data, and each row which has d
> non-zero
> > entries will do d^2 computations in each pass.  So if there are n rows in
> > the
> > data set, it's k*n*d^2 if all rows are the same size.
> > I guess "how long" depends on how big the cluster is!
> >
> > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov 
> wrote:
> >>
> >> Jake, since we are on the topic, what's the running times of Lanczos
> >> on a ~1G worth sequence file input might be?
> >>
> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix 
> >> wrote:
> >> >
> >> >
> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov  >
> >> > wrote:
> >> >>
> >> >> you can certainly try to write it out into a DRM (distributed row
> >> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
> >> >> values (but if you don't it probably just means you have so much
> noise
> >> >> that it masks the problem you are trying to solve in your data).
> >> >
> >> > You don't need to run it as stochastic, either.  The regular
> >> > LanczosSolver
> >> > will work on this data, if it lives as a DRM.
> >> >
> >> >   -jake
> >
> >
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Dmitriy Lyubimov
SSVD for 50 values would actually take significantly less time than
for 500 -- roughly 500^2/50^2 times faster, I think, as the flops are
mostly in the QR computation. I did not try to run it with such
small k+p values though, not on bigger inputs anyway.

On Wed, Apr 6, 2011 at 11:39 AM, Ted Dunning  wrote:
> Yes.  It would take a long time.
> But, on the other side of the discussion, it is unlikely that any singular
> vectors that you get past 20-50 (depends on the problem)
> will be anything but elaborate encodings of noise anyway.  For lots of
> problems, a very small number of real singular vectors plus
> a bunch of random numbers will suffice just as well.  So I wouldn't expect
> more than 50 passes would ever be needed.
> Lots of people have studied the problem of how performance improves with
> larger numbers of reduced dimension, but few have
> studied the problem properly by looking at the trade-off singular vectors
> versus random vectors.  My guess is that most systems
> would work well with no more than 10 singular vectors plus 30-40 random
> projections.
>
> On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov  wrote:
>>
>> so, assuming 500 oversampled svalues is equivalent to perhaps 300
>> 'good' values depending on decay... so 300 singular values would
>> require 300 passes over the whole input? or only sub-part of it?
>> Given it takes about 20 s just to set up a MR run and 10 sec to
>> confirm it's completion, that's just what... about 100-150 minutes
>> just in initialization time?
>>
>> Also, the size of the problem must also affect sorting i/o time
>> (unless all jobs are map-only, but i don't think they can be). That's
>> kind of at least proportional to the size of the input. so I guess
>> problem size does matter, not just the # of available slots for the
>> mappers.
>>
>>
>> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix 
>> wrote:
>> > Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k
>> > singular
>> > values, requires k passes over the data, and each row which has d
>> > non-zero
>> > entries will do d^2 computations in each pass.  So if there are n rows
>> > in
>> > the
>> > data set, it's k*n*d^2 if all rows are the same size.
>> > I guess "how long" depends on how big the cluster is!
>> >
>> > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov 
>> > wrote:
>> >>
>> >> Jake, since we are on the topic, what's the running times of Lanczos
>> >> on a ~1G worth sequence file input might be?
>> >>
>> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix 
>> >> wrote:
>> >> >
>> >> >
>> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> you can certainly try to write it out into a DRM (distributed row
>> >> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
>> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
>> >> >> values (but if you don't it probably just means you have so much
>> >> >> noise
>> >> >> that it masks the problem you are trying to solve in your data).
>> >> >
>> >> > You don't need to run it as stochastic, either.  The regular
>> >> > LanczosSolver
>> >> > will work on this data, if it lives as a DRM.
>> >> >
>> >> >   -jake
>> >
>> >
>
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Jake Mannix
Of course, for a data set of only 1GB in size, you don't need to map-reduce
it.  You can
use the regular sparse LanczosSolver in memory, and then you don't have to
worry
about this 10's of seconds of startup time.

On Wed, Apr 6, 2011 at 11:25 AM, Ted Dunning  wrote:

> The key is the k passes.  This bounds the time from below for large values
> of k since it typically takes 10's of seconds to light up a map-reduce job.
>  Larger clusters can actually be worse for this computation because of that.
>
> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix wrote:
>
>> ...  Lanczos-based SVD, for k singular
>>
>> values, requires k passes over the data, and each row which has d non-zero
>> entries will do d^2 computations in each pass.  ...
>>
>>
>> I guess "how long" depends on how big the cluster is!
>>
>
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Dmitriy Lyubimov
Sure. I was just throwing in numbers just to get an idea.

On Wed, Apr 6, 2011 at 11:58 AM, Jake Mannix  wrote:
> Of course, for a data set of only 1GB in size, you don't need to map-reduce
> it.  You can
> use the regular sparse LanczosSolver in memory, and then you don't have to
> worry
> about this 10's of seconds of startup time.
>
> On Wed, Apr 6, 2011 at 11:25 AM, Ted Dunning  wrote:
>>
>> The key is the k passes.  This bounds the time from below for large values
>> of k since it typically takes 10's of seconds to light up a map-reduce job.
>>  Larger clusters can actually be worse for this computation because of that.
>>
>> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix 
>> wrote:
>>>
>>> ...  Lanczos-based SVD, for k singular
>>> values, requires k passes over the data, and each row which has d
>>> non-zero
>>> entries will do d^2 computations in each pass.  ...
>>>
>>> I guess "how long" depends on how big the cluster is!
>
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Jake Mannix
On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov  wrote:

> so, assuming 500 oversampled svalues is equivalent to perhaps 300
> 'good' values depending on decay... so 300 singular values would
> require 300 passes over the whole input? or only sub-part of it?
> Given it takes about 20 s just to set up a MR run and 10 sec to
> confirm it's completion, that's just what... about 100-150 minutes
> just in initialization time?
>

In general, yes, DistributedLanczosSolver is dominated by startup
costs for nearly all data sets I've used.


> Also, the size of the problem must also affect sorting i/o time
> (unless all jobs are map-only, but i don't think they can be). That's
>

And they're not map-only; there is a shuffle on every pass, but the
combiners are pretty well utilized, so the shuffle is pretty small.


> kind of at least proportional to the size of the input. so I guess
> problem size does matter, not just the # of available slots for the
> mappers.
>
>
> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix 
> wrote:
> > Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
> > values, requires k passes over the data, and each row which has d
> non-zero
> > entries will do d^2 computations in each pass.  So if there are n rows in
> > the
> > data set, it's k*n*d^2 if all rows are the same size.
> > I guess "how long" depends on how big the cluster is!
> >
> > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov 
> wrote:
> >>
> >> Jake, since we are on the topic, what's the running times of Lanczos
> >> on a ~1G worth sequence file input might be?
> >>
> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix 
> >> wrote:
> >> >
> >> >
> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov  >
> >> > wrote:
> >> >>
> >> >> you can certainly try to write it out into a DRM (distributed row
> >> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
> >> >> values (but if you don't it probably just means you have so much
> noise
> >> >> that it masks the problem you are trying to solve in your data).
> >> >
> >> > You don't need to run it as stochastic, either.  The regular
> >> > LanczosSolver
> >> > will work on this data, if it lives as a DRM.
> >> >
> >> >   -jake
> >
> >
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Jake Mannix
On Wed, Apr 6, 2011 at 12:01 PM, Dmitriy Lyubimov  wrote:

> Sure. I was just throwing in numbers just to get an idea.
>

Yeah, it's just good to remember what "sweet spot" in data set size we're
talking
about.

DistributedLanczos is currently best for very large data sets which aren't
*too* wide (numColumns < a few million), on not terribly huge clusters.

I'm working on a change to allow the numColumns restriction to go away
entirely,
but it'll require even more MR passes over the data (still O(k), but with
maybe a
factor of 4 out in front), because it won't be able to use the current
map-side
join trick, nor will it be able to utilize the "timesSquared" operation.


On Wed, Apr 6, 2011 at 11:58 AM, Jake Mannix  wrote:
> > Of course, for a data set of only 1GB in size, you don't need to
> map-reduce
> > it.  You can
> > use the regular sparse LanczosSolver in memory, and then you don't have
> to
> > worry
> > about this 10's of seconds of startup time.
> >
> > On Wed, Apr 6, 2011 at 11:25 AM, Ted Dunning 
> wrote:
> >>
> >> The key is the k passes.  This bounds the time from below for large
> values
> >> of k since it typically takes 10's of seconds to light up a map-reduce
> job.
> >>  Larger clusters can actually be worse for this computation because of
> that.
> >>
> >> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix 
> >> wrote:
> >>>
> >>> ...  Lanczos-based SVD, for k singular
> >>> values, requires k passes over the data, and each row which has d
> >>> non-zero
> >>> entries will do d^2 computations in each pass.  ...
> >>>
> >>> I guess "how long" depends on how big the cluster is!
> >
> >
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Dmitriy Lyubimov
But I guess if there's evidence that 200 singular values are better than
50, doesn't that translate, in the stochastic world, to 200+300 perhaps
being enough better than 50+90 that it's worth the effort?


On Wed, Apr 6, 2011 at 12:15 PM, Ted Dunning  wrote:
>
>
> On Wed, Apr 6, 2011 at 11:47 AM, Dmitriy Lyubimov  wrote:
>>
>> But with LSI (which is what i use it for) they recommend to get at
>> least about 200 'good' values i think I read it? Just to fit all
>> possible 'soft clusters' which would be approximate but a lot of them
>> sticking in different directions?
>
> That is exactly my point.  They analyzed the performance with 50, 100 and
> 200 singular vectors, but not between
> 20 singular + 180 random vectors.
> The random vectors stick out in different directions.  The issue is whether
> the data can really tell you what good directions are.  I think not.
>>
>> Disclaimer: i havent' analyzed yet the decay on sv's of our data, it
>> would certainly show how soon reasonable is reasonable. I think I saw
>> one of presentations of the authors of that paper where they show a
>> formula to estimate when \sigma_{n}\over\sigma_{n+1} is small enough
>> to be comparable to noise. It was one of ideas i had how to advise on
>> actually useful number of singular values produced, post-run .
>
> The decay of singular values actually tells you very little.  If you were to
> analyze purely random text with
> the same word frequencies, you would see similar decay of singular values.
>  All that the singular values
> tell you is how many singular values/vectors that are required to replicate
> the *training* data to a particular
> level of fidelity.  They say nothing about how well you will be able to
> replicate unseen data and that is
> the only important question.
>
>


Re: is it possible to compute the SVD for a large scale matrix

2011-04-06 Thread Ted Dunning
No.

There is no evidence that 200 singular vectors is better than 50 if the 50
are augmented by 150 random vectors.

Likewise, there is no reason that 200 singular + 300 random would
necessarily be better than 50+90.  The larger the
dimension used, the more the system simply emulates conventional term-based
methods.  The advantage of LSI is
improvement of recall due to smoothing and you get no smoothing with very
high dimension.

Also, the LSI results were for text retrieval.  I don't think that is what
you are doing.

On Wed, Apr 6, 2011 at 12:22 PM, Dmitriy Lyubimov  wrote:

> But I guess if there's an evidence that 200 singular is better than
> 50, isn't that translated in stochastic world that 200+300 is perhaps
> better enough than 50+90 so it's worth the effort?
>
>
> On Wed, Apr 6, 2011 at 12:15 PM, Ted Dunning 
> wrote:
> >
> >
> > On Wed, Apr 6, 2011 at 11:47 AM, Dmitriy Lyubimov 
> wrote:
> >>
> >> But with LSI (which is what i use it for) they recommend to get at
> >> least about 200 'good' values i think I read it? Just to fit all
> >> possible 'soft clusters' which would be approximate but a lot of them
> >> sticking in different directions?
> >
> > That is exactly my point.  They analyzed the performance with 50, 100 and
> > 200 singular vectors, but not between
> > 20 singular + 180 random vectors.
> > The random vectors stick out in different directions.  The issue is
> whether
> > the data can really tell you what good directions are.  I think not.
> >>
> >> Disclaimer: i havent' analyzed yet the decay on sv's of our data, it
> >> would certainly show how soon reasonable is reasonable. I think I saw
> >> one of presentations of the authors of that paper where they show a
> >> formula to estimate when \sigma_{n}\over\sigma_{n+1} is small enough
> >> to be comparable to noise. It was one of ideas i had how to advise on
> >> actually useful number of singular values produced, post-run .
> >
> > The decay of singular values actually tells you very little.  If you were
> to
> > analyze purely random text with
> > the same word frequencies, you would see similar decay of singular
> values.
> >  All that the singular values
> > tell you is how many singular values/vectors that are required to
> replicate
> > the *training* data to a particular
> > level of fidelity.  They say nothing about how well you will be able to
> > replicate unseen data and that is
> > the only important question.
> >
> >
>


Kmeans clustering options

2011-04-06 Thread Kate Ericson
Hi all,

I just got the latest update to the first 6 chapters of Mahout In
Action, and it still says that '-r' is an option to k-means
clustering.  I'm working with 0.4, and I'm not seeing it as an option
off of -h.
I'm just looking for a sanity check - can you set the number of
reducers for k-means?

Thanks,

Kate


RE: Kmeans clustering options

2011-04-06 Thread Jeff Eastman
Only using a -Dmapred.reduce.tasks=n parameter. The explicit CLI argument was 
dropped in 0.4. Looks like the book has a typo.
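
For example, something like the following should work (the kmeans option names other than the -D property are from the 0.4 CLI and worth double-checking against -h):

mahout kmeans -Dmapred.reduce.tasks=8 -i vectors -c initial-clusters -o kmeans-output -k 20 -x 10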

-Original Message-
From: moving...@gmail.com [mailto:moving...@gmail.com] On Behalf Of Kate Ericson
Sent: Wednesday, April 06, 2011 5:45 PM
To: user@mahout.apache.org
Subject: Kmeans clustering options

Hi all,

I just got the latest update to the first 6 chapters of Mahout In
Action, and it still says that '-r' is an option to k-means
clustering.  I'm working with 0.4, and I'm not seeing it as an option
off of -h.
I'm just looking for a sanity check - can you set the number of
reducers for k-means?

Thanks,

Kate


Re: Kmeans clustering options

2011-04-06 Thread Kate Ericson
Thanks for the quick reply!

-Kate

On Wed, Apr 6, 2011 at 6:58 PM, Jeff Eastman  wrote:
> Only using a -Dmapred.reduce.tasks=n parameter. The explicit CLI argument was 
> dropped in 0.4. Looks like the book has a typo.
>
> -Original Message-
> From: moving...@gmail.com [mailto:moving...@gmail.com] On Behalf Of Kate 
> Ericson
> Sent: Wednesday, April 06, 2011 5:45 PM
> To: user@mahout.apache.org
> Subject: Kmeans clustering options
>
> Hi all,
>
> I just got the latest update to the first 6 chapters of Mahout In
> Action, and it still says that '-r' is an option to k-means
> clustering.  I'm working with 0.4, and I'm not seeing it as an option
> off of -h.
> I'm just looking for a sanity check - can you set the number of
> reducers for k-means?
>
> Thanks,
>
> Kate
>


Re: Kmeans clustering options

2011-04-06 Thread Ted Dunning
Thanks for the typo catching!

On Wed, Apr 6, 2011 at 6:16 PM, Kate Ericson wrote:

> Thanks for the quick reply!
>
> -Kate
>
> On Wed, Apr 6, 2011 at 6:58 PM, Jeff Eastman  wrote:
> > Only using a -Dmapred.reduce.tasks=n parameter. The explicit CLI argument
> was dropped in 0.4. Looks like the book has a typo.
> >
> > -Original Message-
> > From: moving...@gmail.com [mailto:moving...@gmail.com] On Behalf Of Kate
> Ericson
> > Sent: Wednesday, April 06, 2011 5:45 PM
> > To: user@mahout.apache.org
> > Subject: Kmeans clustering options
> >
> > Hi all,
> >
> > I just got the latest update to the first 6 chapters of Mahout In
> > Action, and it still says that '-r' is an option to k-means
> > clustering.  I'm working with 0.4, and I'm not seeing it as an option
> > off of -h.
> > I'm just looking for a sanity check - can you set the number of
> > reducers for k-means?
> >
> > Thanks,
> >
> > Kate
> >
>


Re: How I could run Logistic Regression with a word predictor?

2011-04-06 Thread Stanley Xu
Thanks a lot, Ted. Will follow your instruction.

Best wishes,
Stanley Xu



On Thu, Apr 7, 2011 at 12:50 AM, Ted Dunning  wrote:

> for the trivial examples there.  You will need to encode your vectors
> specifically for your application and I
>


Re: Check the input files present in cluster

2011-04-06 Thread Madhusudan Joshi
Thank you. Adding the --namedVector parameter made clusterdump return the
members of the cluster.
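
For anyone finding this thread later, the sequence was roughly: re-create the vectors with something like "mahout seq2sparse -i mytest/seqfiles -o mytest/seqdir-sparse --namedVector" (the input path here is illustrative), re-run kmeans on the resulting vectors, and then run the same clusterdump command quoted below.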

On Wed, Apr 6, 2011 at 12:27 PM, Geek Gamer  wrote:

> How are you preparing the vectors? You will get the cluster members if
> these
> are named vectors. you can prepare named vectors from a sequence file using
> $MAHOUT_HOME/bin/mahout seq2sparse
>
> add the parameter --namedVector to the command to create named vectors, the
> same clusterdump command will then yield the members of the clusters.
> Hope this helped.
>
>
> On Wed, Apr 6, 2011 at 9:23 AM, Madhusudan Joshi <
> madhusudanrjo...@gmail.com
> > wrote:
>
> > The command I used to cluster dump is
> >
> > mahout clusterdump -s mytest/kmeans/clusters-1 -p
> > mytest/kmeans/clusteredPoints -d mytest/seqdir-sparse/dictionary.file-0
> -dt
> > sequencefile -n 20 -o Desktop/ClusterDump/Kmeans/cl1.txt
> >
> > I tried the reuters example and then clustered using my sample files. The
> > output of my sample files is
> >
> > CL-0{n=2 c=[article:3.009, first:3.279, third:3.279] r=[first:3.279,
> > third:3.279]}
> >Top Terms:
> >third   =>  3.2787654399871826
> >first   =>  3.2787654399871826
> >article =>  3.0087521076202393
> >Weight:  Point:
> >1.0: [article:3.009, first:6.558]
> >1.0: [article:3.009, third:6.558]
> > VL-1{n=1 c=[article:3.009, second:6.558] r=[article:0.000, first:0.000,
> > fourth:0.000, second:0.000, third:0.000]}
> >Top Terms:
> >second  =>   6.557530879974365
> >article =>  3.0087521076202393
> >Weight:  Point:
> >1.0: [article:3.009, second:6.558]
> > VL-3{n=1 c=[article:3.009, fourth:6.558] r=[article:0.000, first:0.000,
> > fourth:0.000, second:0.000, third:0.000]}
> >Top Terms:
> >fourth  =>   6.557530879974365
> >article =>  3.0087521076202393
> >Weight:  Point:
> >1.0: [article:3.009, fourth:6.558]
> >
> > The output showed the number of documents present in the cluster but did
> > not
> > mention which documents. I need to be able to check which documents are
> > present in any given clusters.
> >
> > On Tue, Apr 5, 2011 at 11:34 PM, Jeff Eastman 
> wrote:
> >
> > > You are going to have to be much more explicit in terms of what command
> > > line invocations you did and what results you got in order for anybody
> to
> > be
> > > able help you much here. Have you tried the clustering examples in the
> > wiki?
> > >
> > > -Original Message-
> > > From: Madhusudan Joshi [mailto:madhusudanrjo...@gmail.com]
> > > Sent: Monday, April 04, 2011 10:23 PM
> > > To: user@mahout.apache.org
> > > Subject: Check the input files present in cluster
> > >
> > > Hi,
> > >
> > > I am new to mahout and trying out clustering. I created a cluster using
> > > kmeans in bash. I want to know which files are present in a given
> > clusters.
> > > I tried looking for it in cluster dumper but didn't find the required
> > > solution. Can anyone help me with this?
> > >
> > > Thanks.
> > >
> > > --
> > > Everything we hear is an opinion, not a fact.
> > > Everything we see is perspective, not the truth.
> > >
> >
> >
> >
> > --
> > Everything we hear is an opinion, not a fact.
> > Everything we see is perspective, not the truth.
> >
>



-- 
Everything we hear is an opinion, not a fact.
Everything we see is perspective, not the truth.