Sure, 5 mil articles, but the thing I'm trying to figure out is the number of
unique unigrams and bigrams on average per document.  I was hoping it
was around 1000 or more (these are Wikipedia articles, so I would
imagine so...).
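
To make the sizing question concrete, here is a rough sketch of the count I
mean and of the arithmetic behind "multi-billion". The toy document and the
helper name unique_ngrams are made up for illustration, and the 1000 average
is only the number hoped for above, not something measured on the corpus.

```python
def unique_ngrams(tokens, max_n=2):
    """Return the set of distinct n-grams (n = 1..max_n) in one tokenized document."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


# Toy document, purely illustrative (not taken from the Wikipedia dump).
doc = "the cat sat on the mat and the cat slept".split()
per_doc = len(unique_ngrams(doc))  # distinct unigrams plus distinct bigrams

# Back-of-the-envelope number of non-zero entries in the ngram-doc matrix.
num_docs = 5_000_000        # "5 mil articles" from the thread
assumed_avg_unique = 1000   # ASSUMPTION: the hoped-for average, not a measurement
estimated_nonzeros = num_docs * assumed_avg_unique

print(per_doc, estimated_nonzeros)
```

If the real average comes out near 1000, that is about 5 billion non-zero
entries; if it is closer to 100, we stay well under a billion.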

We certainly have ways of generating interesting auxiliary data sets
bigger than this one (but the ngram-ngram and doc-doc matrices
are pretty artificial, because doing SVD on the ngram-doc matrix gets
the eigenvectors of both of these other matrices anyways), if we
need to.
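
A small numerical check of the SVD remark, as a sketch only: the matrix here
is random toy data standing in for the ngram-doc matrix, and NumPy is my
choice for illustration, not what the job actually runs. The left singular
vectors of A should be eigenvectors of A A^T (ngram-ngram) and the right
singular vectors eigenvectors of A^T A (doc-doc), both with eigenvalues equal
to the squared singular values.

```python
import numpy as np

# Toy stand-in for the ngram-by-doc matrix: 6 "ngrams" x 4 "docs".
rng = np.random.default_rng(0)
A = rng.random((6, 4))

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

ngram_ngram = A @ A.T   # 6 x 6 co-occurrence-style matrix
doc_doc = A.T @ A       # 4 x 4 document similarity matrix

# Columns of U are eigenvectors of A A^T with eigenvalues s**2.
left_ok = np.allclose(ngram_ngram @ U, U * s**2)
# Columns of V (rows of Vt) are eigenvectors of A^T A with eigenvalues s**2.
right_ok = np.allclose(doc_doc @ Vt.T, Vt.T * s**2)

print(left_ok, right_ok)
```

So decomposing the two derived matrices separately would not tell us anything
the one SVD does not already give.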

  -jake

On Sat, Feb 27, 2010 at 2:27 PM, Robin Anil <robin.a...@gmail.com> wrote:

> Like I said, only 5 mil articles. Maybe you can generate a co-occurrence
> matrix :) every ngram to every other ngram :) Sounds fun? It will be HUGE!
>
>
> On Sun, Feb 28, 2010 at 3:43 AM, Jake Mannix <jake.man...@gmail.com>
> wrote:
>
> > 15GB of tokenized documents, not bad, not bad.  We're not going
> > to get a multi-billion entry matrix out of this though, are we?
> >
> >  -jake
> >
> > On Sat, Feb 27, 2010 at 2:06 PM, Robin Anil <robin.a...@gmail.com>
> > wrote:
> >
> > > Update:
> > >
> > > In 20 mins the tokenization stage is complete, but it's not evident in the
> > > online UI.
> > > I found it by checking the s3 output folder
> > >
> > > 2010-02-27 21:50  2696826329  s3://robinanil/wikipedia/tokenized-documents/part-00000
> > > 2010-02-27 21:52  2385184391  s3://robinanil/wikipedia/tokenized-documents/part-00001
> > > 2010-02-27 21:52  2458566158  s3://robinanil/wikipedia/tokenized-documents/part-00002
> > > 2010-02-27 21:53  2500213973  s3://robinanil/wikipedia/tokenized-documents/part-00003
> > > 2010-02-27 21:50  2533593862  s3://robinanil/wikipedia/tokenized-documents/part-00004
> > > 2010-02-27 21:54  3580695441  s3://robinanil/wikipedia/tokenized-documents/part-00005
> > > 2010-02-27 22:02           0  s3://robinanil/wikipedia/tokenized-documents_$folder$
> > > 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount/subgrams/_temporary_$folder$
> > > 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount/subgrams_$folder$
> > > 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount_$folder$
> > >
> >
>
