Sure, 5 million articles, but what I'm trying to figure out is the average number of unique unigrams and bigrams per document. I was hoping it was around 1000 or more (these are Wikipedia articles, so I would imagine so...).
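For what it's worth, here is a rough sketch of how that average could be measured and turned into a matrix-size estimate. It assumes each document is already available as a plain list of tokens; the function names are made up for illustration and are not part of Mahout or of the job that wrote the tokenized-documents output. Each distinct unigram or bigram in a document contributes one nonzero entry to the ngram-doc matrix, so the number of documents times the per-document average gives the approximate entry count.

```python
from typing import Iterable, List, Set, Tuple


def unique_ngrams(tokens: List[str], max_n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams (n = 1..max_n) in one document."""
    grams: Set[Tuple[str, ...]] = set()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def average_unique_ngrams(docs: Iterable[List[str]], max_n: int = 2) -> float:
    """Mean number of distinct n-grams (unigrams + bigrams by default) per document."""
    total = 0
    count = 0
    for tokens in docs:
        total += len(unique_ngrams(tokens, max_n))
        count += 1
    return total / count if count else 0.0


def estimated_nonzeros(num_docs: int, avg_unique: float) -> int:
    """Approximate nonzero count of the ngram-doc matrix.

    Assumes one nonzero per distinct n-gram per document.
    """
    return int(num_docs * avg_unique)


if __name__ == "__main__":
    # Toy documents standing in for tokenized articles (hypothetical data).
    sample = [
        "the cat sat on the mat".split(),
        "the dog sat".split(),
    ]
    print(average_unique_ngrams(sample))
    # If the hoped-for average of 1000 held across 5 million articles:
    print(estimated_nonzeros(5_000_000, 1000))
```

If the average really is around 1000, 5 million articles would give on the order of 5 billion entries; a much lower average would keep the matrix well short of that.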
We certainly have ways of generating interesting auxiliary data sets bigger than this one (but the ngram-ngram and doc-doc matrices are pretty artificial, because doing SVD on the ngram-doc matrix gets the eigenvectors of both of these other matrices anyways), if we need to.

  -jake

On Sat, Feb 27, 2010 at 2:27 PM, Robin Anil <robin.a...@gmail.com> wrote:

> like i said only 5 mil articles. Maybe you can generate a co-occurrence
> matrix :) every ngram to every other ngram :) Sounds fun? It will be HUGE!
>
>
> On Sun, Feb 28, 2010 at 3:43 AM, Jake Mannix <jake.man...@gmail.com> wrote:
>
> > 15GB of tokenized documents, not bad, not bad. We're not going
> > to get a multi-billion entry matrix out of this though, are we?
> >
> >   -jake
> >
> > On Sat, Feb 27, 2010 at 2:06 PM, Robin Anil <robin.a...@gmail.com> wrote:
> >
> > > Update:
> > >
> > > in 20 mins the tokenization stage is complete But its not evident in the
> > > online UI.
> > > I found it by checking the s3 output folder
> > >
> > > 2010-02-27 21:50 2696826329 s3://robinanil/wikipedia/tokenized-documents/part-00000
> > > 2010-02-27 21:52 2385184391 s3://robinanil/wikipedia/tokenized-documents/part-00001
> > > 2010-02-27 21:52 2458566158 s3://robinanil/wikipedia/tokenized-documents/part-00002
> > > 2010-02-27 21:53 2500213973 s3://robinanil/wikipedia/tokenized-documents/part-00003
> > > 2010-02-27 21:50 2533593862 s3://robinanil/wikipedia/tokenized-documents/part-00004
> > > 2010-02-27 21:54 3580695441 s3://robinanil/wikipedia/tokenized-documents/part-00005
> > > 2010-02-27 22:02          0 s3://robinanil/wikipedia/tokenized-documents_$folder$
> > > 2010-02-27 22:02          0 s3://robinanil/wikipedia/wordcount/subgrams/_temporary_$folder$
> > > 2010-02-27 22:02          0 s3://robinanil/wikipedia/wordcount/subgrams_$folder$
> > > 2010-02-27 22:02          0 s3://robinanil/wikipedia/wordcount_$folder$