15GB of tokenized documents, not bad, not bad. We're not going to get a multi-billion entry matrix out of this though, are we?
-jake

On Sat, Feb 27, 2010 at 2:06 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Update:
>
> In 20 minutes the tokenization stage is complete, but it's not evident in the
> online UI. I found it by checking the s3 output folder:
>
> 2010-02-27 21:50  2696826329  s3://robinanil/wikipedia/tokenized-documents/part-00000
> 2010-02-27 21:52  2385184391  s3://robinanil/wikipedia/tokenized-documents/part-00001
> 2010-02-27 21:52  2458566158  s3://robinanil/wikipedia/tokenized-documents/part-00002
> 2010-02-27 21:53  2500213973  s3://robinanil/wikipedia/tokenized-documents/part-00003
> 2010-02-27 21:50  2533593862  s3://robinanil/wikipedia/tokenized-documents/part-00004
> 2010-02-27 21:54  3580695441  s3://robinanil/wikipedia/tokenized-documents/part-00005
> 2010-02-27 22:02           0  s3://robinanil/wikipedia/tokenized-documents_$folder$
> 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount/subgrams/_temporary_$folder$
> 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount/subgrams_$folder$
> 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount_$folder$
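(For anyone following along: the size check above can also be done programmatically instead of eyeballing an S3 listing. The snippet below is just a minimal sketch using Hadoop's FileSystem API; the s3n:// URI, the class name, and the assumption that AWS credentials are already set in the Hadoop configuration are illustrative, not part of the thread.)

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Prints the size of each part file under an output prefix, plus the total. */
public class ListTokenizedOutput {
  public static void main(String[] args) throws Exception {
    // Hypothetical path; substitute your own bucket/prefix.
    String uri = "s3n://robinanil/wikipedia/tokenized-documents/";
    Configuration conf = new Configuration(); // assumes fs.s3n.* credentials are configured
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    long totalBytes = 0;
    for (FileStatus status : fs.listStatus(new Path(uri))) {
      System.out.println(status.getLen() + "\t" + status.getPath());
      totalBytes += status.getLen();
    }
    System.out.println("total bytes: " + totalBytes);
  }
}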