G'night! Looking forward to seeing the ngram outputs. It should do well for getting us some Big Data. :)
-jake

On Sat, Feb 27, 2010 at 2:58 PM, Robin Anil <robin.a...@gmail.com> wrote:

> As suspected, bigram generation is taking forever (with the explosion in
> number of key,values thrown around the cluster) (almost 1 hour now and I
> don't have any counter update :( ) That means bigrams might do it for us. If
> by morning I find myself not in debt to Amazon :) you will know what the
> result is.
>
> BTW, I am presenting Mahout at India Hadoop summit tomorrow. So need to get
> up in like 4 hours. So bye
>
> Robin
>
>
> On Sun, Feb 28, 2010 at 3:57 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Like I said, only 5 mil articles. Maybe you can generate a co-occurrence
> > matrix :) every ngram to every other ngram :) Sounds fun? It will be HUGE!
> >
> >
> > On Sun, Feb 28, 2010 at 3:43 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> >> 15GB of tokenized documents, not bad, not bad. We're not going
> >> to get a multi-billion entry matrix out of this though, are we?
> >>
> >> -jake
> >>
> >> On Sat, Feb 27, 2010 at 2:06 PM, Robin Anil <robin.a...@gmail.com> wrote:
> >>
> >> > Update:
> >> >
> >> > In 20 mins the tokenization stage is complete, but it's not evident in
> >> > the online UI.
> >> > I found it by checking the S3 output folder:
> >> >
> >> > 2010-02-27 21:50 2696826329 s3://robinanil/wikipedia/tokenized-documents/part-00000
> >> > 2010-02-27 21:52 2385184391 s3://robinanil/wikipedia/tokenized-documents/part-00001
> >> > 2010-02-27 21:52 2458566158 s3://robinanil/wikipedia/tokenized-documents/part-00002
> >> > 2010-02-27 21:53 2500213973 s3://robinanil/wikipedia/tokenized-documents/part-00003
> >> > 2010-02-27 21:50 2533593862 s3://robinanil/wikipedia/tokenized-documents/part-00004
> >> > 2010-02-27 21:54 3580695441 s3://robinanil/wikipedia/tokenized-documents/part-00005
> >> > 2010-02-27 22:02 0 s3://robinanil/wikipedia/tokenized-documents_$folder$
> >> > 2010-02-27 22:02 0 s3://robinanil/wikipedia/wordcount/subgrams/_temporary_$folder$
> >> > 2010-02-27 22:02 0 s3://robinanil/wikipedia/wordcount/subgrams_$folder$
> >> > 2010-02-27 22:02 0 s3://robinanil/wikipedia/wordcount_$folder$