G'night!  Looking forward to seeing the ngram outputs.  It should do well
for getting us some Big Data. :)

  -jake

On Sat, Feb 27, 2010 at 2:58 PM, Robin Anil <robin.a...@gmail.com> wrote:

> As suspected, bigram generation is taking forever (with the explosion in
> the number of key/value pairs thrown around the cluster): almost 1 hour now,
> and I don't have any counter update :( That means bigrams might do it for
> us. If by morning I find myself not in debt to Amazon :) you will know what
> the result is.
>
> BTW, I am presenting Mahout at the India Hadoop Summit tomorrow, so I need
> to get up in like 4 hours. So bye.
>
> Robin
>
>
> On Sun, Feb 28, 2010 at 3:57 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Like I said, only 5 million articles. Maybe you can generate a
> > co-occurrence matrix :) every ngram to every other ngram :) Sounds fun?
> > It will be HUGE!
> >
> >
> >
> > On Sun, Feb 28, 2010 at 3:43 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> >> 15GB of tokenized documents, not bad, not bad.  We're not going
> >> to get a multi-billion entry matrix out of this though, are we?
> >>
> >>  -jake
> >>
> >> On Sat, Feb 27, 2010 at 2:06 PM, Robin Anil <robin.a...@gmail.com> wrote:
> >>
> >> > Update:
> >> >
> >> > In 20 mins the tokenization stage is complete, but it's not evident in
> >> > the online UI. I found it by checking the S3 output folder:
> >> >
> >> > 2010-02-27 21:50  2696826329  s3://robinanil/wikipedia/tokenized-documents/part-00000
> >> > 2010-02-27 21:52  2385184391  s3://robinanil/wikipedia/tokenized-documents/part-00001
> >> > 2010-02-27 21:52  2458566158  s3://robinanil/wikipedia/tokenized-documents/part-00002
> >> > 2010-02-27 21:53  2500213973  s3://robinanil/wikipedia/tokenized-documents/part-00003
> >> > 2010-02-27 21:50  2533593862  s3://robinanil/wikipedia/tokenized-documents/part-00004
> >> > 2010-02-27 21:54  3580695441  s3://robinanil/wikipedia/tokenized-documents/part-00005
> >> > 2010-02-27 22:02           0  s3://robinanil/wikipedia/tokenized-documents_$folder$
> >> > 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount/subgrams/_temporary_$folder$
> >> > 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount/subgrams_$folder$
> >> > 2010-02-27 22:02           0  s3://robinanil/wikipedia/wordcount_$folder$
> >> >
> >>
> >
> >
>
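
For reference, the "15GB" figure quoted in the thread can be checked against the
part-file sizes in the S3 listing. This is only a small sketch: the byte counts
are copied from the listing, while the variable names and the choice of Python
are assumptions made for illustration, not anything used in the thread.

```python
# Byte sizes of the tokenized-documents part files, copied from the S3 listing.
# The dictionary and variable names here are illustrative only.
part_sizes = {
    "part-00000": 2696826329,
    "part-00001": 2385184391,
    "part-00002": 2458566158,
    "part-00003": 2500213973,
    "part-00004": 2533593862,
    "part-00005": 3580695441,
}

total_bytes = sum(part_sizes.values())
total_gib = total_bytes / 2**30   # binary gigabytes (GiB)
total_gb = total_bytes / 10**9    # decimal gigabytes (GB)

print(f"total bytes: {total_bytes}")
print(f"about {total_gib:.2f} GiB, or {total_gb:.2f} GB")
```

The six parts add up to roughly 15 GiB (about 16 GB in decimal units), which
agrees with the rounded number mentioned in the reply.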
