In this specific scenario, the ability to handle a bigger dictionary per node,
where the dictionary is loaded only once, is a big win for the dictionary
vectorizer. This in turn reduces the number of partial vector generation
passes.


I ran the whole Wikipedia dump and got an 880 MB dictionary. After pruning
words that occur only once in the entire set, I ended up with a 351 MB
dictionary file. On a c1.medium (2-core, 1.7 GB EC2 instance) I had to split it
into chunks of about 180-190 MB each so that a chunk could be loaded into
memory. This added another 1-2 hours to the whole job.
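
Roughly how that size-based split could look, as a sketch (this is not the
actual Mahout code; DictionarySplitter and its methods are made-up names, and
the per-entry estimate mirrors the 8 * ((num_chars * 2 + 45) / 8) heuristic
quoted further down):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DictionarySplitter {

  static long estimateStringBytes(int numChars) {
    // Rough heap footprint of a java.lang.String: UTF-16 chars plus object
    // and char[] headers, rounded down to an 8-byte boundary.
    return 8L * ((numChars * 2 + 45) / 8);
  }

  // Splits a dictionary file (one term per line) into chunk files whose
  // estimated in-memory size stays under maxChunkBytes.
  public static List<String> split(String dictionaryFile, long maxChunkBytes)
      throws IOException {
    List<String> chunkFiles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new FileReader(dictionaryFile));
    try {
      BufferedWriter writer = null;
      long chunkBytes = 0;
      String term;
      while ((term = reader.readLine()) != null) {
        long entryBytes = estimateStringBytes(term.length());
        if (writer == null || chunkBytes + entryBytes > maxChunkBytes) {
          // Roll over to a new chunk file once the estimate exceeds the cap.
          if (writer != null) {
            writer.close();
          }
          String chunkFile = dictionaryFile + ".chunk-" + chunkFiles.size();
          chunkFiles.add(chunkFile);
          writer = new BufferedWriter(new FileWriter(chunkFile));
          chunkBytes = 0;
        }
        writer.write(term);
        writer.newLine();
        chunkBytes += entryBytes;
      }
      if (writer != null) {
        writer.close();
      }
    } finally {
      reader.close();
    }
    return chunkFiles;
  }
}

With maxChunkBytes set to whatever heap the node can spare, each chunk file
then drives one partial vector generation pass.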

Currently the stats are as follows:

20 GB of Wikipedia data in sequence files (uncompressed)
The counting job took 1 hour 20 minutes
The two partial vector generation passes took 2 hours each
Vector merging took about 40 minutes more
Finally it generated a gzip-compressed vectors file of 3.50 GB (which I think
is too large)

Total: 6 hours to run. I could easily have cut the two passes down to one had I
been able to fit the whole dictionary in memory.
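
For illustration only (not actual Mahout code): the number of partial vector
generation passes is just the number of dictionary chunks, so it falls directly
out of how much of the dictionary one node can hold.

public class PassCount {
  public static void main(String[] args) {
    // Numbers from the run above: a 351 MB pruned dictionary, split at
    // roughly 190 MB per chunk so that a chunk fits on a c1.medium.
    long dictionaryBytes = 351L * 1024 * 1024;
    long perNodeBytes = 190L * 1024 * 1024;
    // One partial vector generation pass per dictionary chunk (ceiling division).
    long passes = (dictionaryBytes + perNodeBytes - 1) / perNodeBytes;
    System.out.println(passes); // 2 here; 1 if the whole dictionary fits
  }
}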

Robin



On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies <bimargul...@gmail.com> wrote:

> While I egged Robin on to some extent on this topic by IM, I should
> point out the following.
>
> We run large amounts of text through Java at Basis, and we always use
> String. I have an 8G laptop :-), but there you have it. Anything we do
> in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi
> (UTF-8>UTF-16), so it doesn't make sense for us to optimize this.
> Obviously, compression is an option in various ways, and we could
> imagine some magic containers that optimized string storage in one way
> or the other.
>
> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil <robin.a...@gmail.com> wrote:
> > Currently Java strings use double the space of the characters in them,
> > because it's all UTF-16. A 190 MB dictionary file therefore uses around
> > 600 MB when loaded into a HashMap<String, Integer>. Is there some
> > optimization we could do in terms of storing them while ensuring that
> > Chinese, Devanagari and other characters don't get messed up in the
> > process?
> >
> > Some options Benson suggested were: storing just the byte[] form and
> > adding the option of supplying the hash function in OpenObjectIntHashMap,
> > or even using a UTF-8 string.
> >
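
One way the byte[] idea could look, purely as a sketch: Utf8Term below is a
made-up name, not an existing Mahout class. Any map that keys on
hashCode()/equals(), whether a plain HashMap or presumably something like
OpenObjectIntHashMap, should be able to take such a wrapper. For mostly-ASCII
dictionaries this roughly halves the per-character cost, and because the bytes
are an exact UTF-8 encoding, Chinese, Devanagari and other scripts round-trip
without loss.

import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch only: stores a dictionary term as UTF-8 bytes instead of a String.
public final class Utf8Term {
  private static final Charset UTF8 = Charset.forName("UTF-8");

  private final byte[] bytes;
  private final int hash;

  public Utf8Term(String term) {
    this.bytes = term.getBytes(UTF8);
    this.hash = Arrays.hashCode(bytes);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Utf8Term && Arrays.equals(bytes, ((Utf8Term) o).bytes);
  }

  @Override
  public int hashCode() {
    return hash;
  }

  @Override
  public String toString() {
    return new String(bytes, UTF8);
  }

  public static void main(String[] args) {
    Map<Utf8Term, Integer> dictionary = new HashMap<Utf8Term, Integer>();
    dictionary.put(new Utf8Term("wikipedia"), 0);
    dictionary.put(new Utf8Term("विकिपीडिया"), 1); // Devanagari survives the round trip
    System.out.println(dictionary.get(new Utf8Term("wikipedia"))); // 0
  }
}

Exposing a pluggable hash function on OpenObjectIntHashMap, as suggested above,
would avoid even the wrapper object, at the cost of an API change.
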
> > Or we could leave this alone. I currently estimate the memory requirement
> > for strings using the formula 8 * ((int) (num_chars * 2 + 45) / 8) when
> > generating the dictionary split for the vectorizer.
> >
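For concreteness: with that heuristic a 10-character term comes out as
8 * ((10 * 2 + 45) / 8) = 8 * 8 = 64 bytes; the integer division rounds the
estimate down to an 8-byte boundary, the +45 is presumably a rough allowance
for the String and char[] object headers, and the figure does not include the
HashMap entry or the Integer value that sit on top of it.
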
> > Robin
> >
>
