In this specific scenario, the ability to handle a bigger dictionary per node, where the dictionary is loaded only once, is a big win for the dictionary vectorizer. This in turn reduces the number of partial vector generation passes.
I ran the whole Wikipedia set and got an 880MB dictionary. After pruning words that occur only once in the entire set, I got a 351MB dictionary file. On a c1.medium (2-core, 1.7GB EC2 instance) I had to split it into chunks of about 180-190MB each so that each chunk could be loaded into memory. This added another 1-2 hours to the whole job.

Current stats:
- 20GB of Wikipedia data in sequence files (uncompressed)
- counting job: about 1 hour 20 minutes
- 2 partial vector generation passes: about 2 hours each
- vector merging: about 40 minutes more
- final output: a gzip-compressed vectors file of 3.50GB (which I think is too large)
- total: about 6 hours to run

I could easily cut the two passes down to one if I were able to fit the whole dictionary in memory.

Robin

On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies <bimargul...@gmail.com> wrote:
> While I egged Robin on to some extent on this topic by IM, I should
> point out the following.
>
> We run large amounts of text through Java at Basis, and we always use
> String. I have an 8G laptop :-), but there you have it. Anything we do
> in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi
> (UTF-8>UTF-16), so it doesn't make sense for us to optimize this.
> Obviously, compression is an option in various ways, and we could
> imagine some magic containers that optimized string storage in one way
> or the other.
>
> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil <robin.a...@gmail.com> wrote:
> > Currently Java strings use double the space of the characters in them
> > because it's all UTF-16. A 190MB dictionary file therefore uses around
> > 600MB when loaded into a HashMap<String, Integer>. Is there some
> > optimization we could do in terms of storing them while ensuring that
> > Chinese, Devanagari and other characters don't get messed up in the
> > process?
> >
> > Some options Benson suggested were: storing just the byte[] form and
> > adding the option of supplying the hash function in
> > OpenObjectIntHashMap, or even using a UTF-8 string.
> >
> > Or we could leave this alone. I currently estimate the memory
> > requirement for strings using the formula
> > 8 * ((int) (num_chars * 2 + 45) / 8)
> > when generating the dictionary split for the vectorizer.
> >
> > Robin
> >
>
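As a rough, minimal sketch of the byte[]-key idea discussed in the quoted mail (not Mahout code; the class and method names below are made up for illustration): keep dictionary terms as UTF-8 byte arrays wrapped in a small key object instead of String, so mostly-ASCII terms cost roughly one byte per character instead of two. The thread suggests OpenObjectIntHashMap with a pluggable hash function; for brevity this sketch just uses a plain HashMap, and it also carries the per-entry estimate formula from the quoted mail for comparison.

import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class Utf8Dictionary {

  private static final Charset UTF8 = Charset.forName("UTF-8");

  // Wrapper so byte[] can be used as a hash key (raw arrays use identity
  // equals/hashCode, which is useless for lookups).
  private static final class Utf8Key {
    private final byte[] bytes;

    Utf8Key(String term) {
      this.bytes = term.getBytes(UTF8);
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) o).bytes);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(bytes);
    }
  }

  private final Map<Utf8Key, Integer> termToId = new HashMap<Utf8Key, Integer>();

  public void put(String term, int id) {
    termToId.put(new Utf8Key(term), id);
  }

  public Integer getId(String term) {
    return termToId.get(new Utf8Key(term));
  }

  // Per-entry estimate for a plain String key, matching the formula quoted
  // above: 2 bytes per char plus roughly 45 bytes of object/array overhead,
  // truncated to an 8-byte boundary.
  public static int estimateStringBytes(int numChars) {
    return 8 * ((numChars * 2 + 45) / 8);
  }
}

For a mostly-ASCII dictionary this should cut the key storage roughly in half compared with String keys, at the cost of one UTF-8 encode per lookup, and it leaves Chinese, Devanagari and other non-Latin terms intact since UTF-8 round-trips them without loss.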