If there is an option for storing keys in compressed form in memory, I am all for exploring it.
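Something along these lines is what I have in mind. Just a sketch, not taken from the current vectorizer code: Utf8Key is a made-up name, and the same key type should also work with the OpenObjectIntHashMap Benson mentions below.

import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: keep dictionary terms as UTF-8 bytes instead of java.lang.String.
// ASCII-heavy terms drop to roughly half the memory, and Chinese/Devanagari
// terms round-trip unchanged because the bytes are the exact UTF-8 encoding.
final class Utf8Key {
  private static final Charset UTF8 = Charset.forName("UTF-8");

  private final byte[] bytes;
  private final int hash;  // cache the hash so lookups don't rescan the bytes

  Utf8Key(String term) {
    this.bytes = term.getBytes(UTF8);
    this.hash = Arrays.hashCode(bytes);
  }

  @Override public int hashCode() { return hash; }

  @Override public boolean equals(Object other) {
    return other instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) other).bytes);
  }

  @Override public String toString() {
    return new String(bytes, UTF8);  // decode lazily, only when the term is needed
  }
}

class DictionaryDemo {
  public static void main(String[] args) {
    // The dictionary becomes Map<Utf8Key, Integer>; an int-valued open hash map
    // would additionally avoid boxing the term ids.
    Map<Utf8Key, Integer> dictionary = new HashMap<Utf8Key, Integer>();
    dictionary.put(new Utf8Key("wikipedia"), 42);
    System.out.println(dictionary.get(new Utf8Key("wikipedia")));  // prints 42
  }
}

The wrapper object adds its own per-key overhead, which eats into the savings for very short terms, so this would need measuring against the real dictionary before committing to it.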
On Sat, Jan 16, 2010 at 7:59 PM, Robin Anil <robin.a...@gmail.com> wrote:
> In this specific scenario, the ability to handle a bigger dictionary per
> node, where the dictionary is loaded once, is a big win for the dictionary
> vectorizer. This in turn reduces the number of partial-vector generation
> passes.
>
> I ran the whole of Wikipedia and got an 880MB dictionary. I pruned words
> which occur only once in the entire set and got a 351MB dictionary file. I
> had to split it on c1.medium (2-core, 1.7GB EC2 instances) at about
> 180-190MB each so that it could be loaded into memory. This added another
> 1-2 hours to the whole job.
>
> Currently the stats are as follows:
>
> 20GB of Wikipedia data in sequence files (uncompressed)
> Counting job took 1:20 (1 hour 20 mins)
> 2 partial-vector generation passes took 2 hours each
> Vector merging took about 40 mins more
> Finally generated a gzip-compressed vectors file of 3.50GB (which I think
> is too large)
>
> Total: 6 hours to run. I could easily have cut the 2 passes down to one
> had I been able to fit the whole dictionary in memory.
>
> Robin
>
>
> On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies
> <bimargul...@gmail.com> wrote:
>
>> While I egged Robin on to some extent on this topic by IM, I should
>> point out the following.
>>
>> We run large amounts of text through Java at Basis, and we always use
>> String. I have an 8G laptop :-), but there you have it. Anything we do
>> in English we do shortly afterwards in Arabic (UTF-8 = UTF-16) and Hanzi
>> (UTF-8 > UTF-16), so it doesn't make sense for us to optimize this.
>> Obviously, compression is an option in various ways, and we could
>> imagine some magic containers that optimize string storage one way
>> or the other.
>>
>> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil <robin.a...@gmail.com> wrote:
>> > Currently Java strings use double the space of the characters in them
>> > because it's all UTF-16. A 190MB dictionary file therefore uses around
>> > 600MB when loaded into a HashMap<String, Integer>. Is there some
>> > optimization we could do in terms of storing them while ensuring that
>> > Chinese, Devanagari and other characters don't get messed up in the
>> > process?
>> >
>> > Some options Benson suggested were: storing just the byte[] form and
>> > adding the option of supplying the hash function in OpenObjectIntHashMap,
>> > or even using a UTF-8 string.
>> >
>> > Or we could leave this alone. I currently estimate the memory
>> > requirement for strings using the formula 8 * ((int) (num_chars * 2 + 45) / 8)
>> > when generating the dictionary split for the vectorizer.
>> >
>> > Robin
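PS: for anyone following along, Robin's estimate works out per key roughly as below. This is only a back-of-the-envelope sketch: the 45 is presumably the assumed fixed String-plus-char[] header overhead, and the class/method names are made up, not from the vectorizer code.

final class DictionaryMemoryEstimate {
  // Per-key estimate for a java.lang.String, matching the formula quoted above:
  // 45 bytes of assumed fixed overhead (String object + backing char[] headers),
  // 2 bytes per UTF-16 char, truncated to the JVM's 8-byte object alignment.
  static long estimatedStringBytes(int numChars) {
    return 8L * ((numChars * 2 + 45) / 8);
  }

  public static void main(String[] args) {
    // A 10-character term: 8 * ((20 + 45) / 8) = 64 bytes, and that is before
    // the HashMap.Entry and the boxed Integer value that sit on top of it.
    System.out.println(estimatedStringBytes(10));  // prints 64
  }
}

That roughly matches the 190MB-on-disk to ~600MB-in-memory ratio Robin mentions above, which is why the byte[] route looks attractive to me.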