If there is an option of storing keys in compressed form in memory, I am all
for exploring that.
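
For illustration, here is roughly what that could look like: keep dictionary
terms as UTF-8 byte[] behind a small wrapper that supplies equals/hashCode.
Utf8Key is only a hypothetical name for this sketch, not anything that exists
in Mahout today, and the same key type should drop into an
OpenObjectIntHashMap<Utf8Key> just as well as java.util.HashMap:

    import java.nio.charset.Charset;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical wrapper: stores a term as UTF-8 bytes (about 1 byte per
    // character for ASCII-heavy text, versus 2 bytes per character inside a
    // String) and caches the hash so lookups stay cheap.
    final class Utf8Key {
      private static final Charset UTF8 = Charset.forName("UTF-8");
      private final byte[] bytes;
      private final int hash;

      Utf8Key(String term) {
        this.bytes = term.getBytes(UTF8);
        this.hash = Arrays.hashCode(bytes);
      }

      @Override public int hashCode() { return hash; }

      @Override public boolean equals(Object other) {
        return other instanceof Utf8Key
            && Arrays.equals(bytes, ((Utf8Key) other).bytes);
      }

      public static void main(String[] args) {
        // Usage sketch: dictionary lookup keyed by UTF-8 bytes instead of String.
        Map<Utf8Key, Integer> dictionary = new HashMap<Utf8Key, Integer>();
        dictionary.put(new Utf8Key("wikipedia"), 42);
        System.out.println(dictionary.get(new Utf8Key("wikipedia"))); // prints 42
      }
    }

Round-tripping through UTF-8 is lossless, so Chinese, Devanagari, etc. don't
get mangled; the heap saving only shows up for mostly-Latin terms, since (as
Benson notes below) Hanzi actually grows from 2 bytes per character in UTF-16
to 3 in UTF-8.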


On Sat, Jan 16, 2010 at 7:59 PM, Robin Anil <robin.a...@gmail.com> wrote:

> In this specific scenario, the ability to handle a bigger dictionary per
> node, where the dictionary is loaded once, is a big win for the dictionary
> vectorizer. This in turn reduces the number of partial vector generation
> passes.
>
>
> I ran the whole Wikipedia set and got an 880MB dictionary. After pruning
> words which occur only once in the entire set, I got a 351MB dictionary file.
> I had to split it on a c1.medium (2-core, 1.7GB EC2 instance) into chunks of
> about 180-190MB each so that it could be loaded into memory. This added
> another 1-2 hours to the whole job.
>
> Currently the stats are as follows:
>
> 20GB of Wikipedia data in sequence files (uncompressed)
> The counting job took 1 hour 20 mins
> Two partial vector generation passes took 2 hours each
> Vector merging took about 40 mins more
> Finally, it generated a gzip-compressed vectors file of 3.50GB (which I think
> is too large)
>
> Total: 6 hours to run. I could easily have cut the two passes down to one
> pass if I had been able to fit the whole dictionary in memory.
>
> Robin
>
>
>
> On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies <bimargul...@gmail.com>
> wrote:
>
>> While I egged Robin on to some extent on this topic by IM, I should
>> point out the following.
>>
>> We run large amounts of text through Java at Basis, and we always use
>> String. I have an 8G laptop :-), but there you have it. Anything we do
>> in English we do shortly afterwards in Arabic (where UTF-8 and UTF-16 are
>> about the same size) and Hanzi (where UTF-8 is larger than UTF-16), so it
>> doesn't make sense for us to optimize this.
>> Obviously, compression is an option in various ways, and we could
>> imagine some magic containers that optimized string storage in one way
>> or the other.
>>
>> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil <robin.a...@gmail.com> wrote:
>> > Currently, Java strings use double the space of the characters in them
>> > because it is all UTF-16 internally. A 190MB dictionary file therefore
>> > uses around 600MB when loaded into a HashMap<String, Integer>. Is there
>> > some optimization we could do in terms of how we store them, while
>> > ensuring that Chinese, Devanagari, and other characters don't get messed
>> > up in the process?
>> >
>> > Some options Benson suggested were: storing just the byte[] form and
>> > adding the option of supplying the hash function in OpenObjectIntHashMap,
>> > or even using a UTF-8 string.
>> >
>> > Or we could leave this alone. I currently estimate the memory requirement
>> > using the formula 8 * ((int) (num_chars * 2 + 45) / 8) bytes per string
>> > when generating the dictionary splits for the vectorizer.
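>> >
>> > As a sketch in code (estimateStringBytes is just an illustrative name
>> > here, not anything that exists in Mahout):
>> >
>> >   // Rough heap bytes for one dictionary String of numChars characters,
>> >   // rounded down to an 8-byte boundary by the integer division.
>> >   // e.g. a 10-character term: 8 * ((10 * 2 + 45) / 8) = 64 bytes
>> >   static long estimateStringBytes(int numChars) {
>> >     return 8L * ((numChars * 2 + 45) / 8);
>> >   }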
>> >
>> > Robin
>> >
>>
>
>
