Doug,

> Go ahead and try to define some semantics for your idea.   
> Maybe it would be clear or not, I haven't thought about it.  

First, let me be clear that we are using Judy in a variety of ways now, and 
overall I think you did an excellent job of covering the most common use cases 
with the existing Judy arrays and sets. I expect our needs are fairly specific 
so I'm unsure whether others will get much value out of the 32->32 array. 
(Since 32->32 is a bit cumbersome, let me instead propose that we call these 
Judy4 arrays, where the keys and values are 4 byte unsigned integers). Note 
that we currently use JudyL arrays as Judy4 arrays, but on 64-bit machines, 
JudyL uses 8-byte keys and values.

In our case, the 32-bit keys are tokens that we generate to represent strings 
in large symbol tables. We create 10s of millions of these tokens, and we 
likewise create 10s of millions of these Judy4 arrays. Most of these Judy4 
arrays contain just a few entries, say 10 or less keys. However, some will 
contain hundreds or possibly thousands of keys, and it is not possible to 
predict in advance how many keys will be inserted into a give Judy4 array.

In our usage, values are always counts of occurrences of the key (in a given 
context). So, there is a phase in which we are aggregating data, and the 
fundamental operation is "increment the count for key K in context C." If the 
key K did not previously exist in context C, we insert (K,1) into the array. If 
the key K did exist, then we replace (K,n) with (K,n+1).

After the aggregation phase, there is an analysis phase. In the analysis phase 
we primarily do iterations over the Judy4 arrays, but there are also some 
random lookups.

We currently have no real need to be able to delete a key from a Judy4 array. 
In our usage, it would be acceptable to simply reset the value to 0 and leave 
the key.

Finally, in our usage, 0 is never a valid key, though every other 32-bit value 
is theoretically a valid key.

The above defines the most important aspects of our semantics. Again, these may 
be too specific, and to make the Judy4 arrays be generally useful to others, 
you should perhaps generalize the above, by making no assumptions about values 
other than the fact that they are opaque 4-byte values, and also make no 
assumptions about the keys other than that they are unique unsigned 4-byte 
values.

Finally, the API would be essentially identical to the JudyL API, with the 
exception that both the key and value would be defined as uint32_t (using the 
<stdint.h> type).

Inside your implementation, I imagine that you would allocate chunks of memory 
that are each 1 cache line. In these cache lines you will sometimes have 8-byte 
pointers to other cache lines. You will also sometimes have 8-bytes 
representing a specific (key,value) pair. But of course, there will probably be 
other encodings that use some kind of compression.

This implementation must be different than the implementation of JudyL on 
32-bit architectures, due to the fact that pointers are 8 bytes instead of 4. 
Otherwise, the implementations might be similar.

I am doing some research to see what other possibilities exist that we might 
use, but I would be very pleased if you decided to add Judy4 to Judy.

Thanks,
Jim


----- Original Message -----
From: "Doug Baskins" <[email protected]>
To: "Jim Lloyd" <[email protected]>, [email protected]
Sent: Saturday, December 5, 2009 10:33:39 PM
Subject: Re: 32->32 and 64->64 on 64-bit machines

Jim:

> but I suspect that internally the code relies on the assumption that a Word_t 
> must 
   be the same size as a pointer.

That is correct.  Fortunately 4Gb of RAM is a lot cheaper now.  Of course 
anything 
is possible, but memory savings would only be in very dense (non-sparse) data 
sets.  

One of the serious problems with Judy is people understanding the calling 
semantics.  
I think that has been one of my biggest headaches. 
Go ahead and try to define some semantics for your idea.   
Maybe it would be clear or not, I haven't thought about it.  

I am currently in the process of trying to come up with semantics 
for JudySLNext*() -- the new version of JudySL that includes a string 
length parameter (and allows imbedded nul chars in the string).
I suspect I will take a lot of heat on whatever I come out with.

Thank you for your interest,

Doug
 
Doug Baskins <[email protected]>



----- Original Message ----
From: Jim Lloyd <[email protected]>
To: [email protected]
Sent: Sun, December 6, 2009 5:15:03 AM
Subject: 32->32 and 64->64 on 64-bit machines

Hi,

I've used Judy Arrays in various projects for about 8 years now and have been 
very pleased with the performance for both speed and memory usage. In my 
current project I'm beginning to wish for a feature that is not currently 
available, and I'm wondering whether it is feasible.

Our applications run on 64-bit machines, and we rely heavily on both JudyL and 
JudySL. For some of our JudyL use, the keys are always 32-bits, and the values 
are counts. These counts are always well less than 2^32, so ideally we'd have a 
version of JudyL that mapped 32-bit words to 32-bit words. Its not unusual for 
one of our applications to consume well over 4Gb with these 32to32 JudyL 
arrays. I'd love to be able to cut the storage for these arrays in half.

How much effort would it be to create this feature? A JudyL array compiled for 
32-bit machine would produce the correct interface, but I suspect that 
internally the code relies on the assumption that a Word_t must be the same 
size as a pointer.

Thanks,

Jim Lloyd
Silver Tail Systems

------------------------------------------------------------------------------
Join us December 9, 2009 for the Red Hat Virtual Experience,
a free event focused on virtualization and cloud computing. 
Attend in-depth sessions from your desk. Your couch. Anywhere.
http://p.sf.net/sfu/redhat-sfdev2dev
_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel


------------------------------------------------------------------------------
Return on Information:
Google Enterprise Search pays you back
Get the facts.
http://p.sf.net/sfu/google-dev2dev
_______________________________________________
Judy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/judy-devel

Reply via email to