Hi Scotty,

I rearranged your message a bit for (my own) clarity.


On Tue, 26 Jan 2016 18:40:28 -0800 (PST), Scotty C
<costo...@hotmail.com> wrote:

>running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. 

>the generated keys are random but i use one of the associated 
>fields for sorting during the initial write to the hard drive. what goes 
>in each of those files is totally random but dupes do not run across
>files. also, the number of keys is >1e25.

>my key is 128 bits with ~256 bits per record. so my 1 gb file contains
>~63 million records and ~32 million keys. about 8% will be dupes 
>leaving me with ~30 million keys. 

You have 63 million 256-bit records in a 1GB file?  The description of
your "records" is vague and can be interpreted in more than one way,
but my math says:

  256 bits (128 key + 128 data) = 32M records
  384 bits (128 key + 256 data) = ~22M records
 

Anyway ... I'll assume for filtering only the keys are important.

If I understand correctly, you have a 128-bit "key space" and you
expect to see some small percentage of the keys duplicated.  The file
is sorted on some other, non-key value.

What is this other field on which the file is sorted?

WRT a set of duplicates: are you throwing away all duplicates? Keeping
the 1st one encountered?  Something else?


>i run a custom built hash. i use separate chaining with a vector of
>bignums. i am willing to let my chains run up to 5 keys per chain
>so i need a vector of 6 million pointers. that's 48 mb for the array.
>another 480 mb for the bignums. let's round that sum to .5 gb. 

This structure uses a lot more space than necessary.  Where did you
get the idea that a bignum is 10 bytes?

Unless something changed recently, Racket uses the GMP library for
multiple-precision math.  GMP bignums require two allocations: a struct
plus an array.  Representing a full-size 128-bit value on a 64-bit
machine takes 40-48 bytes, depending on the library's int size and not
including any allocator overhead.

In the worst case of every key being a bignum, your hash is using 2.6
to 3.1 GB ... not 0.5 GB.  Again, not including allocator overhead.
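
Roughly, taking your own figure of ~63 million records as the worst
case (every key distinct and stored as a 128-bit bignum):

  63M keys * 40 bytes + 48 MB pointer vector  ~=  2.6 GB
  63M keys * 48 bytes + 48 MB pointer vector  ~=  3.1 GB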

Since you are only comparing the hash entries for equality, you could
save a lot of space [at the expense of complexity] by defining a
{bucket x chain_size x 16} byte array and storing the 16-byte keys
directly.  

Even using an auxiliary vector for tracking residency in the array,
you'd be down under 1GB for the same 6 million buckets (30 million
key slots).
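
Here is a rough Racket sketch of that layout.  It is untested, the
names are made up, and the bucket function and the overflow handling
are just placeholders to show the idea:

#lang racket

;; 6M buckets * 5 slots * 16 bytes = 480 MB of flat key storage
(define BUCKETS    6000000)
(define CHAIN-SIZE 5)
(define KEY-BYTES  16)

(define table  (make-bytes (* BUCKETS CHAIN-SIZE KEY-BYTES) 0))
(define counts (make-vector BUCKETS 0))  ; slots used per bucket

;; key is a 16-byte byte string; bucket it on its low 64 bits
(define (bucket-of key)
  (modulo (integer-bytes->integer key #f #f 0 8) BUCKETS))

(define (slot-offset b i)
  (* (+ (* b CHAIN-SIZE) i) KEY-BYTES))

;; Returns #t if key was already seen, otherwise stores it and
;; returns #f.  A full bucket is simply an error in this sketch.
(define (seen-before? key)
  (define b (bucket-of key))
  (define n (vector-ref counts b))
  (let loop ([i 0])
    (cond
      [(= i n)
       (when (= n CHAIN-SIZE)
         (error 'seen-before? "bucket ~a overflowed" b))
       (bytes-copy! table (slot-offset b i) key)
       (vector-set! counts b (add1 n))
       #f]
      [(bytes=? (subbytes table
                          (slot-offset b i)
                          (+ (slot-offset b i) KEY-BYTES))
                key)
       #t]
      [else (loop (add1 i))])))

The byte string is ~480MB and the count vector another ~48MB, so the
whole thing stays near 0.5GB even when completely full.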

Even so, this seems like a bad approach to me.


>i have another rather large bignum in memory that i use to reduce
>but not eliminate record duplication of about .5 gb. 

???

>i'm attempting to get this thing to run in 2 places so i need 2 hashes.
>add this up .5+.5+.5 is 1.5 gb and that gets me to about my memory 
>limit. 


I don't have time just now to sketch a split/merge solution ... I'll
get back to it later.  The file already being sorted on an unfiltered
field is a complication.  Offhand I don't think the sort order can be
maintained through the filtering ... meaning the filtered file would
need to be sorted again afterward.  

But even so it might be faster than your current method.

In the meantime, I would suggest you look up "merge sort" and its
application to "external" sorting.  That will give you the basic idea
of what I'm proposing.
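
Just to give you the flavor (not the full split/merge solution), here
is a toy Racket sketch of the merge step alone.  It assumes two input
files that each hold bare 16-byte keys already sorted in ascending
bytes<? order, and it drops duplicates as it merges.  The file names
in the commented-out call are placeholders:

#lang racket

(define KEY-BYTES 16)

;; Merge two sorted key streams into out, writing each key only once.
(define (merge-dedup in-a in-b out)
  (define (next p) (read-bytes KEY-BYTES p))
  (let loop ([a (next in-a)] [b (next in-b)] [last #f])
    (define (emit k)              ; skip k if it repeats the last key
      (unless (and last (bytes=? k last))
        (write-bytes k out)))
    (cond
      [(and (eof-object? a) (eof-object? b)) (void)]
      [(eof-object? a) (emit b) (loop a (next in-b) b)]
      [(eof-object? b) (emit a) (loop (next in-a) b a)]
      [(bytes<? a b)   (emit a) (loop (next in-a) b a)]
      [else            (emit b) (loop a (next in-b) b)])))

;; (call-with-input-file "chunk1.sorted"
;;   (lambda (a)
;;     (call-with-input-file "chunk2.sorted"
;;       (lambda (b)
;;         (call-with-output-file "merged.sorted" #:exists 'replace
;;           (lambda (out) (merge-dedup a b out)))))))

The full job would be: split your file into chunks that fit in RAM,
sort each chunk by key, then run a (k-way) version of the merge above
while writing the surviving records back out.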

George
