[racket-users] Re: appending files

2016-01-31 Thread Scotty C
On Sunday, January 31, 2016 at 12:13:31 AM UTC-6, Scotty C wrote:
> > that's what i did. so new performance data. this is with bytes instead of 
> > strings for data on the hard drive but bignums in the hash still.
> > 
> > as a single large file and a hash with 203 buckets for 26.6 million 
> > records the data rate is 98408/sec.
> > 
> > when i split and go with 11 smaller files and hash with 59 buckets the 
> > data rate is 106281/sec.
> 
> hash is reworked, bytes based. same format though, vector of bytes. so time 
> test results:
> 
> single large file same # buckets as above data rate 175962/sec.
> 
> 11 smaller files same # buckets as above data rate 205971/sec.

throughput update. i had to hand code some of the stuff (places are just not 
working for me) but i just managed to hack my way through running this in 
parallel. i copied the original 26.6 million records to a new file. ran two 
slightly reworked copies of my duplicate removal code at a shell prompt like 
this:
racket ddd-parallel.rkt &
racket ddd-parallel1.rkt &
i'm not messing with the single large file anymore. so for twice the data the 
data rate is up to 356649/sec.



[racket-users] Re: appending files

2016-01-30 Thread Scotty C
> Yes.   You probably do need to convert the files.  Your original
> coding likely is not [easily] compatible with binary I/O.

that's what i did. so new performance data. this is with bytes instead of 
strings for data on the hard drive but bignums in the hash still.

as a single large file and a hash with 203 buckets for 26.6 million records 
the data rate is 98408/sec.

when i split and go with 11 smaller files and hash with 59 buckets the data 
rate is 106281/sec.

clearly it is quicker to read/write twice than to read/write once. but of 
course my laptop is pretty sad and my hash may end up being sad as well. with 
these revised and quicker rates, i'm ready to migrate the hash to bytes.



Re: [racket-users] Re: appending files

2016-01-30 Thread Benjamin Greenman
Fixed, thanks for the report!

On Sat, Jan 30, 2016 at 8:31 PM, Scotty C  wrote:

> just found a small mistake in the documentation: can you find it?
>
> (numerator q) → integer?
>
>   q : rational?
>
> Coerces q to an exact number, finds the numerator of the number expressed
> in its simplest fractional form, and returns this number coerced to the
> exactness of q.
>
> (denominator q) → integer?
>
>   q : rational?
>
> Coerces q to an exact number, finds the numerator of the number expressed
> in its simplest fractional form, and returns this number coerced to the
> exactness of q.
>



[racket-users] Re: appending files

2016-01-30 Thread Scotty C
just found a small mistake in the documentation: can you find it?

(numerator q) → integer?

  q : rational?

Coerces q to an exact number, finds the numerator of the number expressed in 
its simplest fractional form, and returns this number coerced to the exactness 
of q.

(denominator q) → integer?

  q : rational?

Coerces q to an exact number, finds the numerator of the number expressed in 
its simplest fractional form, and returns this number coerced to the exactness 
of q.



[racket-users] Re: appending files

2016-01-30 Thread Scotty C
> that's what i did. so new performance data. this is with bytes instead of 
> strings for data on the hard drive but bignums in the hash still.
> 
> as a single large file and a hash with 203 buckets for 26.6 million 
> records the data rate is 98408/sec.
> 
> when i split and go with 11 smaller files and hash with 59 buckets the 
> data rate is 106281/sec.

hash is reworked, bytes based. same format though, vector of bytes. so time 
test results:

single large file same # buckets as above data rate 175962/sec.

11 smaller files same # buckets as above data rate 205971/sec.

i played around with the # buckets parameter but what worked for bignums worked 
for bytes too. overall speed has nearly doubled. very nice, thanks to all who 
contributed some ideas. and to think, all i wanted to do was paste some files 
together.



Re: [racket-users] Re: appending files

2016-01-29 Thread Scotty C
> > question for you all. right now i use modulo on my bignums. i know i
> > can't do that to a byte string. i'll figure something out. if any of
> > you know how to do this, can you post a method?
> > 
> 
> I'm not sure what you're asking exactly.

i'm talking about getting the hash index of a key. see my key is a bignum and i 
get the hash index with (modulo key 611). so i either need to turn the key 
(which will be a byte string) into a number. stuff that right in where i have 
key. or i replace modulo.
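
A minimal sketch of that first option — folding the byte-string key back into an integer so modulo still applies — using the 611-bucket figure from the post; everything else here is illustrative:

#lang racket/base
;; sketch: fold a byte-string key into an integer, then index as before
(define num-buckets 611)            ; bucket count taken from the post above

(define (bytes->number bs)
  (for/fold ([n 0]) ([b (in-bytes bs)])
    (bitwise-ior (arithmetic-shift n 8) b)))

(define (bucket-index key)          ; key : bytes?
  (modulo (bytes->number key) num-buckets))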



Re: [racket-users] Re: appending files

2016-01-29 Thread Jon Zeppieri
On Fri, Jan 29, 2016 at 7:04 PM, Scotty C  wrote:

> > > question for you all. right now i use modulo on my bignums. i know i
> > > can't do that to a byte string. i'll figure something out. if any of
> > > you know how to do this, can you post a method?
> > >
> >
> > I'm not sure what you're asking exactly.
>
> i'm talking about getting the hash index of a key. see my key is a bignum
> and i get the hash index with (modulo key 611). so i either need to
> turn the key (which will be a byte string) into a number. stuff that right
> in where i have key. or i replace modulo.



Sorry -- I haven't been following this thread closely, so I apologize if
this question was already answered:

Did you implement your own hash tables, or are you using Racket's built-in
hashes? Because if you're using the built-in ones, there's no reason to do
this. You just use your object (a bignum, a byte string, or anything else)
as the key. (The hash code is generated by `equal-hash-code`, but you don't
have to call it yourself. The hash implementation does that.)

However, if you have implemented your own, you can still call
`equal-hash-code` on your byte string. That will give you an exact integer,
and you can then call modulo (or remainder) on that.
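
A minimal sketch of that suggestion, assuming `equal-hash-code` from `racket/base` and an illustrative bucket count:

#lang racket/base
;; sketch: hash a byte-string key with equal-hash-code, then reduce modulo the bucket count
(define num-buckets 611)                 ; illustrative

(define (bucket-index key)               ; key : bytes?
  (modulo (equal-hash-code key) num-buckets))

(bucket-index #"0123456789abcdef")       ; any 16-byte key works as-is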

-Jon



Re: [racket-users] Re: appending files

2016-01-29 Thread Scotty C
> However, if you have implemented your own, you can still call 
> `equal-hash-code` 
yes, my own hash.
i think the equal-hash-code will work.



Re: [racket-users] Re: appending files

2016-01-29 Thread Jon Zeppieri
On Fri, Jan 29, 2016 at 7:45 PM, Scotty C  wrote:

> > my plan right now is to rework my current hash so that it runs byte
> strings instead of bignums.
>
> i have a new issue. i wrote my data as char and end records with 'return.
> i use (read-line x 'return) and the first record is 15 char. when i use
> (read-bytes-line x 'return) i get 23 bytes. i have to assume that my old
> assumption that an 8 bit char would write to disk as 8 bits is incorrect?
>
> from documentation on read-char
> Reads a single character from in—which may involve reading several bytes
> to UTF-8-decode them into a character
>
> i get the feeling that i will need to read the entire file as i used to
> read it taking each record and doing the following:
> convert the string record to a bignum record
> convert the bignum record into a byte string
> write the byte string to a new data file
>
> does that seem right?


I don't know for sure, because I don't know what your file format is like
(apologies again, if I missed it; I *did* realize after I sent my last
message that you are using a custom hash table). But, if you're just taking
the first N bytes of a record and using that as your key, then there's no
reason to go through a bignum. You can read bytes (as opposed to
characters) from an input port. Just use `read-bytes` or one of its
variants.
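
A minimal sketch of reading fixed-size binary records that way; the record size and callback are illustrative assumptions, not the thread's actual format:

#lang racket/base
;; sketch: walk a file of fixed-size binary records with read-bytes
(define record-size 16)                       ; illustrative record length

(define (for-each-record path proc)
  (call-with-input-file path                  ; binary mode is the default
    (lambda (in)
      (let loop ()
        (define rec (read-bytes record-size in))
        (unless (eof-object? rec)
          (proc rec)
          (loop))))))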

-Jon



Re: [racket-users] Re: appending files

2016-01-29 Thread Brandon Thomas
On Fri, 2016-01-29 at 13:00 -0800, Scotty C wrote:
> ok, had time to run my hash on my one test file
> '(611 1 1 19 24783208 4.19)
> this means
> # buckets
> % buckets empty
> non empty bucket # keys least
> non empty bucket # keys most
> total number of keys
> average number of keys per non empty bucket
> 
> it took 377 sec.
> original # records is 26570359 so 6.7% dupes.
> processing rate is 70478/sec
> 
> my plan right now is to rework my current hash so that it runs byte
> strings instead of bignums. i will probably be tomorrow afternoon
> before i have more stats.
> 
> question for you all. right now i use modulo on my bignums. i know i
> can't do that to a byte string. i'll figure something out. if any of
> you know how to do this, can you post a method?
> 

I'm not sure what you're asking exactly. Because a byte string is an
array, all you need to do is jump to the byte your information is in
and then get your bit out. So let's just say you want to get the nth
bit in an array of 8-bit bytes. You first need to calculate which byte
your bit is in, which is clearly (bytes-ref data (floor (/ n 8))). Then
you get the bit out of it, which would be something (untested) like
(modulo (arithmetic-shift (bytes-ref data (floor (/ n 8))) (* -1
(modulo n 8))) 2). This is easy enough to modify for elements that are
a divisor of 8 in size.
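
Spelled out as runnable code, that untested expression looks roughly like this (bit 0 here is the least-significant bit of byte 0):

#lang racket/base
;; sketch of the above: fetch the nth bit from a byte string
(define (bit-ref data n)
  (modulo (arithmetic-shift (bytes-ref data (quotient n 8))
                            (- (remainder n 8)))
          2))

(bit-ref (bytes 4 0) 2)   ; => 1, since 4 = #b00000100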

Regards,
Brandon Thomas



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
> i get the feeling that i will need to read the entire file as i used to read 
> it taking each record and doing the following:
> convert the string record to a bignum record
> convert the bignum record into a byte string
> write the byte string to a new data file
> 
> does that seem right?

nevermind. this is indeed what i needed to do. the new file is 438.4 mb. the 
time to read, hash, write is now 317 seconds. processing rate is 83818/sec. the 
hash still uses bignums. the speed change is just from reading and writing 
bytes instead of strings.

the drop in file size is about 30%; the gain in speed is about 15%.



[racket-users] Re: appending files

2016-01-29 Thread George Neuner
On Fri, 29 Jan 2016 16:45:40 -0800 (PST), Scotty C
 wrote:

>i have a new issue. i wrote my data as char and end records with 'return. i 
>use (read-line x 'return) and the first record is 15 char. when i use
> (read-bytes-line x 'return) i get 23 bytes. i have to assume that my old 
>assumption that an 8 bit char would write to disk as 8 bits is incorrect?

Yes and no.  Characters are not "bytes".  If you wrote the file
originally in text mode, then it is subject to locale settings in the
OS.  Racket can handle some different character representations, but
you have to select them deliberately.

If you want to work with byte strings you are going to do file I/O in
binary mode.


Also, note that if you display/print a byte string you are seeing
character representations of the bytes.  There will be extraneous '\'
escapes for values that don't represent printable characters.  The
small values you will be using all are unprintable ASCII control
characters.


>from documentation on read-char
>Reads a single character from in—which may involve reading several
>bytes to UTF-8-decode them into a character
>
>i get the feeling that i will need to read the entire file as i used to read it
>taking each record and doing the following:
>- convert the string record to a bignum record
>- convert the bignum record into a byte string
>- write the byte string to a new data file
>
>does that seem right?

Yes.   You probably do need to convert the files.  Your original
coding likely is not [easily] compatible with binary I/O.

George



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
> i get the feeling that i will need to read the entire file as i used to read 
> it taking each record and doing the following:
> convert the string record to a bignum record
> convert the bignum record into a byte string
> write the byte string to a new data file
> 
> does that seem right?

nevermind. this is indeed what i needed to do. the new file is 438.4 mb. the 
read, hash, write processing rate is now 83818 rec/sec. the hash still uses 
bignums. the speed change is just from reading and writing bytes instead of strings.




[racket-users] Re: appending files

2016-01-29 Thread George Neuner
On Thu, 28 Jan 2016 20:32:08 -0800 (PST), Scotty C
 wrote:

>> (current-memory-use)
>yup, tried that a while back didn't like what i saw. check this out:
>
>> (current-memory-use)
>581753864
>> (current-memory-use)
>586242568
>> (current-memory-use)
>591181736
>> (current-memory-use)
>595527064
>
>does that work for you?

Yes.  You have to remember that evaluating the function and printing
the result takes memory, and memory is reclaimed only by GC.  So
everything you do takes a little bit.

I'm not aware of any other way.  There is (dump-memory-stats) but it
is undocumented.  [another opportunity to go spelunking in the code]

But you should be interested in comparing the magnitude of one
implementation vs another ... you're looking to save megabytes, not a
few kilobytes.

George



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
> my plan right now is to rework my current hash so that it runs byte strings 
> instead of bignums.

i have a new issue. i wrote my data as char and end records with 'return. i use 
(read-line x 'return) and the first record is 15 char. when i use 
(read-bytes-line x 'return) i get 23 bytes. i have to assume that my old 
assumption that an 8 bit char would write to disk as 8 bits is incorrect?

from documentation on read-char
Reads a single character from in—which may involve reading several bytes to 
UTF-8-decode them into a character

i get the feeling that i will need to read the entire file as i used to read it 
taking each record and doing the following:
convert the string record to a bignum record
convert the bignum record into a byte string
write the byte string to a new data file

does that seem right?
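
For reference, a rough sketch of the middle two steps — packing a bignum key into a 16-byte byte string — with illustrative names and big-endian byte order assumed:

#lang racket/base
;; sketch: pack a bignum that fits in 128 bits into a 16-byte byte string (big-endian)
(define (bignum->16-bytes n)
  (define out (make-bytes 16 0))
  (let loop ([n n] [i 15])
    (unless (< i 0)
      (bytes-set! out i (bitwise-and n 255))
      (loop (arithmetic-shift n -8) (- i 1))))
  out)

(bignum->16-bytes (arithmetic-shift 1 100))   ; => 16-byte key for a sample bignum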



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
ok, had time to run my hash on my one test file
'(611 1 1 19 24783208 4.19)
this means
# buckets
% buckets empty
non empty bucket # keys least
non empty bucket # keys most
total number of keys
average number of keys per non empty bucket

it took 377 sec.
original # records is 26570359 so 6.7% dupes.
processing rate is 70478/sec

my plan right now is to rework my current hash so that it runs byte strings 
instead of bignums. i will probably be tomorrow afternoon before i have more 
stats.

question for you all. right now i use modulo on my bignums. i know i can't do 
that to a byte string. i'll figure something out. if any of you know how to do 
this, can you post a method?



Re: [racket-users] Re: appending files

2016-01-28 Thread Scotty C
On Thursday, January 28, 2016 at 11:36:50 PM UTC-6, Brandon Thomas wrote:
> On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > > I think you understand perfectly.
> > i'm coming around
> > 
> > > You said the keys are 128-bit (16 byte) values.  You can store one
> > > key
> > > directly in a byte string of length 16.
> > yup
> > 
> > > So instead of using a vector of pointers to individual byte
> > > strings,
> > > you would allocate a single byte string of length 
> > >    {buckets x chain length x 16} 
> > > and index it directly as if it were a 3-dimensional array [using
> > > offset calculations as you would in C].
> > i'm going to actually run my code later (probably in the morning).
> > and then grab some stats from the hash. take a look at the stats.
> > give me some thoughts on the above implementation after looking at
> > the stats.
> > 
> > > Can you explain why the bignums are important.  For a simple task
> > > like
> > > filtering a file, they would seem to be both a waste of memory and
> > > a
> > > performance drain wrt storing the key values as byte strings.
> > ok, true story. about 10 years ago i'm a student taking an intro to
> > ai class we have been assigned the 3x3 tile puzzle and 2 algorithms,
> > ida* and a*. also, try it on a 4x4. so i whip these out and am
> > appalled by how slow the fn things are but intrigued by what they can
> > do and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by
> > my stuff at the time. so i head to the chief's office with my little
> > bit of code. tell him it's way too slow. what do you think? he takes 5
> > seconds to peruse the item and says "you're probably making too much
> > memory". side note, later i graded this class for that prof and the
> > same project. not a single student, including the ms and phd types,
> > did anything better than a vector of vectors of ints. so back to me,
> > i'm thinking, how do i cram 4 bits together so that there is
> > absolutely no wasted space. i start digging through the documentation
> > and i find the bitwise stuff. i see arithmetic shift and having no
> > idea whether it will work or not i type into the interface
> > (arithmetic-shift 1 100). if we blow up, well we blow up big but
> > we didn't. i flipped everything in my project to bignum at that
> > point. the bignum stuff is massively faster than vector of vectors of
> > ints and faster than just a single vector of ints. lots of bignum is
> > easy to implement, like the hash. making a child state, not so much.
> > i'm not married to bignums despite all this.
> > 
> > > I've been programming for over 30 years, professionally for 23.
> > i was a programmer. dot com bought us as a brick and mortar back in
> > '99 and the whole shebang blew up 2 years later. idiots. anyway, been
> > a stay at home dad for the most part since then.
> > 
> > > have an MS in Computer Science
> > me too. that's the other piece of the most part from up above.
> > 
> > > (current-memory-use)
> > yup, tried that a while back didn't like what i saw. check this out:
> > 
> > > (current-memory-use)
> > 581753864
> > > (current-memory-use)
> > 586242568
> > > (current-memory-use)
> > 591181736
> > > (current-memory-use)
> > 595527064
> > 
> > does that work for you?
> > 
> 
> For whatever you're storing, I still suggest using a disk based
> structure (preferably using one that's already optimised and built for
> you). I've done a bit of work on cache-aware algorithms, where reducing
> memory footprint is really big (along with the memory juggling). Yes,
> if you try to store something that takes only a few bits and store each
> one into an integer, you'll have wasted space. In theory, you could
> use a bignum, and only shift it as many bits as you need, which is what
> you have done. The issue with that is that bignums have extra overhead
> that's necessary for them to do arithmetic. Obviously, there needs to be
> a way to match or beat bignums with primitive structures, since bignums
> are implemented with primitive structures. So, if you want to beat
> bignum for storage, you'll want to use some contiguous memory with
> fixed sized elements (like byte strings, or arrays of uint32_t's in C)
> - but using bit manipulation on each byte, such that you have multiple
> stored values in each one, directly beside each other bitwise, like a
> bignum has, but without its overhead.
> 
> Regards,
> Brandon Thomas



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
> You claim you want filtering to be as fast as possible.  If that were
> so, you would not pack multiple keys (or features thereof) into a
> bignum but rather would store the keys individually.
chasing pointers? no, you're thinking about doing some sort of byte-append and 
subbytes type of thing. that's the only way that data in a hash would be small 
in memory and reasonably quick. care to elaborate?

> The 16-byte keys are 1/3 the size of even the _smallest_ bignum, and
where are you getting this? i've been digging all over the documentation and 
can't find a fn thing on how much space is required for any data type or its 
overhead. what i do is open up htop and check memory then load drracket, a huge 
bignum and recheck. close drracket, check memory, restart drracket, load a huge 
vector and its associated bignums and recheck. so 2 questions for you all
1)where is this info about data type memory requirements?
2)what is a better way of checking actual memory sucked up by my data 
structures?

> comparing two small byte strings is faster than anything you can do
> with the bignum.  With the right data structure can put a lot more
> keys into the same space you're using now and use them faster.
you knew this was coming, right? put this into your data structure of choice:
16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14
this is a particular 5x5 tile puzzle (#6 in 
www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf) with the blank in a position 
where 4 children can be made. make a child while extracting the value of the 
tile being swapped with the blank. compare child to parent for equality. repeat 
that for the other 3 children.

time testing that won't matter because we have different hardware. if you (any 
of you) think you have something that will best what i'm doing (bignums), bring 
it on. show me something cool and fast. let me check out your approach.

i experimented a bit with the byte strings yesterday. what i'm doing with the 
bignums can't be done with byte strings. i'd have to rewrite just about 
everything i've got.



[racket-users] Re: appending files

2016-01-28 Thread George Neuner
On Thu, 28 Jan 2016 07:56:05 -0800 (PST), Scotty C
 wrote:

>you knew this was coming, right? put this into your data structure of choice:
>
>16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14
>
>this is a particular 5x5 tile puzzle 
>(#6 in www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf) 
>with the blank in a position where 4 children can be made. make a child while
>extracting the value of the tile being swapped with the blank. compare child 
>to parent for equality. repeat that for the other 3 children.
>>
>i experimented a bit with the byte strings yesterday. what i'm doing with the 
>bignums can't be done with byte strings.

But the problem is that you didn't sufficiently describe the problem. 

Way back in this thread you implied that you had extremely large FILES
containing FIXED SIZE RECORDS, from which you needed 
to FILTER DUPLICATE records based on the value of a FIXED SIZE KEY
field.

That is the basic problem description that Brandon and Neil and I all
perceived.  I don't like speaking for others, but it is pretty obvious
from the content of their posts.  I will dig up the actual quotes if
you don't believe that this is what you implied.

And that is a problem which is not terribly well suited to hashing and
in fact does not require any hashing at all to solve.


And your revised problem as stated above still does not.  The obvious
solution is to encode the puzzle squares in a known order into
adjacent 5-bit fields.  That forms a nice 128-bit value with 3 unused
bits.  Those values can be compared directly for equality and we're
back to you don't need a hash at all.

Doesn't work for 6x6?  Well 36 6-bit values fit neatly into 216 bits
(27 bytes).   

7x7 and 8x8 fit into 448 bits (56 bytes).  

9x9 and 10x10 fit into 704 bits (88 bytes).

Make the puzzle size a parameter to your programs: the needed field
widths and the # of relevant bytes in the resultant "key" are trivial
to calculate.
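
A minimal sketch of that packing, with the squares listed in a fixed board order and the field width passed in as a parameter:

#lang racket/base
;; sketch: pack puzzle squares into one integer, bits-per-square fields wide
(define (pack-board squares bits-per-square)
  (for/fold ([n 0]) ([v (in-list squares)])
    (bitwise-ior (arithmetic-shift n bits-per-square) v)))

;; the 5x5 instance from earlier in the thread, 5 bits per square -> a 125-bit value
(pack-board '(16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14) 5)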


And you still will not need hashing to remove duplicates. 8-)

George



[racket-users] Re: appending files

2016-01-28 Thread George Neuner
On Thu, 28 Jan 2016 07:56:05 -0800 (PST), Scotty C
 wrote:

>> You claim you want filtering to be as fast as possible.  If that were
>> so, you would not pack multiple keys (or features thereof) into a
>> bignum but rather would store the keys individually.
>
>chasing pointers? no, you're thinking about doing some sort of
>byte-append and subbytes type of thing. only way that data in a
>hash would be small in memory and reasonably quick. care to 
>elaborate?

I think you understand perfectly.

You said the keys are 128-bit (16 byte) values.  You can store one key
directly in a byte string of length 16.  However, each individual byte
string has allocator overhead  [though less than a bignum, see below].
So instead of using a vector of pointers to individual byte strings,
you would allocate a single byte string of length 
   {buckets x chain length x 16} 
and index it directly as if it were a 3-dimensional array [using
offset calculations as you would in C].
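
A minimal sketch of that flat layout, using illustrative bucket and chain sizes (not Scotty's real numbers):

#lang racket/base
;; sketch: one flat byte string indexed as buckets x chain-length x key-size
(define key-size 16)
(define chain-length 5)
(define num-buckets 611)        ; illustrative
(define table (make-bytes (* num-buckets chain-length key-size) 0))

(define (key-offset bucket slot)
  (* (+ (* bucket chain-length) slot) key-size))

(define (table-key bucket slot)                 ; -> 16-byte key
  (subbytes table (key-offset bucket slot) (+ (key-offset bucket slot) key-size)))

(define (table-key-set! bucket slot key)        ; key : 16-byte bytes?
  (bytes-copy! table (key-offset bucket slot) key))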

There might be an easier approach using multi-dimensional arrays.
However, the array library is Typed Racket and there are issues mixing
"typed" code with "untyped" code.  I haven't looked into TR
extensively so I don't know offhand if these arrays are "packed type
homogenous" or simply "type homogenous"  [the difference would be
memory use, see below].


Can you explain why the bignums are important.  For a simple task like
filtering a file, they would seem to be both a waste of memory and a
performance drain wrt storing the key values as byte strings.



>> The 16-byte keys are 1/3 the size of even the _smallest_ bignum, 
>
>where are you getting this? i've been digging all over the documentation
>and can't find a fn thing on how much space is required for any data type
>or its overhead. 
>
>so 2 questions for you all
>1)where is this info about data type memory requirements?

I just know a lot of this from books and lots of experience.

I've been programming for over 30 years, professionally for 23.  I
have an MS in Computer Science (and coursework for a 2nd MS but I left
school before doing that dissertation).  Among other things I have
worked professionally on both compiler internals and GC.  I know
(quite) a bit about how to implement Lisp and Scheme, and Racket [at
least the so-called "untyped" version] is a fairly typical tagged data
system.


You can figure out a lot of it by careful reading, in particular about
the C API and library (FFI) interfaces.  For things like bignums, you
have to look into the code and discover that Racket uses the GMP
library for multiple-precision math (and then look into GMP for the
details).

The size of heap object descriptors (i.e. allocator and type overhead)
necessarily is version, platform and compiler dependent.  You have to
figure it out for the particular implementation.


Some background on tagged data:

At the most basic level, there are only 2 types: "immediate" values
which are ALU register sized - 32 or 64 bit depending - and
"reference" values which are heap allocated objects accessed via
pointers.
[The terms "immediate" and "reference" come originally from Lisp.]

Immediate values include fixnums (small integers), characters and
bytes, booleans, and pointers.  Immediates are type tagged by stealing
a few bits from the (register width) value, so, fixnum integers and
pointers do not quite have their full range [e.g., in Racket fixnums
are 31 or 63 bits depending on platform].

A pointer to a reference value is an immediate value.

Reference values are essentially anything that is not an immediate
value.  Most importantly, floating point values are reference values
because all of their bits are relevant to the FPU and it isn't
possible to tag them by stealing bits as is done with immediates.

In general, heap objects always contain some number of generic "slots"
each of which is an immediate value.  However there is an important
exception:  for certain types, it is possible to create homogenous
"packed" vectors in which the elements are not generic "slots" but
rather are values of a specified type.  Racket permits creating packed
vectors from bytes, characters, fixnum integers, floats and doubles.  

For characters and bytes the packed vector is called a "string".  For
the others, their packed variants are distinct vector types.



The GMP library represents bignums as an information struct and a data
array (of C longs).  The array always contains at least 1 element,
grows as needed but never shrinks.  The struct contains a pointer and
2 longs.  

The size of pointers and longs are compiler dependent, so you need to
know which compiler was used.  On Linux Racket is compiled with GCC so
pointers and longs are the same size: either 32-bits or 64-bits.

The GMP documentation doesn't address allocation specifically ... it's
treated as a black box ... but the info struct is defined and its
members are laid out in a way that suggests the allocations might be
separate (i.e. there is a 

[racket-users] Re: appending files

2016-01-28 Thread Scotty C
what's been bothering me was trying to get the data into 16 bytes in a byte 
string of that length. i couldn't get that to work so gave up and just shoved 
the data into 25 bytes. here's a bit of code. i think it's faster than my 
bignum stuff.

(define p (bytes 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14))

(define (c1 p) ; move blank left: the blank (0) sits at index 16, swap it with index 15
  (let ((x (bytes-ref p 15))   ; tile that will trade places with the blank
        (c (bytes-copy p)))    ; child starts as a copy of the parent
    (bytes-set! c 15 0)
    (bytes-set! c 16 x)
    (bytes=? p c)              ; parent/child equality check from the challenge (result unused)
    c))



[racket-users] Re: appending files

2016-01-28 Thread George Neuner
On Thu, 28 Jan 2016 11:49:09 -0800 (PST), Scotty C
 wrote:

>what's been bothering me was trying to get the data into 16 bytes in 
>a byte string of that length. i couldn't get that to work so gave up and
>just shoved the data into 25 bytes. here's a bit of code. i think it's
>faster than my bignum stuff.
>
>(define p (bytes 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7
>21 15 11 8 3 14))
>(define (c1 p) ;move blank left
>  (let(
>  (x (bytes-ref p 15))
>  (c (bytes-copy p)))
>  (bytes-set! c 15 0)
>  (bytes-set! c 16 x)
>  (bytes=? p c)
>  c))

I don't doubt that it's faster.  I'd bet that it will be faster even
for larger sizes.  It also has the advantage of being able to encode
enormous 255x255 puzzles.  

However, memory space rears its ugly head again.  The allocation for a
25 byte string is on par with my estimate of the minimum allocation
for a bignum.  Packing the bits tight may save you ~16 bytes
per entry.  I stress _may_ because the allocator details are unknown
to me.

Use (current-memory-use) to figure it out.

George



[racket-users] Re: appending files

2016-01-28 Thread George Neuner
On Wed, 27 Jan 2016 19:43:49 -0800 (PST), Scotty C
 wrote:

>> Then you're not using the hash in a conventional manner ... else the
>> filter entries would be unique 
>
>using it conventionally? absolutely. it is a hash with separate chaining. 

You snipped the part I was responding to, which was:

>>> In the worst case of every key being a bignum
>>no, every key is contained within a bignum which can contain many 
>>many keys.

You claim you want filtering to be as fast as possible.  If that were
so, you would not pack multiple keys (or features thereof) into a
bignum but rather would store the keys individually.

The 16-byte keys are 1/3 the size of even the _smallest_ bignum, and
comparing two small byte strings is faster than anything you can do
with the bignum.  With the right data structure can put a lot more
keys into the same space you're using now and use them faster.

George



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
> I think you understand perfectly.
i'm coming around

> You said the keys are 128-bit (16 byte) values.  You can store one key
> directly in a byte string of length 16.
yup

> So instead of using a vector of pointers to individual byte strings,
> you would allocate a single byte string of length 
>{buckets x chain length x 16} 
> and index it directly as if it were a 3-dimensional array [using
> offset calculations as you would in C].
i'm going to actually run my code later (probably in the morning). and then 
grab some stats from the hash. take a look at the stats. give me some thoughts 
on the above implementation after looking at the stats.

> Can you explain why the bignums are important.  For a simple task like
> filtering a file, they would seem to be both a waste of memory and a
> performance drain wrt storing the key values as byte strings.
ok, true story. about 10 years ago i'm a student taking an intro to ai class we 
have been assigned the 3x3 tile puzzle and 2 algorithms, ida* and a*. also, try 
it on a 4x4. so i whip these out and am appalled by how slow the fn things are 
but intrigued by what they can do and i'm just a puzzle kind of guy. btw, the 
4x4 wasn't solvable by my stuff at the time. so i head to the chief's office 
with my little bit of code. tell him it's way too slow. what do you think? he 
takes 5 seconds to peruse the item and says "you're probably making too much 
memory". side note, later i graded this class for that prof and the same 
project. not a single student, including the ms and phd types, did anything 
better than a vector of vectors of ints. so back to me, i'm thinking, how do i 
cram 4 bits together so that there is absolutely no wasted space. i start 
digging through the documentation and i find the bitwise stuff. i see 
arithmetic shift and having no idea whether it will work or not i type into the 
interface (arithmetic-shift 1 100). if we blow up, well we blow up big but 
we didn't. i flipped everything in my project to bignum at that point. the 
bignum stuff is massively faster than vector of vectors of ints and faster than 
just a single vector of ints. lots of bignum is easy to implement, like the 
hash. making a child state, not so much. i'm not married to bignums despite all 
this.

> I've been programming for over 30 years, professionally for 23.
i was a programmer. dot com bought us as a brick and mortar back in '99 and the 
whole shebang blew up 2 years later. idiots. anyway, been a stay at home dad 
for the most part since then.

> have an MS in Computer Science
me too. that's the other piece of the most part from up above.

> (current-memory-use)
yup, tried that a while back didn't like what i saw. check this out:

> (current-memory-use)
581753864
> (current-memory-use)
586242568
> (current-memory-use)
591181736
> (current-memory-use)
595527064

does that work for you?



Re: [racket-users] Re: appending files

2016-01-28 Thread Brandon Thomas
On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > I think you understand perfectly.
> i'm coming around
> 
> > You said the keys are 128-bit (16 byte) values.  You can store one
> > key
> > directly in a byte string of length 16.
> yup
> 
> > So instead of using a vector of pointers to individual byte
> > strings,
> > you would allocate a single byte string of length 
> >    {buckets x chain length x 16} 
> > and index it directly as if it were a 3-dimensional array [using
> > offset calculations as you would in C].
> i'm going to actually run my code later (probably in the morning).
> and then grab some stats from the hash. take a look at the stats.
> give me some thoughts on the above implementation after looking at
> the stats.
> 
> > Can you explain why the bignums are important.  For a simple task
> > like
> > filtering a file, they would seem to be both a waste of memory and
> > a
> > performance drain wrt storing the key values as byte strings.
> ok, true story. about 10 years ago i'm a student taking an intro to
> ai class we have been assigned the 3x3 tile puzzle and 2 algorithms,
> ida* and a*. also, try it on a 4x4. so i whip these out and am
> appalled by how slow the fn things are but intrigued by what they can
> do and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by
> my stuff at the time. so i head to the chief's office with my little
> bit of code. tell him it's way too slow. what do you think? he takes 5
> seconds to peruse the item and says "you're probably making too much
> memory". side note, later i graded this class for that prof and the
> same project. not a single student, including the ms and phd types,
> did anything better than a vector of vectors of ints. so back to me,
> i'm thinking, how do i cram 4 bits together so that there is
> absolutely no wasted space. i start digging through the documentation
> and i find the bitwise stuff. i see arithmetic shift and having no
> idea whether it will work or not i type into the interface
> (arithmetic-shift 1 100). if we blow up, well we blow up big but
> we didn't. i flipped everything in my project to bignum at that
> point. the bignum stuff is massively faster than vector of vectors of
> ints and faster than just a single vector of ints. lots of bignum is
> easy to implement, like the hash. making a child state, not so much.
> i'm not married to bignums despite all this.
> 
> > I've been programming for over 30 years, professionally for 23.
> i was a programmer. dot com bought us as a brick and mortar back in
> '99 and the whole shebang blew up 2 years later. idiots. anyway, been
> a stay at home dad for the most part since then.
> 
> > have an MS in Computer Science
> me too. that's the other piece of the most part from up above.
> 
> > (current-memory-use)
> yup, tried that a while back didn't like what i saw. check this out:
> 
> > (current-memory-use)
> 581753864
> > (current-memory-use)
> 586242568
> > (current-memory-use)
> 591181736
> > (current-memory-use)
> 595527064
> 
> does that work for you?
> 

For whatever you're storing, I still suggest using a disk based
structure (preferably using one that's already optimised and built for
you). I've done a bit of work on cache-aware algorithms, where reducing
memory footprint is really big (along with the memory juggling). Yes,
if you try to store something that takes only a few bits and store each
one into an integer, you'll have wasted space. In theory, you could
use a bignum, and only shift it as many bits as you need, which is what
you have done. The issue with that is that bignums have extra overhead
that's necessary for them to do arithmetic. Obviously, there needs to be
a way to match or beat bignums with primitive structures, since bignums
are implemented with primitive structures. So, if you want to beat
bignum for storage, you'll want to use some contiguous memory with
fixed sized elements (like byte strings, or arrays of uint32_t's in C)
- but using bit manipulation on each byte, such that you have multiple
stored values in each one, directly beside each other bitwise, like a
bignum has, but without its overhead.

Regards,
Brandon Thomas



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
> Way back in this thread you implied that you had extremely large FILES
> containing FIXED SIZE RECORDS, from which you needed 
> to FILTER DUPLICATE records based on the value of a FIXED SIZE KEY
> field.
this is mostly correct. the data is state and state associated data on the 
fringe. hence the name of the algorithm: fringe search. states (keys) are fixed 
size. associated data (due to the operator sequence) is variable size. i didn't 
post that here. i sent you (george) an email directly wed at 9:49 am according 
to my sent email box. getting this piece of the algorithm to go faster, less 
memory, both? awesome.

my actual test file (just checked) is 633 mb. it is data from perhaps halfway 
through a search. the fringe for a 5x5 grows by about 9x each successive 
fringe. i say about 9x because as the fringes grow, the amount of redundancy 
will increase. when i hit the limits of my hardware and patience with this 
algorithm i was at 90% redundancy but that fringe file was huge. i still hadn't 
produced an answer for the problem and decided i needed to get the code to run 
in parallel. that was about 5 years ago. last spring i started reworking my old 
stuff to work with places and ran out of enthusiasm until about 2 weeks ago.

> Doesn't work for 6x6?  Well 36 6-bit values fit neatly into 216 bits
> (27 bytes).
the guy (korf) who did the paper on the 24 puzzle has attempted the 6x6 and 
failed. notice in the 24 puzzle paper that he was unable to solve one of those 
10 sample problems. 5x5 is what i'm after.



Re: [racket-users] Re: appending files

2016-01-27 Thread George Neuner

On 1/27/2016 10:50 AM, Brandon Thomas wrote:

On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas
>  wrote:
>
> > Is there anything stopping you from restructuring
> > the data on disk and using the hash directly from there
>
> Scotty's hash table is much larger than he thinks it is and very
> likely is being paged to disk already.  Deliberately implementing it
> as a disk file is unlikely to improve anything.

That's a good point to keep in mind. But there are advantages,
including faster startup time, using less ram+swap, easier to keep the
file updated and it makes it easier to make a resize solution. There
are probably more, but basically it's the reasons why all large
(key,value) storage solutions I've heard of use an explicit file instead
of swap.


I miscalculated the scale of Scotty's hash structure - it's not as bad 
as I thought initially.  But even so, it is of a scale where it is 
unwieldy and bound to have virtual memory problems unless the machine is 
dedicated.


Hashing is latency sensitive - it was designed to be a memory resident 
technique.  Obviously it _can_ be done using file based buckets ... the 
effect is of querying an ISAM database for every access.  The problem is 
that the latency increases by orders of magnitude: even from resident 
blocks, every access involves the file system API and the kernel.  You 
did mention (user space) caching, but that greatly complicates the solution.


Making the hash external I think is not a win - it definitely will 
handle much bigger files, but it will handle every file more slowly.  I 
think it is better to leverage the file system rather than fight it.


YMMV,
George



Re: [racket-users] Re: appending files

2016-01-27 Thread George Neuner

On 1/27/2016 9:49 AM, scott coston wrote:

yes, me neither (11pm for me). i just rechecked my spreadsheet and i typed in 
128 where i should have typed 256. but it doesn't really matter as my test 
file is small compared to where the file will be going.


Mine was worse - it's downright embarrassing. :-[  When I was running 
the numbers somehow I put in  6e10  as the scale - 60 billion 
instead of the 6 million you specified - and I didn't realize it until 
after I had sent the reply.


Your analysis of the hash size is quite reasonable (modulo allocator 
overhead).   I do think, though, that you can save considerable space 
with a byte array vs pointer array + bignums.




anyway, since you seem to have some ideas i have more info for you. take a look 
at this:
https://www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf


I'll take a look at it.

I haven't had to figure out an external merge filter for decades. It can 
be tricky when the number of files needs to be variable.  And also as I 
mentioned, the fact that you want it sorted on the non-filtered field is 
a complication.





From: racket-users@googlegroups.com <racket-users@googlegroups.com> on behalf of 
George Neuner <gneun...@comcast.net>
Sent: Wednesday, January 27, 2016 4:28 AM
To: racket-users@googlegroups.com
Subject: Re: [racket-users] Re: appending files

Sorry.   I shouldn't do math at 4am. Ignore the numbers.   However, it
is still correct that the byte array will use less space than an array
of bignums.
George





Re: [racket-users] Re: appending files

2016-01-27 Thread Brandon Thomas
On Wed, 2016-01-27 at 17:49 -0500, George Neuner wrote:
> On 1/27/2016 10:50 AM, Brandon Thomas wrote:
> > On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> > > On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas
> > >  wrote:
> > > 
> > > > Is there anything stopping you from restructuring
> > > > the data on disk and using the hash directly from there
> > > 
> > > Scotty's hash table is much larger than he thinks it is and very
> > > likely is being paged to disk already.  Deliberately implementing
> > > it
> > > as a disk file is unlikely to improve anything.
> > 
> > That's a good point to keep in mind. But there are advantages,
> > including faster startup time, using less ram+swap, easier to keep
> > the
> > file updated and it makes it easier to make a resize solution.
> > There
> > are probably more, but basically it's the reasons why all large
> > (key,value) storage solutions I've heard of use an explicit file
> > instead
> > of swap.
> 
> I miscalculated the scale of Scotty's hash structure - it's not as
> bad 
> as I thought initially.  But even so, it is of a scale where it is 
> unwieldy and bound to have virtual memory problems unless the machine
> is 
> dedicated.
> 
> Hashing is latency sensitive - it was designed to be a memory
> resident 
> technique.  Obviously it _can_ be done using file based buckets ...
> the 
> effect is of querying an ISAM database for every access.  The problem
> is 
> that the latency increases by orders of magnitude: even from
> resident 
> blocks, every access involves the file system API and the
> kernel.  You 
> did mention (user space) caching, but that greatly complicates the
> solution.
> Making the hash external I think is not a win - it definitely will 
> handle much bigger files, but it will handle every file more
> slowly.  I 
> think it is better to leverage the file system rather than fight it.
> 
> YMMV,
> George

Yup, the canonical solutions to storing large amounts of data scalably
are rather complicated and are work to implement (I wouldn't say
unmanageable though). And they get much harder if you wanted, say, ACID
properties. This is why it's been factored into libraries, like Redis
for example. If it can fit into ram efficiently enough, that's ok. If
he's planning to scale though, maybe it's worth looking at using
Racket's FFI with a library like Redis or something. It might be the
least work and the cleanest solution.

Regards,
Brandon Thomas



[racket-users] Re: appending files

2016-01-27 Thread George Neuner
On Wed, 27 Jan 2016 11:17:04 -0800 (PST), Scotty C
 wrote:

>On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:
>
>> What is this other field on which the file is sorted?
>this field is the cost in operators to arrive at the key value

Is it important to retain that sorting?  Or is it just informational?

>> In the worst case of every key being a bignum
>no, every key is contained within a bignum which can contain many many keys.

Then you're not using the hash in a conventional manner ... else the
filter entries would be unique ... and we really have no clue what
you're actually doing.  So any suggestions we give you are shots in
the dark.


>> Since you are only comparing the hash entries for equality, you could
>> save a lot of space [at the expense of complexity] by defining a
>> {bucket x chain_size x 16} byte array and storing the 16-byte keys
>> directly.
>i must be able to grow the chains. i can't make it fixed size like that.

You said previously:

>>> ... i use separate chaining with a vector of
>>> bignums. i am willing to let my chains run up to 5 keys per chain
>>> so i need a vector of 6 million pointers.

That reads as needing a maximum of  6M x 5  entries.  

Extensible structures don't have to be a list - a small array works
very well when the number of entries is bounded [which you said it
was].

Again this is a case of we really have no clue what you are doing.


>> > have another rather large bignum in memory that i use to reduce
>> >but not eliminate record duplication of about .5 gb. 
>> ???
>
>ha, ok. this is what this bignum is for. cycle elimination. a sequence
>of operators (2 bit per) when strung together is a number like 4126740
>which represents the operator sequence (0 0 3 0 0 3 1 1 2 2 1). i
>change that bit in the bignum from 0 to 1. during data generation i 
>look up my about to be applied operator sequence in the bignum. if i 
>see a one, i skip data generation. i'm not really happy with the volume
>of memory this takes but it is an insanely fast lookup and keeps a ton of
>data off the hard drive.
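
A minimal sketch of the scheme described in that quote — 2-bit operator codes packed into an index, with one big integer serving as the bit set; the names are illustrative:

#lang racket/base
;; sketch: operator-sequence bitmap for cycle elimination
(define (ops->index ops)                 ; ops : list of operator codes 0-3
  (for/fold ([n 0]) ([op (in-list ops)])
    (bitwise-ior (arithmetic-shift n 2) op)))

(define seen 0)                          ; the big integer acting as a bit set
(define (mark-seen! ops)
  (set! seen (bitwise-ior seen (arithmetic-shift 1 (ops->index ops)))))
(define (seen? ops)
  (bitwise-bit-set? seen (ops->index ops)))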


>> In the meantime, I would suggest you look up "merge sort" and it's
>logarithmic? not happening

By splitting the too-large data file and trying to process it in
sections, you're already O(M*N) where M is the number of sections.
Logarithmic might improve things.

Rejecting an approach due solely to its BigO is naive.  The order
tells how it scales to the limits ... not how it behaves in any
particular situation.  In most cases there are constants and other
linear multipliers that dominate the running time.

It matter not just how many times you touch the data, but also the
cost of each touch.

E.g., people read that Quicksort is O(N*LogN)  and ignore that it is
O(N*N) under certain circumstances.  Mergesort has a higher constant
multiplier, but it is  O(N*LogN)  in _all_ cases ... including those
that throw Quicksort into fits.  Plus Quicksort can't be used
effectively when the data set exceeds RAM.  Mergesort and some others
can be used regardless of the data set size.

Also remember that it's the pattern of data access that's important -
not what is actually being done with it.

Even when data fits in memory, random access algorithms often are not
the best.  Prediction, caching and paging concerns often make the most
effective algorithm require complex rearrangement and staging of data
that is unnatural wrt the overall approach to the problem.

Disk file systems are optimized for sequential access.  When data is
file based, sequential access is to be preferred unless there's no
other choice.  

George



Re: [racket-users] Re: appending files

2016-01-27 Thread Brandon Thomas
On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas
>  wrote:
> 
> > Is there anything stopping you from restructuring
> > the data on disk and using the hash directly from there
> 
> Scotty's hash table is much larger than he thinks it is and very
> likely is being paged to disk already.  Deliberately implementing it
> as a disk file is unlikely to improve anything.
> 
> George
> 

That's a good point to keep in mind. But there are advantages,
including faster startup time, lower ram+swap use, easier updating of
the file, and an easier path to a resize solution. There are probably
more, but basically these are the reasons why every large (key,value)
storage solution I've heard of uses an explicit file instead of swap.

Regards,
Brandon Thomas



Re: [racket-users] Re: appending files

2016-01-27 Thread Brandon Thomas
On Tue, 2016-01-26 at 22:48 -0800, Scotty C wrote:
> ok brandon, that's a thought. build the hash on the hard drive at the
> time of data creation. you mention collision resolution. so let me
> build my hash on the hard drive using my 6 million buckets but
> increase the size of each bucket from 5 slots to 20. right?
> i can't exactly recreate my vector/bignum hash on the hard drive
> because i can't dynamically resize the buckets like i can the
> bignums. this gives me a 4 gb file whereas my original was 1 gb. i
> have enough space for that so that's not a problem.

Sure, a bigger file works. Your bignums are probably implemented using
linked lists, if I'm not mistaken. That's why they are O(1) resizable.
You could do the same thing on hdd. The problem is that you might need
to seek and read through the file more with linked lists. Since you're
probably paging, your implementation also has this same problem. You
can get rid of this problem in both by using vectors though, as you
suggest. And you can have a process in the background that resizes the
file (by creating a new one) when the current one gets too full, so you
don't block operations.

> so as my buckets fill up they head towards the average of 5 data
> items per bucket. so on average here's what happens with each hd hash
> record. i go to my hd hash and read 3.5 (think about it) items and
> 90% of the time i don't find my data so i do a write. in my process i
> do an initial write, then a read, a write, a read, a write. compare:
> 3.5 vs 2 reads; 1 vs 3 writes. the reads are more costly and if i
> exceed 20 items in a bucket the hd hash breaks. what do you think? is
> it worth it?
> 

I think that implementation has room for improvement. Let's say you
look for a record, it does not exist, and you need to write it. So, you
seek to the location, read the bucket into ram (1 read). If the entry
isn't in the bucket, you seek to the location in the bucket and write
your entry (1 write). Of course to do this, you'd think you'd need to
make a function for your hash that searches for an entry, and uses a
fallback to write the record if it fails. That's a reasonable function
though. But you can also introduce a cache, and have a background
process update the file (that way your hash operations aren't waiting
on the hdd). Though your operating system's file cache might already be
taking care of this for you, so you might want to do some benchmarking.
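
For concreteness, a rough sketch of that search-with-fallback function
might look something like this (the layout, the constants and all the
names here are made up, and it assumes the file has been pre-sized so
that every bucket already exists on disk):

;; Hypothetical layout: BUCKET-SLOTS fixed-size records per bucket,
;; REC-BYTES bytes per record, keyed on the first KEY-BYTES bytes.
;; An all-zero record marks an empty slot.
(define BUCKET-SLOTS 20)
(define REC-BYTES 32)
(define KEY-BYTES 16)

(define db-in  (open-input-file  "hash.db"))
(define db-out (open-output-file "hash.db" #:exists 'update))

;; Returns #t if the key is already present; otherwise writes the record
;; into the first empty slot of its bucket and returns #f.  If the bucket
;; is full, the record is silently dropped - that's the overflow case you
;; mentioned, so a real version would have to handle it.
(define (lookup-or-insert! bucket key record)
  (define base (* bucket BUCKET-SLOTS REC-BYTES))
  (file-position db-in base)
  (define chunk (read-bytes (* BUCKET-SLOTS REC-BYTES) db-in))
  (define (slot-bytes i)
    (subbytes chunk (* i REC-BYTES) (* (add1 i) REC-BYTES)))
  (cond
    [(for/or ([i (in-range BUCKET-SLOTS)])
       (equal? key (subbytes (slot-bytes i) 0 KEY-BYTES)))
     #t]
    [else
     (define free (for/first ([i (in-range BUCKET-SLOTS)]
                              #:when (for/and ([b (in-bytes (slot-bytes i))])
                                       (zero? b)))
                     i))
     (when free
       (file-position db-out (+ base (* free REC-BYTES)))
       (write-bytes record db-out)
       (flush-output db-out))
     #f]))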

There are probably several other optimizations you can make, and you
can probably find them by going through the source/docs of other
(key,value) storage solutions. Or, if you want to cut down on work,
you can just use one of those other (key,value) storage solutions
directly via libffi.

Regards,
Brandon Thomas



[racket-users] Re: appending files

2016-01-27 Thread Scotty C
On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:

> What is this other field on which the file is sorted?
this field is the cost in operators to arrive at the key value

> WRT a set of duplicates: are you throwing away all duplicates? Keeping
> the 1st one encountered?  Something else?
keep first instance, chuck the rest

> This structure uses a lot more space than necessary.  Where did you
> get the idea that a bignum is 10 bytes?
not sure about the 10 bytes. if i shove 5 128-bit keys into a bignum is that 
about 80 bytes plus some overhead? so 80 bytes times my 6 million buckets is 
480 mb not including overhead.
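
quick check of that arithmetic in racket (key payload only, no bignum or 
allocator overhead counted):

(* 5 16 6000000) ; 5 keys/bucket * 16 bytes/key * 6 million buckets = 480000000 bytes, ~480 mb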

> In the worst case of every key being a bignum
no, every key is contained within a bignum which can contain many many keys.

> Since you are only comparing the hash entries for equality, you could
> save a lot of space [at the expense of complexity] by defining a
> {bucket x chain_size x 16} byte array and storing the 16-byte keys
> directly.
i must be able to grow the chains. i can't make it fixed size like that.

> > have another rather large bignum in memory that i use to reduce
> >but not eliminate record duplication of about .5 gb. 
> ???

ha, ok. this is what this bignum is for: cycle elimination. a sequence of 
operators (2 bits per operator) strung together forms a number like 4126740, 
which represents the operator sequence (0 0 3 0 0 3 1 1 2 2 1). that number is 
a bit position in the bignum, and i change that bit from 0 to 1. during data 
generation i look up my about-to-be-applied operator sequence in the bignum. if 
i see a one, i skip data generation. i'm not really happy with the volume of 
memory this takes but it is an insanely fast lookup and keeps a ton of data off 
the hard drive.
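
in code the scheme looks roughly like this (a sketch of the description 
above, not my actual code, and the names are made up):

; pack a list of 2-bit operators into one integer, then use that integer
; as a bit position in a single bignum bitmap
(define (ops->index ops)          ; e.g. '(0 0 3 0 0 3 1 1 2 2 1)
  (for/fold ([n 0]) ([op (in-list ops)])
    (bitwise-ior (arithmetic-shift n 2) op)))

(define seen 0)                   ; the big bitmap

(define (seen? ops)
  (bitwise-bit-set? seen (ops->index ops)))

(define (mark! ops)
  (set! seen (bitwise-ior seen (arithmetic-shift 1 (ops->index ops)))))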

> In the meantime, I would suggest you look up "merge sort" and it's
logarithmic? not happening



[racket-users] Re: appending files

2016-01-27 Thread Scotty C
> Is it important to retain that sorting?  Or is it just informational?
it's important

> Then you're not using the hash in a conventional manner ... else the
> filter entries would be unique ... and we really have no clue what
> you're actually doing.  So any suggestions we give you are shots in
> the dark.
using it conventionally? absolutely. it is a hash with separate chaining. will 
a bit of code help?

(define (put p)
  (define kh (hashfn p))
  (vector-set! tble kh
               (bitwise-ior (arithmetic-shift (vector-ref tble kh) shftl) p)))
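
the matching membership test just walks a chain one key-width chunk at a 
time. something like this (sketch only, assuming shftl is the key width 
in bits and that 0 is never a valid key):

(define (got? p)
  (define kh (hashfn p))
  (let loop ([chain (vector-ref tble kh)])
    (cond [(zero? chain) #f]          ; empty bucket / end of chain
          [(= (bitwise-and chain (sub1 (arithmetic-shift 1 shftl))) p) #t]
          [else (loop (arithmetic-shift chain (- shftl)))])))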



[racket-users] Re: appending files

2016-01-27 Thread George Neuner

Hi Scotty,

I rearranged your message a bit for (my own) clarity.


On Tue, 26 Jan 2016 18:40:28 -0800 (PST), Scotty C
 wrote:

>running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. 

>the generated keys are random but i use one of the associated 
>fields for sorting during the initial write to the hard drive. what goes 
>in each of those files is totally random but dupes do not run across
>files. also, the number of keys is >1e25.

>my key is 128 bits with ~256 bits per record. so my 1 gb file contains
>~63 million records and ~32 million keys. about 8% will be dupes 
>leaving me with ~30 million keys. 

You have 63 million 256-bit records in a 1GB file?  The description of
your "records" is vague and can be interpreted more than 1 way, but my
math says:

  256 bits (128 key + 128 data) = 32M records
  384 bits (128 key + 256 data) = ~22M records
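
For reference, the arithmetic (taking 1 GB as 2^30 bytes):

  (/ (* (expt 2 30) 8) 256)   ; => 33554432, roughly 32 million records
  (/ (* (expt 2 30) 8) 384)   ; => 22369621 1/3, roughly 22 million records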
 

Anyway ... I'll assume for filtering only the keys are important.

If I understand correctly, you have a 128-bit "key space" and that you
expect to see some small percentage of the keys duplicated.   The file
is sorted on some other non-key value.

What is this other field on which the file is sorted?

WRT a set of duplicates: are you throwing away all duplicates? Keeping
the 1st one encountered?  Something else?


>i run a custom built hash. i use separate chaining with a vector of
>bignums. i am willing to let my chains run up to 5 keys per chain
>so i need a vector of 6 million pointers. that's 48 mb for the array.
>another 480 mb for the bignums. let's round that sum to .5 gb. 

This structure uses a lot more space than necessary.  Where did you
get the idea that a bignum is 10 bytes?

Unless something changed recently, Racket uses the GMP library for
multiple precision math.  GMP bignums require 2 allocations: a struct
plus an array.  Representing a full size 128-bit value on a 64-bit
machine takes 40-48 bytes depending on the library's int size, and not
including any allocator overhead.

In the worst case of every key being a bignum, your hash is using 2.6
to 3.1 GB ... not 0.5 GB.  Again, not including allocator overhead.

Since you are only comparing the hash entries for equality, you could
save a lot of space [at the expense of complexity] by defining a
{bucket x chain_size x 16} byte array and storing the 16-byte keys
directly.  

Even using an auxiliary vector for tracking residency in the array,
you'd be down under 1GB for the same 6 million hash entries.
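
A bare-bones sketch of that layout (the names and sizes are illustrative
only, and nothing here handles a chain overflowing its CHAIN-SIZE slots):

;; NUM-BUCKETS chains, CHAIN-SIZE slots per chain, KEY-BYTES bytes per key
(define NUM-BUCKETS 6000000)
(define CHAIN-SIZE  5)
(define KEY-BYTES   16)
(define table  (make-bytes (* NUM-BUCKETS CHAIN-SIZE KEY-BYTES) 0))
(define counts (make-vector NUM-BUCKETS 0))   ; residency per bucket

(define (slot-offset bucket slot)
  (* KEY-BYTES (+ (* bucket CHAIN-SIZE) slot)))

;; store a 16-byte key in the next free slot of its bucket
(define (insert! bucket key)
  (define slot (vector-ref counts bucket))
  (bytes-copy! table (slot-offset bucket slot) key)
  (vector-set! counts bucket (add1 slot)))

;; is a 16-byte key already present in its bucket?
(define (present? bucket key)
  (for/or ([slot (in-range (vector-ref counts bucket))])
    (equal? key (subbytes table (slot-offset bucket slot)
                          (+ (slot-offset bucket slot) KEY-BYTES)))))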

Even so, this seems like a bad approach to me.


> have another rather large bignum in memory that i use to reduce
>but not eliminate record duplication of about .5 gb. 

???

>i'm attempting to get this thing to run in 2 places so i need 2 hashes.
>add this up .5+.5+.5 is 1.5 gb and that gets me to about my memory 
>limit. 


I don't have time just now to sketch a split/merge solution ... I'll
get back to it later.  The file already being sorted on an unfiltered
field is a complication.  Offhand I don't think the sort order can be
maintained through the filtering ... meaning the filtered file would
need to be sorted again afterward.  

But even so it might be faster than your current method.

In the meantime, I would suggest you look up "merge sort" and its
application to "external" sorting.  That will give you the basic idea
of what I'm proposing.

George



Re: [racket-users] Re: appending files

2016-01-27 Thread George Neuner
Sorry.   I shouldn't do math at 4am. Ignore the numbers.   However, it 
is still correct that the byte array will use less space than an array 
of bignums.

George


On 1/27/2016 3:54 AM, George Neuner wrote:

i run a custom built hash. i use separate chaining with a vector of
bignums. i am willing to let my chains run up to 5 keys per chain
so i need a vector of 6 million pointers. that's 48 mb for the array.
another 480 mb for the bignums. let's round that sum to .5 gb.

This structure uses a lot more space than necessary.  Where did you
get the idea that a bignum is 10 bytes?

Unless something changed recently, Racket uses the GMP library for
multiple precision math.  GMP bignums require 2 allocations: a struct
plus an array.  Representing a full size 128-bit value on a 64-bit
machine takes 40-48 bytes depending on the library's int size, and not
including any allocator overhead.

In the worst case of every key being a bignum, your hash is using 2.6
to 3.1 GB ... not 0.5 GB.  Again, not including allocator overhead.

Since you are only comparing the hash entries for equality, you could
save a lot of space [at the expense of complexity] by defining a
{bucket x chain_size x 16} byte array and storing the 16-byte keys
directly.

Even using an auxiliary vector for tracking residency in the array,
you'd be down under 1GB for the same 6 million hash entries.

Even so, this seems like a bad approach to me.





[racket-users] Re: appending files

2016-01-27 Thread George Neuner
On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas
 wrote:

>Is there anything stopping you from restructuring
>the data on disk and using the hash directly from there

Scotty's hash table is much larger than he thinks it is and very
likely is being paged to disk already.  Deliberately implementing it
as a disk file is unlikely to improve anything.

George



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
ok brandon, that's a thought. build the hash on the hard drive at the time of 
data creation. you mention collision resolution. so let me build my hash on the 
hard drive using my 6 million buckets but increase the size of each bucket from 
5 slots to 20. right? i can't exactly recreate my vector/bignum hash on the 
hard drive because i can't dynamically resize the buckets like i can the 
bignums. this gives me a 4 gb file whereas my original was 1 gb. i have enough 
space for that so that's not a problem. so as my buckets fill up they head 
towards the average of 5 data items per bucket. so on average here's what 
happens with each hd hash record. i go to my hd hash and read 3.5 (think about 
it) items and 90% of the time i don't find my data so i do a write. in my 
process i do an initial write, then a read, a write, a read, a write. compare: 
3.5 vs 2 reads; 1 vs 3 writes. the reads are more costly and if i exceed 20 
items in a bucket the hd hash breaks. what do you think? is it worth it?



Re: [racket-users] Re: appending files

2016-01-26 Thread Brandon Thomas
On Tue, 2016-01-26 at 18:40 -0800, Scotty C wrote:
> alright george, i'm open to new ideas. here's what i've got going.
> running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. my
> key is 128 bits with ~256 bits per record. so my 1 gb file contains
> ~63 million records and ~32 million keys. about 8% will be dupes
> leaving me with ~30 million keys. i run a custom built hash. i use
> separate chaining with a vector of bignums. i am willing to let my
> chains run up to 5 keys per chain so i need a vector of 6 million
> pointers. that's 48 mb for the array. another 480 mb for the bignums.
> let's round that sum to .5 gb. i have another rather large bignum in
> memory that i use to reduce but not eliminate record duplication of
> about .5 gb. i'm attempting to get this thing to run in 2 places so i
> need 2 hashes. add this up .5+.5+.5 is 1.5 gb and that gets me to
> about my memory limit. the generated keys are random but i use one of
> the associated fields for sorting during the initial write to the
> hard drive. what goes in each of those files is totally random but
> dupes do not run across files. also, the number of keys is >1e25.
> 

Sorry, I haven't read through the entire conversation, so I hope I'm
not missing anything. Is there anything stopping you from restructuring
the data on disk and using the hash directly from there (possibly with
the help of a cache if speed is important)? For example, let's say each
entry is 256 bits. Use something like "(file-position dbfile (* 32
(hash-custom key)))" to seek over to the appropriate entry on disk and
read just the entry you need (using whatever collision resolution).
Then you'll be using no auxiliary memory (unless you're caching, which
can just be a smaller ram hash table). Unless of course I'm just
missing something completely.

Regards,
Brandon Thomas



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
robby findler, you the man. i like the copy-port idea. i incorporated it and it 
is nice and fast and easily fit into the existing code.
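
for anyone finding this later, the copy-port bit boils down to something 
like this (sketch only, names made up):

(require racket/port)

; append the contents of src onto the end of dst
(define (append-file! dst src)
  (call-with-output-file dst #:exists 'append
    (lambda (out)
      (call-with-input-file src
        (lambda (in) (copy-port in out))))))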



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
neil van dyke, i have used the system function before but had forgotten what it 
was called and couldn't find it as a result in the documentation. my problem 
with using the system function is that i need 2 versions of it: windoz and 
linux. the copy-port function is a write-once, use-across-multiple-os solution. 
sweet.



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
gneuner2 (george), you are over thinking this thing. my test data of 1 gb is 
but a small sample file. i can't even hash that small 1 gb at the time of data 
creation. the hashed data won't fit in ram. at the time i put the redundant 
data on the hard drive, i do some constant time sorting so that the redundant 
data on the hard drive is contained in roughly 200 usefully sorted files. some 
of these files will be small and can be hashed with a single read, hash and 
write. some will be massive (data won't fit in ram) and must be split further. 
this produces another type of single read, hash and write. these split 
files can now be fully hashed which means a second read, hash and write. 
recombining the second level files is virtually instantaneous (copy-port) 
relative to the effort spent to get to that point. all of these operations are 
constant time. it would be nice to cut into that big fat hard drive induced C 
but i can't do it with a single read and write on the larger files.



Re: [racket-users] Re: appending files

2016-01-26 Thread George Neuner

On 1/26/2016 2:51 PM, Scotty C wrote:

gneuner2 (george), you are over thinking this thing. my test data of 1 gb is 
but a small sample file. i can't even hash that small 1 gb at the time of data 
creation. the hashed data won't fit in ram. at the time i put the redundant 
data on the hard drive, i do some constant time sorting so that the redundant 
data on the hard drive is contained in roughly 200 usefully sorted files. some 
of these files will be small and can be hashed with a single read, hash and 
write. some will be massive (data won't fit in ram) and must be split further. 
this produces another type of single read, hash and write. these split 
files can now be fully hashed which means a second read, hash and write. 
recombining the second level files is virtually instantaneous (copy-port) 
relative to the effort spent to get to that point. all of these operations are 
constant time. it would be nice to cut into that big fat hard drive induced C 
but i can't do it with a single read and write on the larger files.



I reject the term "over thinking".

You may have created something that works, but that doesn't mean it 
can't be improved.  Remember that programs running in just 10's of 
kilobytes routinely processed data files containing 10's of megabytes.  
Modern programmers have forgotten - or never knew - how to work with 
very large data sets that don't necessarily fit into memory.


For example:  it matters greatly what is being hashed and how.  With 
Racket hashes, using a string key retains the string for comparison - so 
hashing a file containing a large percentage of unique strings produces 
a very large hash ... in the worst case the hash may be bigger than the 
data file.  If the strings are suitably long, then, e.g., 
crypto-digesting them down to ~20..32 byte values and hashing digests 
instead of strings might allow the hash to be memory resident or let you 
process much larger files using the same amount of memory.
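
For example, something along these lines (a sketch only; file/sha1 ships
with Racket, and the names here are mine):

(require file/sha1)

;; key the table on a 20-byte digest of the string, not the string itself
(define (digest-key s)
  (sha1-bytes (open-input-string s)))

(define seen (make-hash))

;; #t if an equal string (strictly, an equal digest) was seen before;
;; records the digest as a side effect
(define (duplicate? s)
  (define k (digest-key s))
  (begin0 (hash-ref seen k #f)
          (hash-set! seen k #t)))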



If the records are sorted on the field(s) that may be duplicated, then 
by splitting the file and using multi-file processing techniques 
duplicates can be filtered from all the sections without hashing and in 
a single pass through the whole of the data - leaving a single sorted 
file as output.   If your input data is unordered, then as you split the 
file you sort each section individually with an in-memory sort.  The 
re-merge step - which is where the filtering is performed - sorts the 
output.


This is very efficient and it works with stupendously huge data files 
because the merge step can (more or less) simultaneously process as many 
files (split sections) as data _records_ will fit into memory.
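
The heart of the re-merge step, run over two already-sorted (and
individually de-duplicated) runs, is nothing more than this.  The sketch
below works on in-memory lists of keys for clarity; the real thing
streams records from the split files and would be written iteratively:

;; merge two ascending runs, keeping the first copy of any duplicated key
(define (merge-dedupe as bs)
  (cond [(null? as) bs]
        [(null? bs) as]
        [(= (car as) (car bs))                  ; duplicate across runs:
         (merge-dedupe as (cdr bs))]            ; keep the first, drop the other
        [(< (car as) (car bs))
         (cons (car as) (merge-dedupe (cdr as) bs))]
        [else
         (cons (car bs) (merge-dedupe (cdr as) bs))]))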



And there are times when you should give up and embrace a DBMS, if only 
for its proficiency at disk-based data handling.


YMMV,
George



Re: [racket-users] Re: appending files

2016-01-26 Thread Neil Van Dyke
+1 on George Neuner's comments about how one can do smart processing of 
huge files in small space.  (I almost said something about that myself, 
but didn't have time to get into that kind of discussion, so I stuck to 
only the simpler file concatenation question.)


BTW, students who have 8GB RAM and 256GB of SSD in their cheap laptop, 
and who are looking for relevance of learning how to compute with 
resource constraints, should note the current buzzwords "big data", 
"Internet of Things", etc.  (Most any 12 year-old can learn to code a 
phone/tablet app, throw up an AJAX-y Web site, or make 3D game 
assets/mods.  One of the reasons that professional software engineers 
have to keep learning is so that we can do things that 12yos can't.)


Neil V.



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
alright george, i'm open to new ideas. here's what i've got going. running 64 
bit linux mint OS on a 2 core laptop with 2 gb of ram. my key is 128 bits with 
~256 bits per record. so my 1 gb file contains ~63 million records and ~32 
million keys. about 8% will be dupes leaving me with ~30 million keys. i run a 
custom built hash. i use separate chaining with a vector of bignums. i am 
willing to let my chains run up to 5 keys per chain so i need a vector of 6 
million pointers. that's 48 mb for the array. another 480 mb for the bignums. 
let's round that sum to .5 gb. i have another rather large bignum in memory 
that i use to reduce but not eliminate record duplication of about .5 gb. i'm 
attempting to get this thing to run in 2 places so i need 2 hashes. add this up 
.5+.5+.5 is 1.5 gb and that gets me to about my memory limit. the generated 
keys are random but i use one of the associated fields for sorting during the 
initial write to the hard drive. what goes in each of those files is totally 
random but dupes do not run across files. also, the number of keys is >1e25.
