[racket-users] Re: appending files
On Sunday, January 31, 2016 at 12:13:31 AM UTC-6, Scotty C wrote:
> > that's what i did. so new performance data. this is with bytes instead of
> > strings for data on the hard drive but bignums in the hash still.
> >
> > as a single large file and a hash with 203 buckets for 26.6 million
> > records the data rate is 98408/sec.
> >
> > when i split and go with 11 smaller files and hash with 59 buckets the
> > data rate is 106281/sec.
>
> hash is reworked, bytes based. same format though, vector of bytes. so time
> test results:
>
> single large file same # buckets as above data rate 175962/sec.
>
> 11 smaller files same # buckets as above data rate 205971/sec.

throughput update. i had to hand code some of the stuff (places are just not working for me) but i just managed to hack my way through running this in parallel. i copied the original 26.6 million records to a new file. ran two slightly reworked copies of my duplicate removal code at a shell prompt like this:

racket ddd-parallel.rkt & racket ddd-parallel1.rkt &

i'm not messing with the single large file anymore. so for twice the data the data rate is up to 356649/sec.

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
[racket-users] Re: appending files
> Yes. You probably do need to convert the files. Your original
> coding likely is not [easily] compatible with binary I/O.

that's what i did. so new performance data. this is with bytes instead of strings for data on the hard drive but bignums in the hash still.

as a single large file and a hash with 203 buckets for 26.6 million records the data rate is 98408/sec.

when i split and go with 11 smaller files and hash with 59 buckets the data rate is 106281/sec.

clearly it is quicker to read/write twice than to read/write once. but of course my laptop is pretty sad and my hash may end up being sad as well. with these revised and quicker rates, i'm ready to migrate the hash to bytes.
Re: [racket-users] Re: appending files
Fixed, thanks for the report!

On Sat, Jan 30, 2016 at 8:31 PM, Scotty C wrote:
> just found a small mistake in the documentation: can you find it?
>
> (numerator q) → integer?
>   q : rational?
> Coerces q to an exact number, finds the numerator of the number expressed
> in its simplest fractional form, and returns this number coerced to the
> exactness of q.
>
> (denominator q) → integer?
>   q : rational?
> Coerces q to an exact number, finds the numerator of the number expressed
> in its simplest fractional form, and returns this number coerced to the
> exactness of q.
[racket-users] Re: appending files
just found a small mistake in the documentation: can you find it?

(numerator q) → integer?
  q : rational?
Coerces q to an exact number, finds the numerator of the number expressed in its simplest fractional form, and returns this number coerced to the exactness of q.

(denominator q) → integer?
  q : rational?
Coerces q to an exact number, finds the numerator of the number expressed in its simplest fractional form, and returns this number coerced to the exactness of q.
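For readers skimming past the puzzle: the two entries above read identically because the denominator description reuses the numerator wording. A quick REPL check (my illustration, not part of the original message) confirms the function itself behaves correctly:

```racket
#lang racket
;; The denominator entry's prose says "finds the numerator", but the
;; function returns the denominator as expected:
(numerator 6/4)    ; → 3  (6/4 reduces to 3/2)
(denominator 6/4)  ; → 2
```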
[racket-users] Re: appending files
> that's what i did. so new performance data. this is with bytes instead of
> strings for data on the hard drive but bignums in the hash still.
>
> as a single large file and a hash with 203 buckets for 26.6 million
> records the data rate is 98408/sec.
>
> when i split and go with 11 smaller files and hash with 59 buckets the
> data rate is 106281/sec.

hash is reworked, bytes based. same format though, vector of bytes. so time test results:

single large file, same # buckets as above: data rate 175962/sec.

11 smaller files, same # buckets as above: data rate 205971/sec.

i played around with the # buckets parameter but what worked for bignums worked for bytes too. overall speed has nearly doubled. very nice, thanks to all who contributed some ideas. and to think, all i wanted to do was paste some files together.
Re: [racket-users] Re: appending files
> > question for you all. right now i use modulo on my bignums. i know i
> > can't do that to a byte string. i'll figure something out. if any of
> > you know how to do this, can you post a method?
>
> I'm not sure what you're asking exactly.

i'm talking about getting the hash index of a key. see, my key is a bignum and i get the hash index with (modulo key 611). so i either need to turn the key (which will be a byte string) into a number and stuff that right in where i have key, or i replace modulo.
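One way to keep the (modulo key 611) approach with byte-string keys is to read the bytes back into an exact integer first. This is a sketch added for illustration (not code from the thread); it treats the byte string as a big-endian unsigned number:

```racket
#lang racket
;; Sketch: fold the bytes of the key into one exact integer,
;; then reduce it modulo the bucket count.
(define (bytes->bucket bs buckets)
  (modulo (for/fold ([n 0]) ([b (in-bytes bs)])
            (+ (* n 256) b))
          buckets))

(bytes->bucket (bytes 1 2 3 4) 611)  ; same result as (modulo 16909060 611)
```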
Re: [racket-users] Re: appending files
On Fri, Jan 29, 2016 at 7:04 PM, Scotty C wrote:
> > > question for you all. right now i use modulo on my bignums. i know i
> > > can't do that to a byte string. i'll figure something out. if any of
> > > you know how to do this, can you post a method?
> >
> > I'm not sure what you're asking exactly.
>
> i'm talking about getting the hash index of a key. see my key is a bignum
> and i get the hash index with (modulo key 611). so i either need to
> turn the key (which will be a byte string) into a number. stuff that right
> in where i have key. or i replace modulo.

Sorry -- I haven't been following this thread closely, so I apologize if this question was already answered: Did you implement your own hash tables, or are you using Racket's built-in hashes?

Because if you're using the built-in ones, there's no reason to do this. You just use your object (a bignum, a byte string, or anything else) as the key. (The hash code is generated by `equal-hash-code`, but you don't have to call it yourself. The hash implementation does that.)

However, if you have implemented your own, you can still call `equal-hash-code` on your byte string. That will give you an exact integer, and you can then call modulo (or remainder) on that.

-Jon
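A minimal sketch of that second suggestion (custom table, built-in hash code); the 611 bucket count is carried over from earlier in the thread:

```racket
#lang racket
;; equal-hash-code works on byte strings (and any other value),
;; so a hand-rolled table can compute its bucket index like this:
(define (bucket-index key buckets)
  (modulo (equal-hash-code key) buckets))

(bucket-index #"0123456789abcdef" 611)  ; an exact integer in [0, 610]
```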
Re: [racket-users] Re: appending files
> However, if you have implemented your own, you can still call
> `equal-hash-code`

yes, my own hash. i think equal-hash-code will work.
Re: [racket-users] Re: appending files
On Fri, Jan 29, 2016 at 7:45 PM, Scotty C wrote:
>
> my plan right now is to rework my current hash so that it runs byte
> strings instead of bignums.
>
> i have a new issue. i wrote my data as char and end records with 'return.
> i use (read-line x 'return) and the first record is 15 char. when i use
> (read-bytes-line x 'return) i get 23 byte. i have to assume that my old
> assumption that an 8 bit char would write to disk as 8 bits is incorrect?
>
> from documentation on read-char:
> Reads a single character from in—which may involve reading several bytes
> to UTF-8-decode them into a character
>
> i get the feeling that i will need to read the entire file as i used to
> read it, taking each record and doing the following:
> - convert the string record to a bignum record
> - convert the bignum record into a byte string
> - write the byte string to a new data file
>
> does that seem right?

I don't know for sure, because I don't know what your file format is like (apologies again, if I missed it; I *did* realize after I sent my last message that you are using a custom hash table).

But, if you're just taking the first N bytes of a record and using that as your key, then there's no reason to go through a bignum. You can read bytes (as opposed to characters) from an input port. Just use `read-bytes` or one of its variants.

-Jon
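A sketch of that suggestion with fixed-size binary records; the 16-byte record size is assumed for illustration, and an in-memory port stands in for the real file:

```racket
#lang racket
;; Read fixed-size binary records with read-bytes instead of read-line.
(define in (open-input-bytes (make-bytes 48 0)))  ; three fake 16-byte records
(let loop ([count 0])
  (define rec (read-bytes 16 in))
  (if (eof-object? rec)
      count
      (loop (add1 count))))  ; → 3
```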
Re: [racket-users] Re: appending files
On Fri, 2016-01-29 at 13:00 -0800, Scotty C wrote:
> ok, had time to run my hash on my one test file
> '(611 1 1 19 24783208 4.19)
> this means
> # buckets
> % buckets empty
> non empty bucket # keys least
> non empty bucket # keys most
> total number of keys
> average number of keys per non empty bucket
>
> it took 377 sec.
> original # records is 26570359 so 6.7% dupes.
> processing rate is 70478/sec
>
> my plan right now is to rework my current hash so that it runs byte
> strings instead of bignums. i will probably be tomorrow afternoon
> before i have more stats.
>
> question for you all. right now i use modulo on my bignums. i know i
> can't do that to a byte string. i'll figure something out. if any of
> you know how to do this, can you post a method?

I'm not sure what you're asking exactly. Because a byte string is an array, all you need to do is jump to the byte your information is in and then get your bit out.

So let's say you want to get the nth bit in an array of 8-bit bytes. You first need to calculate which byte your bit is in, which is clearly (bytes-ref data (floor (/ n 8))). Then you get the bit out of it, which would be something (untested) like (modulo (arithmetic-shift (bytes-ref data (floor (/ n 8))) (* -1 (modulo n 8))) 2). This is easy enough to modify for elements that are a divisor of 8 in size.

Regards,
Brandon Thomas
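A runnable version of the untested expression above (my cleanup, not from the thread), with quotient/remainder in place of floor and the negated modulo:

```racket
#lang racket
;; Extract the nth bit (counting from each byte's low-order bit)
;; of a byte string; equivalent to the quoted (untested) expression.
(define (bit-ref data n)
  (bitwise-and (arithmetic-shift (bytes-ref data (quotient n 8))
                                 (- (remainder n 8)))
               1))

(bit-ref (bytes #b00000100) 2)  ; → 1
```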
[racket-users] Re: appending files
> i get the feeling that i will need to read the entire file as i used to
> read it, taking each record and doing the following:
> - convert the string record to a bignum record
> - convert the bignum record into a byte string
> - write the byte string to a new data file
>
> does that seem right?

never mind. this is indeed what i needed to do. the new file is 438.4 mb. the time to read, hash, write is now 317 seconds. processing rate is 83818/sec. the hash still uses bignums. the speed change is just from reading and writing bytes instead of strings.

the drop in file size is about 30%. the gain in speed is about 15%.
[racket-users] Re: appending files
On Fri, 29 Jan 2016 16:45:40 -0800 (PST), Scotty C wrote:
> i have a new issue. i wrote my data as char and end records with 'return. i
> use (read-line x 'return) and the first record is 15 char. when i use
> (read-bytes-line x 'return) i get 23 byte. i have to assume that my old
> assumption that an 8 bit char would write to disk as 8 bits is incorrect?

Yes and no. Characters are not "bytes". If you wrote the file originally in text mode, then it is subject to locale settings in the OS. Racket can handle some different character representations, but you have to select them deliberately. If you want to work with byte strings you are going to do file I/O in binary mode.

Also, note that if you display/print a byte string you are seeing character representations of the bytes. There will be extraneous '\' escapes for values that don't represent printable characters. The small values you will be using all are unprintable ASCII control characters.

> from documentation on read-char:
> Reads a single character from in—which may involve reading several
> bytes to UTF-8-decode them into a character
>
> i get the feeling that i will need to read the entire file as i used to
> read it, taking each record and doing the following:
> - convert the string record to a bignum record
> - convert the bignum record into a byte string
> - write the byte string to a new data file
>
> does that seem right?

Yes. You probably do need to convert the files. Your original coding likely is not [easily] compatible with binary I/O.

George
[racket-users] Re: appending files
On Thu, 28 Jan 2016 20:32:08 -0800 (PST), Scotty C wrote:
> > (current-memory-use)
> yup, tried that a while back didn't like what i saw. check this out:
>
> > (current-memory-use)
> 581753864
> > (current-memory-use)
> 586242568
> > (current-memory-use)
> 591181736
> > (current-memory-use)
> 595527064
>
> does that work for you?

Yes. You have to remember that evaluating the function and printing the result takes memory, and memory is reclaimed only by GC. So everything you do takes a little bit.

I'm not aware of any other way. There is (dump-memory-stats) but it is undocumented. [another opportunity to go spelunking in the code]

But you should be interested in comparing the magnitude of one implementation vs another ... you're looking to save megabytes, not a few kilobytes.

George
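One way to make those creeping numbers less noisy (my addition, not from the thread) is to force a major collection before each sample, so only live data is counted:

```racket
#lang racket
;; Force a full GC before and after allocating, then diff the samples.
;; The 1 MB byte string is a stand-in for the real structure being measured.
(collect-garbage)
(define before (current-memory-use))
(define data (make-bytes (* 1024 1024)))
(collect-garbage)
(printf "approx. cost: ~a bytes\n" (- (current-memory-use) before))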
[racket-users] Re: appending files
> my plan right now is to rework my current hash so that it runs byte
> strings instead of bignums.

i have a new issue. i wrote my data as char and end records with 'return. i use (read-line x 'return) and the first record is 15 char. when i use (read-bytes-line x 'return) i get 23 byte. i have to assume that my old assumption that an 8 bit char would write to disk as 8 bits is incorrect?

from documentation on read-char:
Reads a single character from in—which may involve reading several bytes to UTF-8-decode them into a character

i get the feeling that i will need to read the entire file as i used to read it, taking each record and doing the following:
- convert the string record to a bignum record
- convert the bignum record into a byte string
- write the byte string to a new data file

does that seem right?
[racket-users] Re: appending files
ok, had time to run my hash on my one test file:

'(611 1 1 19 24783208 4.19)

this means:
- # buckets
- % buckets empty
- non empty bucket, # keys least
- non empty bucket, # keys most
- total number of keys
- average number of keys per non empty bucket

it took 377 sec. original # records is 26570359, so 6.7% dupes. processing rate is 70478/sec.

my plan right now is to rework my current hash so that it runs byte strings instead of bignums. i will probably be tomorrow afternoon before i have more stats.

question for you all. right now i use modulo on my bignums. i know i can't do that to a byte string. i'll figure something out. if any of you know how to do this, can you post a method?
Re: [racket-users] Re: appending files
On Thursday, January 28, 2016 at 11:36:50 PM UTC-6, Brandon Thomas wrote:
> On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > > I think you understand perfectly.
> > i'm coming around
> >
> > > You said the keys are 128-bit (16 byte) values. You can store one
> > > key directly in a byte string of length 16.
> > yup
> >
> > > So instead of using a vector of pointers to individual byte
> > > strings, you would allocate a single byte string of length
> > > {buckets x chain length x 16}
> > > and index it directly as if it were a 3-dimensional array [using
> > > offset calculations as you would in C].
> > i'm going to actually run my code later (probably in the morning)
> > and then grab some stats from the hash. take a look at the stats.
> > give me some thoughts on the above implementation after looking at
> > the stats.
> >
> > > Can you explain why the bignums are important. For a simple task
> > > like filtering a file, they would seem to be both a waste of memory
> > > and a performance drain wrt storing the key values as byte strings.
> > ok, true story. about 10 years ago i'm a student taking an intro to
> > ai class. we have been assigned the 3x3 tile puzzle and 2 algorithms,
> > ida* and a*. also, try it on a 4x4. so i whip these out and am
> > appalled by how slow the fn things are but intrigued by what they can
> > do, and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by
> > my stuff at the time. so i head to the chief's office with my little
> > bit of code. tell him it's way too slow. what do you think? he takes 5
> > seconds to peruse the item and says "you're probably making too much
> > memory". side note, later i graded this class for that prof and the
> > same project. not a single student, including the ms and phd types,
> > did anything better than a vector of vectors of ints. so back to me,
> > i'm thinking, how do i cram 4 bits together so that there is
> > absolutely no wasted space. i start digging through the documentation
> > and i find the bitwise stuff. i see arithmetic shift and, having no
> > idea whether it will work or not, i type into the interface
> > (arithmetic-shift 1 100). if we blow up, well we blow up big. but
> > we didn't. i flipped everything in my project to bignum at that
> > point. the bignum stuff is massively faster than vector of vectors of
> > ints and faster than just a single vector of ints. lots of bignum is
> > easy to implement, like the hash. making a child state, not so much.
> > i'm not married to bignums despite all this.
> >
> > > I've been programming for over 30 years, professionally for 23.
> > i was a programmer. dot com bought us as a brick and mortar back in
> > '99 and the whole shebang blew up 2 years later. idiots. anyway, been
> > a stay at home dad for the most part since then.
> >
> > > have an MS in Computer Science
> > me too. that's the other piece of the most part from up above.
> >
> > > (current-memory-use)
> > yup, tried that a while back didn't like what i saw. check this out:
> >
> > > (current-memory-use)
> > 581753864
> > > (current-memory-use)
> > 586242568
> > > (current-memory-use)
> > 591181736
> > > (current-memory-use)
> > 595527064
> >
> > does that work for you?
>
> For whatever you're storing, I still suggest using a disk based
> structure (preferably using one that's already optimised and built for
> you). I've done a bit of work on cache aware algorithms, where reducing
> memory footprint is really big (along with the memory juggling). Yes,
> if you try to store something that takes only a few bits and store each
> one into an integer, you'll have wasted space. In theory, you could
> use a bignum and only shift it as many bits as you need, which is what
> you have done. The issue with that is that bignums have extra overhead
> that's necessary for them to do arithmetic. Obviously, there needs to be
> a way to match or beat bignums with primitive structures, since bignums
> are implemented with primitive structures. So, if you want to beat
> bignum for storage, you'll want to use some contiguous memory with
> fixed sized elements (like byte strings, or arrays of uint32_t's in C)
> - but using bit manipulation on each byte, such that you have multiple
> stored values in each one, directly beside each other bitwise, like a
> bignum has, but without its overhead.
>
> Regards,
> Brandon Thomas
[racket-users] Re: appending files
> You claim you want filtering to be as fast as possible. If that were
> so, you would not pack multiple keys (or features thereof) into a
> bignum but rather would store the keys individually.

chasing pointers? no, you're thinking about doing some sort of byte-append and subbytes type of thing. only way that data in a hash would be small in memory and reasonably quick. care to elaborate?

> The 16-byte keys are 1/3 the size of even the _smallest_ bignum,

where are you getting this? i've been digging all over the documentation and can't find a fn thing on how much space is required for any data type or its overhead. what i do is open up htop and check memory, then load drracket, a huge bignum, and recheck. close drracket, check memory, restart drracket, load a huge vector and its associated bignums and recheck.

so 2 questions for you all:
1) where is this info about data type memory requirements?
2) what is a better way of checking actual memory sucked up by my data structures?

> comparing two small byte strings is faster than anything you can do
> with the bignum. With the right data structure can put a lot more
> keys into the same space you're using now and use them faster.

you knew this was coming, right? put this into your data structure of choice:

16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14

this is a particular 5x5 tile puzzle (#6 in www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf) with the blank in a position where 4 children can be made. make a child while extracting the value of the tile being swapped with the blank. compare child to parent for equality. repeat that for the other 3 children. time testing that won't matter because we have different hardware. if you (any of you) think you have something that will best what i'm doing (bignums), bring it on. show me something cool and fast. let me check out your approach.

i experimented a bit with the byte strings yesterday. what i'm doing with the bignums can't be done with byte strings. i'd have to rewrite just about everything i've got.
[racket-users] Re: appending files
On Thu, 28 Jan 2016 07:56:05 -0800 (PST), Scotty C wrote:
> you knew this was coming, right? put this into your data structure of choice:
>
> 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14
>
> this is a particular 5x5 tile puzzle
> (#6 in www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf)
> with the blank in a position where 4 children can be made. make a child while
> extracting the value of the tile being swapped with the blank. compare child
> to parent for equality. repeat that for the other 3 children.
>
> i experimented a bit with the byte strings yesterday. what i'm doing with the
> bignums can't be done with byte strings.

But the problem is that you didn't sufficiently describe the problem.

Way back in this thread you implied that you had extremely large FILES containing FIXED SIZE RECORDS, from which you needed to FILTER DUPLICATE records based on the value of a FIXED SIZE KEY field. That is the basic problem description that Brandon and Neil and I all perceived. I don't like speaking for others, but it is pretty obvious from the content of their posts. I will dig up the actual quotes if you don't believe that this is what you implied.

And that is a problem which is not terribly well suited to hashing and in fact does not require any hashing at all to solve. And your revised problem as stated above still does not.

The obvious solution is to encode the puzzle squares in a known order into adjacent 5-bit fields. That forms a nice 128-bit value with 3 unused bits. Those values can be compared directly for equality and we're back to you don't need a hash at all.

Doesn't work for 6x6? Well 36 6-bit values fit neatly into 216 bits (27 bytes). 7x7 and 8x8 fit into 448 bits (56 bytes). 9x9 and 10x10 fit into 704 bits (88 bytes). Make the puzzle size a parameter to your programs: the needed field widths and the # of relevant bytes in the resultant "key" are trivial to calculate. And you still will not need hashing to remove duplicates.

8-)
George
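The 5-bit packing described above can be sketched like this (my illustration; the state is puzzle #6 quoted earlier in the thread):

```racket
#lang racket
;; Pack 25 tile values (0-24) into adjacent 5-bit fields: 125 bits,
;; i.e. the "128-bit value with 3 unused bits" described above.
(define (pack-state tiles)  ; tiles: list of 25 numbers 0-24
  (for/fold ([n 0]) ([t (in-list tiles)])
    (bitwise-ior (arithmetic-shift n 5) t)))

(define p
  (pack-state '(16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14)))
;; two packed states can then be compared directly with a single =
```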
[racket-users] Re: appending files
On Thu, 28 Jan 2016 07:56:05 -0800 (PST), Scotty C wrote:
> > You claim you want filtering to be as fast as possible. If that were
> > so, you would not pack multiple keys (or features thereof) into a
> > bignum but rather would store the keys individually.
>
> chasing pointers? no, you're thinking about doing some sort of
> byte-append and subbytes type of thing. only way that data in a
> hash would be small in memory and reasonably quick. care to
> elaborate?

I think you understand perfectly.

You said the keys are 128-bit (16 byte) values. You can store one key directly in a byte string of length 16. However, each individual byte string has allocator overhead [though less than a bignum, see below]. So instead of using a vector of pointers to individual byte strings, you would allocate a single byte string of length {buckets x chain length x 16} and index it directly as if it were a 3-dimensional array [using offset calculations as you would in C].

There might be an easier approach using multi-dimensional arrays. However, the array library is Typed Racket and there are issues mixing "typed" code with "untyped" code. I haven't looked into TR extensively so I don't know offhand if these arrays are "packed type homogenous" or simply "type homogenous" [the difference would be memory use, see below].

Can you explain why the bignums are important? For a simple task like filtering a file, they would seem to be both a waste of memory and a performance drain wrt storing the key values as byte strings.

> > The 16-byte keys are 1/3 the size of even the _smallest_ bignum,
>
> where are you getting this? i've been digging all over the documentation
> and can't find a fn thing on how much space is required for any data type
> or its overhead.
>
> so 2 questions for you all
> 1) where is this info about data type memory requirements?

I just know a lot of this from books and lots of experience. I've been programming for over 30 years, professionally for 23. I have an MS in Computer Science (and coursework for a 2nd MS but I left school before doing that dissertation). Among other things I have worked professionally on both compiler internals and GC. I know (quite) a bit about how to implement Lisp and Scheme, and Racket [at least the so-called "untyped" version] is a fairly typical tagged data system.

You can figure out a lot of it by careful reading, in particular about the C API and library (FFI) interfaces. For things like bignums, you have to look into the code and discover that Racket uses the GMP library for multiple-precision math (and then look into GMP for the details). The size of heap object descriptors (i.e. allocator and type overhead) necessarily is version, platform and compiler dependent. You have to figure it out for the particular implementation.

Some background on tagged data:

At the most basic level, there are only 2 types: "immediate" values which are ALU register sized - 32 or 64 bit depending - and "reference" values which are heap allocated objects accessed via pointers. [The terms "immediate" and "reference" come originally from Lisp.]

Immediate values include fixnums (small integers), characters and bytes, booleans, and pointers. Immediates are type tagged by stealing a few bits from the (register width) value, so fixnum integers and pointers do not quite have their full range [e.g., in Racket fixnums are 31 or 63 bits depending on platform]. A pointer to a reference value is an immediate value.

Reference values are essentially anything that is not an immediate value. Most importantly, floating point values are reference values because all of their bits are relevant to the FPU and it isn't possible to tag them by stealing bits as is done with immediates.

In general, heap objects always contain some number of generic "slots" each of which is an immediate value. However there is an important exception: for certain types, it is possible to create homogenous "packed" vectors in which the elements are not generic "slots" but rather are values of a specified type. Racket permits creating packed vectors from bytes, characters, fixnum integers, floats and doubles. For characters and bytes the packed vector is called a "string". For the others, their packed variants are distinct vector types.

The GMP library represents bignums as an information struct and a data array (of C longs). The array always contains at least 1 element, grows as needed but never shrinks. The struct contains a pointer and 2 longs. The size of pointers and longs are compiler dependent, so you need to know which compiler was used. On Linux Racket is compiled with GCC so pointers and longs are the same size: either 32-bits or 64-bits. The GMP documentation doesn't address allocation specifically ... it's treated as a black box ... but the info struct is defined and its members are laid out in a way that suggests the allocations might be separate (i.e. there is a
[racket-users] Re: appending files
what's been bothering me was trying to get the data into 16 bytes in a byte string of that length. i couldn't get that to work so gave up and just shoved the data into 25 bytes. here's a bit of code. i think it's faster than my bignum stuff.

(define p (bytes 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14))

(define (c1 p) ; move blank left
  (let ((x (bytes-ref p 15))
        (c (bytes-copy p)))
    (bytes-set! c 15 0)
    (bytes-set! c 16 x)
    (bytes=? p c) ; note: result unused -- likely a debugging leftover
    c))

-- You received this message because you are subscribed to the Google Groups "Racket Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[racket-users] Re: appending files
On Thu, 28 Jan 2016 11:49:09 -0800 (PST), Scotty C wrote:
>what's been bothering me was trying to get the data into 16 bytes in
>a byte string of that length. i couldn't get that to work so gave up and
>just shoved the data into 25 bytes. here's a bit of code. i think it's
>faster than my bignum stuff.
>
>(define p (bytes 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7
>21 15 11 8 3 14))
>(define (c1 p) ;move blank left
>  (let(
>   (x (bytes-ref p 15))
>   (c (bytes-copy p)))
>  (bytes-set! c 15 0)
>  (bytes-set! c 16 x)
>  (bytes=? p c)
>  c))

I don't doubt that it's faster. I'd bet that it will be faster even for larger sizes. It also has the advantage of being able to encode enormous 255x255 puzzles.

However, memory space rears its ugly head again. The allocation for a 25-byte string is on par with my estimate of the minimum allocation for a bignum. Packing the bits tight may save you ~16 bytes per entry. I stress _may_ because the allocator details are unknown to me. Use (current-memory-use) to figure it out.

George
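[The (current-memory-use) suggestion above can be turned into a rough measuring harness. This is a sketch under loose assumptions -- GC timing and allocator rounding make the numbers noisy -- and the helper name approx-bytes-per is illustrative, not from the thread:]

```racket
#lang racket

;; Estimate the real per-entry allocation cost of a thunk's result by
;; diffing (current-memory-use) across n allocations that are kept live.
(define (approx-bytes-per thunk n)
  (collect-garbage) (collect-garbage)          ; settle the heap first
  (define before (current-memory-use))
  (define keep (for/vector ([_ (in-range n)]) (thunk)))
  (define after (current-memory-use))
  (begin0 (quotient (- after before) n)
          (void (vector-length keep))))        ; keep the vector live past the measurement

;; Typically reports well above 25 for a 25-byte string: the allocator
;; rounds up and adds a header.
(approx-bytes-per (lambda () (make-bytes 25)) 100000)
```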
[racket-users] Re: appending files
On Wed, 27 Jan 2016 19:43:49 -0800 (PST), Scotty C wrote:
>> Then you're not using the hash in a conventional manner ... else the
>> filter entries would be unique
>
>using it conventionally? absolutely. it is a hash with separate chaining.

You snipped the part I was responding to, which was:

>>> In the worst case of every key being a bignum
>>no, every key is contained within a bignum which can contain many
>>many keys.

You claim you want filtering to be as fast as possible. If that were so, you would not pack multiple keys (or features thereof) into a bignum but rather would store the keys individually. The 16-byte keys are 1/3 the size of even the _smallest_ bignum, and comparing two small byte strings is faster than anything you can do with the bignum. With the right data structure you can put a lot more keys into the same space you're using now and use them faster.

George
[racket-users] Re: appending files
> I think you understand perfectly.
i'm coming around

> You said the keys are 128-bit (16 byte) values. You can store one key
> directly in a byte string of length 16.
yup

> So instead of using a vector of pointers to individual byte strings,
> you would allocate a single byte string of length
> {buckets x chain length x 16}
> and index it directly as if it were a 3-dimensional array [using
> offset calculations as you would in C].
i'm going to actually run my code later (probably in the morning). and then grab some stats from the hash. take a look at the stats. give me some thoughts on the above implementation after looking at the stats.

> Can you explain why the bignums are important. For a simple task like
> filtering a file, they would seem to be both a waste of memory and a
> performance drain wrt storing the key values as byte strings.
ok, true story. about 10 years ago i'm a student taking an intro to ai class. we have been assigned the 3x3 tile puzzle and 2 algorithms, ida* and a*. also, try it on a 4x4. so i whip these out and am appalled by how slow the fn things are but intrigued by what they can do and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by my stuff at the time. so i head to the chief's office with my little bit of code. tell him it's way too slow. what do you think? he takes 5 seconds to peruse the item and says "you're probably making too much memory". side note, later i graded this class for that prof and the same project. not a single student, including the ms and phd types, did anything better than a vector of vectors of ints.

so back to me, i'm thinking, how do i cram 4 bits together so that there is absolutely no wasted space. i start digging through the documentation and i find the bitwise stuff. i see arithmetic shift and having no idea whether it will work or not i type into the interface (arithmetic-shift 1 100). if we blow up, well we blow up big. but we didn't. i flipped everything in my project to bignum at that point. the bignum stuff is massively faster than vector of vectors of ints and faster than just a single vector of ints. lots of bignum is easy to implement, like the hash. making a child state, not so much. i'm not married to bignums despite all this.

> I've been programming for over 30 years, professionally for 23.
i was a programmer. dot com bought us as a brick and mortar back in '99 and the whole shebang blew up 2 years later. idiots. anyway, been a stay at home dad for the most part since then.

> have an MS in Computer Science
me too. that's the other piece of the most part from up above.

> (current-memory-use)
yup, tried that a while back. didn't like what i saw. check this out:

> (current-memory-use)
581753864
> (current-memory-use)
586242568
> (current-memory-use)
591181736
> (current-memory-use)
595527064

does that work for you?
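[The flat {buckets x chain x 16} layout proposed above might look something like this sketch. The sizes are toy values, the hash function is a stand-in, and put!/seen?/duplicate? are illustrative names, not from the thread:]

```racket
#lang racket

;; One big byte string indexed as {bucket x slot x 16}, plus a small
;; count vector tracking residency in each bucket's chain.
(define BUCKETS 59)
(define CHAIN   5)
(define KEYLEN  16)

(define table  (make-bytes (* BUCKETS CHAIN KEYLEN) 0))
(define counts (make-vector BUCKETS 0))

(define (bucket-of key) (modulo (bytes-ref key 0) BUCKETS)) ; stand-in hash fn

(define (offset b slot) (* KEYLEN (+ (* b CHAIN) slot)))    ; C-style 3-D indexing

(define (seen? key)
  (define b (bucket-of key))
  (for/or ([slot (in-range (vector-ref counts b))])
    (bytes=? key (subbytes table (offset b slot) (offset b (add1 slot))))))

(define (put! key)                        ; assumes the chain is not full
  (define b (bucket-of key))
  (define slot (vector-ref counts b))
  (bytes-copy! table (offset b slot) key)
  (vector-set! counts b (add1 slot)))

;; filter semantics from the thread: keep first instance, chuck the rest
(define (duplicate? key)
  (if (seen? key) #t (begin (put! key) #f)))
```

Keys are compared with bytes=? in place, so there is no per-key heap allocation at all during filtering.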
Re: [racket-users] Re: appending files
On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > I think you understand perfectly.
> i'm coming around
>
> > You said the keys are 128-bit (16 byte) values. You can store one
> > key directly in a byte string of length 16.
> yup
>
> > So instead of using a vector of pointers to individual byte
> > strings, you would allocate a single byte string of length
> > {buckets x chain length x 16}
> > and index it directly as if it were a 3-dimensional array [using
> > offset calculations as you would in C].
> i'm going to actually run my code later (probably in the morning).
> and then grab some stats from the hash. take a look at the stats.
> give me some thoughts on the above implementation after looking at
> the stats.
>
> > Can you explain why the bignums are important. For a simple task
> > like filtering a file, they would seem to be both a waste of memory
> > and a performance drain wrt storing the key values as byte strings.
> ok, true story. about 10 years ago i'm a student taking an intro to
> ai class. we have been assigned the 3x3 tile puzzle and 2 algorithms,
> ida* and a*. also, try it on a 4x4. so i whip these out and am
> appalled by how slow the fn things are but intrigued by what they can
> do and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by
> my stuff at the time. so i head to the chief's office with my little
> bit of code. tell him it's way too slow. what do you think? he takes
> 5 seconds to peruse the item and says "you're probably making too
> much memory". side note, later i graded this class for that prof and
> the same project. not a single student, including the ms and phd
> types, did anything better than a vector of vectors of ints. so back
> to me, i'm thinking, how do i cram 4 bits together so that there is
> absolutely no wasted space. i start digging through the documentation
> and i find the bitwise stuff.
> i see arithmetic shift and having no idea whether it will work or not
> i type into the interface (arithmetic-shift 1 100). if we blow up,
> well we blow up big but we didn't. i flipped everything in my project
> to bignum at that point. the bignum stuff is massively faster than
> vector of vectors of ints and faster than just a single vector of
> ints. lots of bignum is easy to implement, like the hash. making a
> child state, not so much. i'm not married to bignums despite all this.
>
> > I've been programming for over 30 years, professionally for 23.
> i was a programmer. dot com bought us as a brick and mortar back in
> '99 and the whole shebang blew up 2 years later. idiots. anyway, been
> a stay at home dad for the most part since then.
>
> > have an MS in Computer Science
> me too. that's the other piece of the most part from up above.
>
> > (current-memory-use)
> yup, tried that a while back didn't like what i saw. check this out:
>
> > (current-memory-use)
> 581753864
> > (current-memory-use)
> 586242568
> > (current-memory-use)
> 591181736
> > (current-memory-use)
> 595527064
>
> does that work for you?

For whatever you're storing, I still suggest using a disk-based structure (preferably one that's already optimised and built for you). I've done a bit of work on cache-aware algorithms, where reducing memory footprint is really big (along with the memory juggling).

Yes, if you try to store something that takes only a few bits and store each one into an integer, you'll have wasted space. In theory, you could use a bignum and only shift it as many bits as you need, which is what you have done. The issue with that is that bignums have extra overhead that's necessary for them to do arithmetic. Obviously, there needs to be a way to match or beat bignums with primitive structures, since bignums are implemented with primitive structures.

So, if you want to beat bignums for storage, you'll want to use some contiguous memory with fixed-size elements (like byte strings, or arrays of uint32_t's in C) - but using bit manipulation on each byte, such that you have multiple stored values in each one, directly beside each other bitwise, like a bignum has, but without its overhead.

Regards,
Brandon Thomas
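[For what it's worth, the tight bit-packing described above also answers the earlier 16-byte question: 25 tile values at 5 bits each is 125 bits, which does fit in a 16-byte string. A sketch -- helper names are illustrative, and a bignum is used only as a packing intermediate:]

```racket
#lang racket

;; Pack 25 tile values (each in [0,24], so 5 bits apiece) into 16 bytes.
(define (pack-tiles tiles)
  (define n (for/fold ([acc 0]) ([t tiles])      ; 125-bit big-endian integer
              (bitwise-ior (arithmetic-shift acc 5) t)))
  (define bs (make-bytes 16 0))
  (for ([i (in-range 16)])                       ; peel off one byte at a time
    (bytes-set! bs (- 15 i)
                (bitwise-and (arithmetic-shift n (* -8 i)) 255)))
  bs)

(define (unpack-tiles bs)
  (define n (for/fold ([acc 0]) ([b (in-bytes bs)])
              (bitwise-ior (arithmetic-shift acc 8) b)))
  (for/list ([i (in-range 25)])                  ; extract 5-bit fields, MSB first
    (bitwise-and (arithmetic-shift n (* -5 (- 24 i))) 31)))

(define state '(16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14))
(equal? state (unpack-tiles (pack-tiles state)))  ; #t -- round-trips
```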
[racket-users] Re: appending files
> Way back in this thread you implied that you had extremely large FILES
> containing FIXED SIZE RECORDS, from which you needed
> to FILTER DUPLICATE records based on the value of a FIXED SIZE KEY
> field.

this is mostly correct. the data is state and state associated data on the fringe. hence the name of the algorithm: fringe search. states (keys) are fixed size. associated data (due to the operator sequence) is variable size. i didn't post that here. i sent you (george) an email directly wed at 9:49 am according to my sent email box. getting this piece of the algorithm to go faster, less memory, both? awesome.

my actual test file (just checked) is 633 mb. it is data from perhaps halfway through a search. the fringe for a 5x5 grows by about 9x each successive fringe. i say about 9x because as the fringes grow, the amount of redundancy will increase. when i hit the limits of my hardware and patience with this algorithm i was at 90% redundancy but that fringe file was huge. i still hadn't produced an answer for the problem and decided i needed to get the code to run in parallel. that was about 5 years ago. last spring i started reworking my old stuff to work with places and ran out of enthusiasm until about 2 weeks ago.

> Doesn't work for 6x6? Well 36 6-bit values fit neatly into 216 bits
> (27 bytes).

the guy (korf) who did the paper on the 24 puzzle has attempted the 6x6 and failed. notice in the 24 puzzle paper that he was unable to solve one of those 10 sample problems. 5x5 is what i'm after.
Re: [racket-users] Re: appending files
On 1/27/2016 10:50 AM, Brandon Thomas wrote:
> On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> > On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas wrote:
> >
> > > Is there anything stopping you from restructuring
> > > the data on disk and using the hash directly from there
> >
> > Scotty's hash table is much larger than he thinks it is and very
> > likely is being paged to disk already. Deliberately implementing it
> > as a disk file is unlikely to improve anything.
>
> That's a good point to keep in mind. But there are advantages,
> including faster startup time, using less ram+swap, easier to keep
> the file updated and it makes it easier to make a resize solution.
> There are probably more, but basically it's the reasons why all large
> (key,value) storage solutions I've heard of use an explicit file
> instead of swap.

I miscalculated the scale of Scotty's hash structure - it's not as bad as I thought initially. But even so, it is of a scale where it is unwieldy and bound to have virtual memory problems unless the machine is dedicated.

Hashing is latency sensitive - it was designed to be a memory resident technique. Obviously it _can_ be done using file based buckets ... the effect is of querying an ISAM database for every access. The problem is that the latency increases by orders of magnitude: even from resident blocks, every access involves the file system API and the kernel. You did mention (user space) caching, but that greatly complicates the solution.

Making the hash external I think is not a win - it definitely will handle much bigger files, but it will handle every file more slowly. I think it is better to leverage the file system rather than fight it.

YMMV,
George
Re: [racket-users] Re: appending files
On 1/27/2016 9:49 AM, scott coston wrote:
> yes, me neither (11pm for me). i just rechecked my spreadsheet and i
> typed in 128 where i should have typed 256. but it doesn't really
> matter as my test file is small compared to where the file will be
> going.

Mine was worse - it's downright embarrassing. :-[ When I was running the numbers somehow I put in 6e10 as the scale - 60 billion instead of the 6 million you specified - and I didn't realize it until after I had sent the reply.

Your analysis of the hash size is quite reasonable (modulo allocator overhead). I do think, though, that you can save considerable space with a byte array vs pointer array + bignums.

> anyway, since you seem to have some ideas i have more info for you.
> take a look at this:
> https://www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf

I'll take a look at it. I haven't had to figure out an external merge filter for decades. It can be tricky when the number of files needs to be variable. And also as I mentioned, the fact that you want it sorted on the non-filtered field is a complication.

> From: racket-users@googlegroups.com <racket-users@googlegroups.com> on behalf of George Neuner <gneun...@comcast.net>
> Sent: Wednesday, January 27, 2016 4:28 AM
> To: racket-users@googlegroups.com
> Subject: Re: [racket-users] Re: appending files
>
> Sorry. I shouldn't do math at 4am. Ignore the numbers. However, it is
> still correct that the byte array will use less space than an array
> of bignums.
>
> George
Re: [racket-users] Re: appending files
On Wed, 2016-01-27 at 17:49 -0500, George Neuner wrote:
> On 1/27/2016 10:50 AM, Brandon Thomas wrote:
> > On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> > > On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas wrote:
> > >
> > > > Is there anything stopping you from restructuring
> > > > the data on disk and using the hash directly from there
> > >
> > > Scotty's hash table is much larger than he thinks it is and very
> > > likely is being paged to disk already. Deliberately implementing
> > > it as a disk file is unlikely to improve anything.
> >
> > That's a good point to keep in mind. But there are advantages,
> > including faster startup time, using less ram+swap, easier to keep
> > the file updated and it makes it easier to make a resize solution.
> > There are probably more, but basically it's the reasons why all
> > large (key,value) storage solutions I've heard of use an explicit
> > file instead of swap.
>
> I miscalculated the scale of Scotty's hash structure - it's not as
> bad as I thought initially. But even so, it is of a scale where it is
> unwieldy and bound to have virtual memory problems unless the machine
> is dedicated.
>
> Hashing is latency sensitive - it was designed to be a memory
> resident technique. Obviously it _can_ be done using file based
> buckets ... the effect is of querying an ISAM database for every
> access. The problem is that the latency increases by orders of
> magnitude: even from resident blocks, every access involves the file
> system API and the kernel. You did mention (user space) caching, but
> that greatly complicates the solution.
>
> Making the hash external I think is not a win - it definitely will
> handle much bigger files, but it will handle every file more slowly.
> I think it is better to leverage the file system rather than fight it.
>
> YMMV,
> George

Yup, the canonical solutions to storing large amounts of data scalably are rather complicated and are work to implement (I wouldn't say unmanageable though). And they get much harder if you wanted, say, ACID properties. This is why it's been factored into libraries, like Redis for example.

If it can fit into ram efficiently enough, that's ok. If he's planning to scale though, maybe it's worth looking at using Racket's ffi with a library like Redis or something. It might be the least work and the cleanest solution.

Regards,
Brandon Thomas
[racket-users] Re: appending files
On Wed, 27 Jan 2016 11:17:04 -0800 (PST), Scotty C wrote:
>On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:
>
>> What is this other field on which the file is sorted?
>this field is the cost in operators to arrive at the key value

Is it important to retain that sorting? Or is it just informational?

>> In the worst case of every key being a bignum
>no, every key is contained within a bignum which can contain many many keys.

Then you're not using the hash in a conventional manner ... else the filter entries would be unique ... and we really have no clue what you're actually doing. So any suggestions we give you are shots in the dark.

>> Since you are only comparing the hash entries for equality, you could
>> save a lot of space [at the expense of complexity] by defining a
>> {bucket x chain_size x 16} byte array and storing the 16-byte keys
>> directly.
>i must be able to grow the chains. i can't make it fixed size like that.

You said previously:

>>> ... i use separate chaining with a vector of
>>> bignums. i am willing to let my chains run up to 5 keys per chain
>>> so i need a vector of 6 million pointers.

That reads as needing a maximum of 6M x 5 entries. Extensible structures don't have to be a list - a small array works very well when the number of entries is bounded [which you said it was]. Again this is a case of we really have no clue what you are doing.

>> > have another rather large bignum in memory that i use to reduce
>> >but not eliminate record duplication of about .5 gb.
>> ???
>
>ha, ok. this is what this bignum is for. cycle elimination. a sequence
>of operators (2 bit per) when strung together is a number like 4126740
>which represents the operator sequence (0 0 3 0 0 3 1 1 2 2 1). i
>change that bit in the bignum from 0 to 1. during data generation i
>look up my about to be applied operator sequence in the bignum. if i
>see a one, i skip data generation.
>i'm not really happy with the volume
>of memory this takes but it is an insanely fast lookup and keeps a ton of
>data off the hard drive.

>> In the meantime, I would suggest you look up "merge sort" and its
>logarithmic? not happening

By splitting the too-large data file and trying to process it in sections, you're already O(M*N) where M is the number of sections. Logarithmic might improve things.

Rejecting an approach due solely to its BigO is naive. The order tells how it scales to the limits ... not how it behaves in any particular situation. In most cases there are constants and other linear multipliers that dominate the running time. It matters not just how many times you touch the data, but also the cost of each touch.

E.g., people read that Quicksort is O(N*LogN) and ignore that it is O(N*N) under certain circumstances. Mergesort has a higher constant multiplier, but it is O(N*LogN) in _all_ cases ... including those that throw Quicksort into fits. Plus Quicksort can't be used effectively when the data set exceeds RAM. Mergesort and some others can be used regardless of the data set size.

Also remember that it's the pattern of data access that's important - not what is actually being done with it. Even when data fits in memory, random access algorithms often are not the best. Prediction, caching and paging concerns often make the most effective algorithm require complex rearrangement and staging of data that is unnatural wrt the overall approach to the problem.

Disk file systems are optimized for sequential access. When data is file based, sequential access is to be preferred unless there's no other choice.

George
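[The merge step being advocated here, in miniature: two sorted runs combine in one sequential pass, dropping duplicates as they surface. This toy sketch assumes each run is itself sorted and internally dup-free; an external merge sort does the same thing with file-backed runs instead of lists:]

```racket
#lang racket

;; Merge two sorted, internally dup-free runs, keeping the first
;; instance of any key that appears in both (the thread's filter
;; semantics). One sequential pass, no random access.
(define (merge-dedupe a b)
  (cond [(null? a) b]
        [(null? b) a]
        [(< (car a) (car b)) (cons (car a) (merge-dedupe (cdr a) b))]
        [(> (car a) (car b)) (cons (car b) (merge-dedupe a (cdr b)))]
        [else (cons (car a) (merge-dedupe (cdr a) (cdr b)))])) ; equal: emit once

(merge-dedupe '(1 3 5 9) '(2 3 5 8))   ; '(1 2 3 5 8 9)
```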
Re: [racket-users] Re: appending files
On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas wrote:
>
> > Is there anything stopping you from restructuring
> > the data on disk and using the hash directly from there
>
> Scotty's hash table is much larger than he thinks it is and very
> likely is being paged to disk already. Deliberately implementing it
> as a disk file is unlikely to improve anything.
>
> George

That's a good point to keep in mind. But there are advantages, including faster startup time, using less ram+swap, easier to keep the file updated and it makes it easier to make a resize solution. There are probably more, but basically it's the reasons why all large (key,value) storage solutions I've heard of use an explicit file instead of swap.

Regards,
Brandon Thomas
Re: [racket-users] Re: appending files
On Tue, 2016-01-26 at 22:48 -0800, Scotty C wrote:
> ok brandon, that's a thought. build the hash on the hard drive at the
> time of data creation. you mention collision resolution. so let me
> build my hash on the hard drive using my 6 million buckets but
> increase the size of each bucket from 5 slots to 20. right?
> i can't exactly recreate my vector/bignum hash on the hard drive
> because i can't dynamically resize the buckets like i can the
> bignums. this gives me a 4 gb file whereas my original was 1 gb. i
> have enough space for that so that's not a problem.

Sure, a bigger file works. Your bignums are probably implemented using linked lists, if I'm not mistaken. That's why they are O(1) resizable. You could do the same thing on hdd. The problem is that you might need to seek and read through the file more with linked lists. Since you're probably paging, your implementation also has this same problem. You can get rid of this problem in both by using vectors though, as you suggest. And you can have a process in the background that resizes the file (by creating a new one) when your other one gets too full, so you don't block operations.

> so as my buckets fill up they head towards the average of 5 data
> items per bucket. so on average here's what happens with each hd hash
> record. i go to my hd hash and read 3.5 (think about it) items and
> 90% of the time i don't find my data so i do a write. in my process i
> do an initial write, then a read, a write, a read, a write. compare:
> 3.5 vs 2 reads; 1 vs 3 writes. the reads are more costly and if i
> exceed 20 items in a bucket the hd hash breaks. what do you think? is
> it worth it?

I think that implementation has room for improvement. Let's say you look for a record, it does not exist, and you need to write it. So, you seek to the location, read the bucket into ram (1 read). If the entry isn't in the bucket, you seek to the location in the bucket and write your entry (1 write).

Of course to do this, you'd think you'd need to make a function for your hash that searches for an entry, and uses a fallback to write the record if it fails. That's a reasonable function though. But you can also introduce the idea of using a cache, and have a background process update the file (that way your hash operations aren't waiting on the hdd). Though, your operating system's file cache might even be taking care of this for you, so you might want to do some benchmarking.

There are probably several other optimizations you can make, and you can probably find them by going through the source/docs of other (key,value) storage solutions. Or if you wanted to cut down on work, you can just use one of the other (key,value) storage solutions and use it via libffi.

Regards,
Brandon Thomas
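[The read-bucket-then-write scheme described above might be sketched as follows. This is a hypothetical fixed-bucket layout with toy sizes: an all-zero key would collide with the empty-slot sentinel, and real code would cache buckets and keep ports open rather than reopen the file per operation:]

```racket
#lang racket

(define BUCKETS 1024)
(define SLOTS   20)
(define KEYLEN  16)
(define EMPTY   (make-bytes KEYLEN 0))   ; empty-slot sentinel

;; Pre-size the file: BUCKETS buckets of SLOTS fixed 16-byte slots.
(define (make-table path)
  (call-with-output-file path #:exists 'truncate
    (lambda (out) (write-bytes (make-bytes (* BUCKETS SLOTS KEYLEN) 0) out))))

;; Seek to the key's bucket, read it (1 read); if the key is absent,
;; claim the first free slot (1 write). Returns #t iff the key was new.
(define (insert-if-absent! path key)
  (define b    (modulo (bytes-ref key 0) BUCKETS))  ; stand-in hash fn
  (define base (* b SLOTS KEYLEN))
  (define bucket
    (call-with-input-file path
      (lambda (in) (file-position in base) (read-bytes (* SLOTS KEYLEN) in))))
  (define (slot-bytes s) (subbytes bucket (* s KEYLEN) (* (add1 s) KEYLEN)))
  (let loop ([s 0])
    (cond [(= s SLOTS) (error "bucket full -- table needs resizing")]
          [(bytes=? key (slot-bytes s)) #f]           ; already present
          [(bytes=? EMPTY (slot-bytes s))             ; free slot: claim it
           (call-with-output-file path #:exists 'update
             (lambda (out)
               (file-position out (+ base (* s KEYLEN)))
               (write-bytes key out)))
           #t]
          [else (loop (add1 s))])))
```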
[racket-users] Re: appending files
On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:

> What is this other field on which the file is sorted?
this field is the cost in operators to arrive at the key value

> WRT a set of duplicates: are you throwing away all duplicates? Keeping
> the 1st one encountered? Something else?
keep first instance, chuck the rest

> This structure uses a lot more space than necessary. Where did you
> get the idea that a bignum is 10 bytes?
not sure about the 10 bytes. if i shove 5 128 bit keys into a bignum is that about 80 bytes plus some overhead? so 80 bytes times my 6 million buckets is 480 mb not including overhead.

> In the worst case of every key being a bignum
no, every key is contained within a bignum which can contain many many keys.

> Since you are only comparing the hash entries for equality, you could
> save a lot of space [at the expense of complexity] by defining a
> {bucket x chain_size x 16} byte array and storing the 16-byte keys
> directly.
i must be able to grow the chains. i can't make it fixed size like that.

> > have another rather large bignum in memory that i use to reduce
> >but not eliminate record duplication of about .5 gb.
> ???
ha, ok. this is what this bignum is for. cycle elimination. a sequence of operators (2 bit per) when strung together is a number like 4126740 which represents the operator sequence (0 0 3 0 0 3 1 1 2 2 1). i change that bit in the bignum from 0 to 1. during data generation i look up my about to be applied operator sequence in the bignum. if i see a one, i skip data generation. i'm not really happy with the volume of memory this takes but it is an insanely fast lookup and keeps a ton of data off the hard drive.

> In the meantime, I would suggest you look up "merge sort" and its
logarithmic? not happening
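[The cycle-elimination bitmap described above amounts to something like this sketch. The bit ordering of the 2-bit encoding is a guess, so the example doesn't try to reproduce the 4126740 figure; names are illustrative:]

```racket
#lang racket

;; String the 2-bit operator codes together into one integer index.
(define (encode-ops ops)              ; e.g. '(0 0 3 0 0 3 1 1 2 2 1)
  (for/fold ([n 0]) ([op ops])
    (bitwise-ior (arithmetic-shift n 2) op)))

;; One bit per possible operator sequence, kept in a single bignum.
(define seen 0)

(define (mark-seen! ops)
  (set! seen (bitwise-ior seen (arithmetic-shift 1 (encode-ops ops)))))

(define (seen-before? ops)
  (bitwise-bit-set? seen (encode-ops ops)))
```

The lookup is a single bitwise-bit-set?, which matches the "insanely fast" characterization; the cost is that the bitmap's size grows with the largest encoded sequence.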
[racket-users] Re: appending files
> Is it important to retain that sorting? Or is it just informational?
it's important

> Then you're not using the hash in a conventional manner ... else the
> filter entries would be unique ... and we really have no clue what
> you're actually doing. So any suggestions we give you are shots in
> the dark.
using it conventionally? absolutely. it is a hash with separate chaining. will a bit of code help?

(define (put p)
  ; prepend key p to its bucket's chain by shifting the bucket
  ; bignum left shftl bits and or-ing p into the low bits
  (define kh (hashfn p))
  (vector-set! tble kh
               (bitwise-ior (arithmetic-shift (vector-ref tble kh) shftl)
                            p)))
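[A matching lookup for the put above would scan the bucket bignum in shftl-bit chunks. This is a guess at the scheme -- it assumes shftl exceeds the key width and that no key is zero (zero terminates the scan); names follow the snippet above:]

```racket
#lang racket

(define shftl 130)                          ; chunk width, > 128-bit key width
(define mask (sub1 (arithmetic-shift 1 shftl)))

;; Walk the chain bignum one chunk at a time, comparing each
;; low chunk against key p.
(define (chain-member? chain p)
  (let loop ([c chain])
    (cond [(zero? c) #f]                    ; ran off the end of the chain
          [(= (bitwise-and c mask) p) #t]   ; low chunk matches the key
          [else (loop (arithmetic-shift c (- shftl)))])))
```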
[racket-users] Re: appending files
Hi Scotty, I rearranged your message a bit for (my own) clarity. On Tue, 26 Jan 2016 18:40:28 -0800 (PST), Scotty Cwrote: >running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. >the generated keys are random but i use one of the associated >fields for sorting during the initial write to the hard drive. what goes >in each of those files is totally random but dupes do not run across >files. also, the number of keys is >1e25. >my key is 128 bits with ~256 bits per record. so my 1 gb file contains >~63 million records and ~32 million keys. about 8% will be dupes >leaving me with ~30 million keys. You have 63 million 256-bit records in a 1GB file? The description of your "records" is vague and can be interpreted more than 1 way, but my math says: 256 bits (128 key + 128 data) = 32M records 384 bits (128 key + 256 data) = ~22M records Anyway ... I'll assume for filtering only the keys are important. If I understand correctly, you have a 128-bit "key space" and that you expect to see some small percentage of the keys duplicated. The file is sorted on some other non-key value. What is this other field on which the file is sorted? WRT a set of duplicates: are you throwing away all duplicates? Keeping the 1st one encountered? Something else? >i run a custom built hash. i use separate chaining with a vector of >bignums. i am willing to let my chains run up to 5 keys per chain >so i need a vector of 6 million pointers. that's 48 mb for the array. >another 480 mb for the bignums. let's round that sum to .5 gb. This structure uses a lot more space than necessary. Where did you get the idea that a bignum is 10 bytes? Unless something changed recently, Racket uses the GMP library for multiple precision math. GMP bignums require 2 allocations: a struct plus an array. Repesenting a full size 128-bit value on a 64-bit machine takes 40-48 bytes depending on the library's int size, and not including any allocator overhead. 
In the worst case of every key being a bignum, your hash is using 2.6 to 3.1 GB ... not 0.5 GB. Again, not including allocator overhead.

Since you are only comparing the hash entries for equality, you could save a lot of space [at the expense of complexity] by defining a {bucket x chain_size x 16} byte array and storing the 16-byte keys directly. Even using an auxiliary vector for tracking residency in the array, you'd be down under 1GB for the same 6 million hash entries.

Even so, this seems like a bad approach to me.

>i have another rather large bignum in memory that i use to reduce
>but not eliminate record duplication of about .5 gb.

???

>i'm attempting to get this thing to run in 2 places so i need 2 hashes.
>add this up .5+.5+.5 is 1.5 gb and that gets me to about my memory
>limit.

I don't have time just now to sketch a split/merge solution ... I'll get back to it later.

The file already being sorted on an unfiltered field is a complication. Offhand I don't think the sort order can be maintained through the filtering ... meaning the filtered file would need to be sorted again afterward. But even so it might be faster than your current method.

In the meantime, I would suggest you look up "merge sort" and its application to "external" sorting. That will give you the basic idea of what I'm proposing.

George
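[Editor's note: the {bucket x chain_size x 16} byte array George describes might be sketched in Racket roughly as below. The sizes are shrunk for illustration (Scotty's table would use ~6 million buckets), and `seen-before?` is a hypothetical name, not from anyone's actual code.]

```racket
#lang racket
;; Flat byte-array hash: BUCKETS chains of up to CHAIN slots, each slot
;; holding a 16-byte key directly. Sizes shrunk for illustration.
(define BUCKETS 1024)
(define CHAIN 5)
(define KEYLEN 16)
(define slots (make-bytes (* BUCKETS CHAIN KEYLEN)))
(define counts (make-vector BUCKETS 0)) ; auxiliary residency vector

;; Returns #t if the 16-byte key was already in its chain; otherwise
;; inserts it and returns #f.
(define (seen-before? key bucket)
  (define base (* bucket CHAIN KEYLEN))
  (define n (vector-ref counts bucket))
  (or (for/or ([i (in-range n)])
        (define off (+ base (* i KEYLEN)))
        (bytes=? key (subbytes slots off (+ off KEYLEN))))
      (begin
        (when (>= n CHAIN) (error 'seen-before? "chain overflow"))
        (bytes-copy! slots (+ base (* n KEYLEN)) key)
        (vector-set! counts bucket (add1 n))
        #f)))
```

Everything lives in the two flat arrays, so memory use is fixed at BUCKETS x CHAIN x 16 bytes plus the residency vector, with no per-key allocation at all.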
Re: [racket-users] Re: appending files
Sorry. I shouldn't do math at 4am. Ignore the numbers. However, it is still correct that the byte array will use less space than an array of bignums.

George

On 1/27/2016 3:54 AM, George Neuner wrote:
>> i run a custom built hash. i use separate chaining with a vector of
>> bignums. i am willing to let my chains run up to 5 keys per chain so
>> i need a vector of 6 million pointers. that's 48 mb for the array.
>> another 480 mb for the bignums. let's round that sum to .5 gb.
>
> This structure uses a lot more space than necessary. Where did you get
> the idea that a bignum is 10 bytes?
>
> Unless something changed recently, Racket uses the GMP library for
> multiple precision math. GMP bignums require 2 allocations: a struct
> plus an array. Representing a full size 128-bit value on a 64-bit
> machine takes 40-48 bytes depending on the library's int size, and not
> including any allocator overhead.
>
> In the worst case of every key being a bignum, your hash is using 2.6
> to 3.1 GB ... not 0.5 GB. Again, not including allocator overhead.
>
> Since you are only comparing the hash entries for equality, you could
> save a lot of space [at the expense of complexity] by defining a
> {bucket x chain_size x 16} byte array and storing the 16-byte keys
> directly. Even using an auxiliary vector for tracking residency in the
> array, you'd be down under 1GB for the same 6 million hash entries.
>
> Even so, this seems like a bad approach to me.
[racket-users] Re: appending files
On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas wrote:

>Is there anything stopping you from restructuring
>the data on disk and using the hash directly from there

Scotty's hash table is much larger than he thinks it is and very likely is being paged to disk already. Deliberately implementing it as a disk file is unlikely to improve anything.

George
[racket-users] Re: appending files
ok brandon, that's a thought. build the hash on the hard drive at the time of data creation. you mention collision resolution. so let me build my hash on the hard drive using my 6 million buckets but increase the size of each bucket from 5 slots to 20. right? i can't exactly recreate my vector/bignum hash on the hard drive because i can't dynamically resize the buckets like i can the bignums. this gives me a 4 gb file whereas my original was 1 gb. i have enough space for that so that's not a problem.

so as my buckets fill up they head towards the average of 5 data items per bucket. so on average here's what happens with each hd hash record. i go to my hd hash and read 3.5 (think about it) items and 90% of the time i don't find my data so i do a write. in my process i do an initial write, then a read, a write, a read, a write. compare: 3.5 vs 2 reads; 1 vs 3 writes. the reads are more costly and if i exceed 20 items in a bucket the hd hash breaks. what do you think? is it worth it?
Re: [racket-users] Re: appending files
On Tue, 2016-01-26 at 18:40 -0800, Scotty C wrote:
> alright george, i'm open to new ideas. here's what i've got going.
> running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. my
> key is 128 bits with ~256 bits per record. so my 1 gb file contains
> ~63 million records and ~32 million keys. about 8% will be dupes
> leaving me with ~30 million keys. i run a custom built hash. i use
> separate chaining with a vector of bignums. i am willing to let my
> chains run up to 5 keys per chain so i need a vector of 6 million
> pointers. that's 48 mb for the array. another 480 mb for the bignums.
> let's round that sum to .5 gb. i have another rather large bignum in
> memory that i use to reduce but not eliminate record duplication of
> about .5 gb. i'm attempting to get this thing to run in 2 places so i
> need 2 hashes. add this up .5+.5+.5 is 1.5 gb and that gets me to
> about my memory limit. the generated keys are random but i use one of
> the associated fields for sorting during the initial write to the
> hard drive. what goes in each of those files is totally random but
> dupes do not run across files. also, the number of keys is >1e25.

Sorry, I haven't read through the entire conversation, so I hope I'm not missing anything. Is there anything stopping you from restructuring the data on disk and using the hash directly from there (possibly with the help of a cache if speed is important)?

For example, let's say each entry is 256 bits. Use something like "(file-position dbfile (* 32 (hash-custom key)))" to seek over to the appropriate entry on disk and read just the entry you need (using whatever collision resolution). Then you'll be using no auxiliary memory (unless you're caching, which can just be a smaller ram hash table). Unless of course I'm just missing something completely.

Regards,
Brandon Thomas
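[Editor's note: Brandon's on-disk probe might look roughly like this in Racket. The record size, slot count, key layout, and `hash-custom` are all assumptions for illustration, not taken from Scotty's actual code.]

```racket
#lang racket
;; Seek straight to a key's bucket in an on-disk hash and scan its
;; fixed-size slots. Layout assumptions: 32-byte records with the
;; 16-byte key first, and a fixed number of slots per bucket.
(define RECBYTES 32) ; 256-bit records, key in first 16 bytes
(define SLOTS 20)    ; slots per on-disk bucket

;; Returns the matching record as bytes, or #f if not present.
(define (bucket-lookup dbfile key hash-custom)
  (file-position dbfile (* RECBYTES SLOTS (hash-custom key)))
  (for/or ([i (in-range SLOTS)])
    (define rec (read-bytes RECBYTES dbfile))
    (and (bytes? rec)
         (bytes=? key (subbytes rec 0 16))
         rec)))
```

Since `file-position` also works on string/byte ports, the same probe can be exercised against an in-memory image of the file for testing.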
[racket-users] Re: appending files
robby findler, you the man. i like the copy-port idea. i incorporated it and it is nice and fast and easily fit into the existing code.
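[Editor's note: for reference, the whole append step reduces to a few lines with copy-port. The function name and file paths below are placeholders, not Scotty's actual code.]

```racket
#lang racket
;; Append the contents of several part files onto one output file
;; using copy-port (from racket/port, included in #lang racket).
(define (append-files out-path part-paths)
  (call-with-output-file out-path #:exists 'append
    (lambda (out)
      (for ([p (in-list part-paths)])
        (call-with-input-file p
          (lambda (in) (copy-port in out)))))))
```

copy-port streams in fixed-size chunks, so this never loads a whole part file into memory, and it behaves identically across operating systems.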
[racket-users] Re: appending files
neil van dyke, i have used the system function before but had forgotten what it was called and couldn't find it as a result in the documentation. my problem with using the system function is that i need 2 versions of it: windoz and linux. the copy-port function is a write-once, run-on-multiple-OS solution. sweet.
[racket-users] Re: appending files
gneuner2 (george), you are over thinking this thing. my test data of 1 gb is but a small sample file. i can't even hash that small 1 gb at the time of data creation. the hashed data won't fit in ram.

at the time i put the redundant data on the hard drive, i do some constant time sorting so that the redundant data on the hard drive is contained in roughly 200 usefully sorted files. some of these files will be small and can be hashed with a single read, hash and write. some will be massive (data won't fit in ram) and must be split further. this produces another type of single read, hash and write. these split files can now be fully hashed which means a second read, hash and write. recombining the second level files is virtually instantaneous (copy-port) relative to the effort spent to get to that point.

all of these operations are constant time. it would be nice to cut into that big fat hard drive induced C but i can't do it with a single read and write on the larger files.
Re: [racket-users] Re: appending files
On 1/26/2016 2:51 PM, Scotty C wrote:
> gneuner2 (george), you are over thinking this thing. my test data of 1
> gb is but a small sample file. i can't even hash that small 1 gb at
> the time of data creation. the hashed data won't fit in ram. at the
> time i put the redundant data on the hard drive, i do some constant
> time sorting so that the redundant data on the hard drive is contained
> in roughly 200 usefully sorted files. some of these files will be
> small and can be hashed with a single read, hash and write. some will
> be massive (data won't fit in ram) and must be split further. this
> produces another type of single read, hash and write. these split
> files can now be fully hashed which means a second read, hash and
> write. recombining the second level files is virtually instantaneous
> (copy-port) relative to the effort spent to get to that point. all of
> these operations are constant time. it would be nice to cut into that
> big fat hard drive induced C but i can't do it with a single read and
> write on the larger files.

I reject the term "over thinking". You may have created something that works, but that doesn't mean it can't be improved.

Remember that programs running in just 10's of kilobytes routinely processed data files containing 10's of megabytes. Modern programmers have forgotten - or never knew - how to work with very large data sets that don't necessarily fit into memory.

For example: it matters greatly what is being hashed and how. With Racket hashes, using a string key retains the string for comparison - so hashing a file containing a large percentage of unique strings produces a very large hash ... in the worst case the hash may be bigger than the data file. If the strings are suitably long, then, e.g., crypto-digesting them down to ~20..32 byte values and hashing digests instead of strings might allow the hash to be memory resident or let you process much larger files using the same amount of memory.
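[Editor's note: the digest idea above might be sketched with SHA-1 from file/sha1, keying the duplicate filter on fixed 20-byte digests rather than the raw strings. The names `digest-key` and `duplicate?` are illustrative.]

```racket
#lang racket
(require file/sha1)

;; Hash fixed-size digests instead of long unique strings, so the
;; table holds 20-byte keys regardless of string length.
(define seen (make-hash))

(define (digest-key s)
  (sha1-bytes (open-input-bytes (string->bytes/utf-8 s))))

;; #t if the string (by digest) was already seen; records it otherwise.
(define (duplicate? s)
  (define k (digest-key s))
  (cond [(hash-ref seen k #f) #t]
        [else (hash-set! seen k #t) #f]))
```

The trade-off is a vanishingly small chance of a digest collision silently dropping a non-duplicate, which is usually acceptable for this kind of filtering.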
If the records are sorted on the field(s) that may be duplicated, then by splitting the file and using multi-file processing techniques duplicates can be filtered from all the sections without hashing and in a single pass through the whole of the data - leaving a single sorted file as output.

If your input data is unordered, then as you split the file you sort each section individually with an in-memory sort. The re-merge step - which is where the filtering is performed - sorts the output. This is very efficient and it works with stupendously huge data files because the merge step can (more or less) simultaneously process as many files (split sections) as data _records_ will fit into memory.

And there are times when you should give up and embrace a DBMS, if only for its proficiency at disk based data handling.

YMMV,
George
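[Editor's note: the merge-with-filtering step can be sketched with sorted lists of numbers standing in for the sorted part files; a real version would stream records from ports instead. Each round emits the smallest head once and drops every copy of it, both across runs and within a single run.]

```racket
#lang racket
;; Drop every leading element of run equal to k (within-run duplicates).
(define (drop-key run k)
  (if (and (pair? run) (= (car run) k))
      (drop-key (cdr run) k)
      run))

;; K-way merge of sorted runs with duplicate filtering: the smallest
;; head is written once per round, then removed from every run.
(define (merge-filter sorted-runs)
  (let loop ([runs (filter pair? sorted-runs)] [out '()])
    (if (null? runs)
        (reverse out)
        (let* ([smallest (apply min (map car runs))]
               [rest (filter pair?
                             (map (lambda (r) (drop-key r smallest)) runs))])
          (loop rest (cons smallest out))))))

;; (merge-filter '((1 3 5) (2 3 6) (1 4))) produces '(1 2 3 4 5 6)
```

Because each run is consumed strictly front-to-back, the same loop works when the runs are files read a record at a time, which is why memory use depends only on the number of runs, not their size.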
Re: [racket-users] Re: appending files
+1 on George Neuner's comments about how one can do smart processing of huge files in small space. (I almost said something about that myself, but didn't have time to get into that kind of discussion, so I stuck to only the simpler file concatenation question.) BTW, students who have 8GB RAM and 256GB of SSD in their cheap laptop, and who are looking for relevance of learning how to compute with resource constraints, should note the current buzzwords "big data", "Internet of Things", etc. (Most any 12 year-old can learn to code a phone/tablet app, throw up an AJAX-y Web site, or make 3D game assets/mods. One of the reasons that professional software engineers have to keep learning is so that we can do things that 12yos can't.) Neil V.
[racket-users] Re: appending files
alright george, i'm open to new ideas. here's what i've got going. running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. my key is 128 bits with ~256 bits per record. so my 1 gb file contains ~63 million records and ~32 million keys. about 8% will be dupes leaving me with ~30 million keys. i run a custom built hash. i use separate chaining with a vector of bignums. i am willing to let my chains run up to 5 keys per chain so i need a vector of 6 million pointers. that's 48 mb for the array. another 480 mb for the bignums. let's round that sum to .5 gb. i have another rather large bignum in memory that i use to reduce but not eliminate record duplication of about .5 gb. i'm attempting to get this thing to run in 2 places so i need 2 hashes. add this up .5+.5+.5 is 1.5 gb and that gets me to about my memory limit. the generated keys are random but i use one of the associated fields for sorting during the initial write to the hard drive. what goes in each of those files is totally random but dupes do not run across files. also, the number of keys is >1e25.