Thanks. I couldn't follow the examples so I worked through it on my own. Sharing it here for anyone else who's following. I switched to 9 bits instead of 19 to simplify the example
NB. 300 needs 9 bits to be expressed $ #: 300 9 NB. Show 4 9 bit numbers $ #: 300 301 302 303 4 9 Writing those 4 numbers would take 16 bytes[1] since J ints are 4 bytes as well as C We can compress it down to 5 bytes if we pack all the bits together: ] packed=. #. _8[\ , #: 300 301 302 303 150 75 101 210 240 $ packed 5 Those 5 bytes could be written to a file and re-read If I knew in advance that those numbers were 9 bits, it can be reversed #. _9[\, #: packed 300 301 302 303 0 We could also pack it into 8 bytes (two 32 bit integers), but I don't see an advantage of that over packing into a single byte (8 bit) ] packed=. #. _32[\ , #: 300 301 302 303 2.52152e9 4.02653e9 #. _9[\, #: packed 300 301 302 303 0 0 0 0 This may be elementary for some but I haven't worked at this level in years so I enjoyed playing with it [1] - http://www.jsoftware.com/help/learning/27.htm On Wed, Dec 31, 2014 at 11:51 AM, 'Pascal Jasmin' via Programming <[email protected]> wrote: > Also, if you have only 4 19 bit values, you can pack them up tighter > > _2 (16&#.)\ #. |: ?. 4 19 $ 2 > 109 59 161 30 136 80 64 129 56 1 > > Non multiples of 8 would just generally leave some trailing 0s on decode, > which could simply be understood as invalid. > > > ----- Original Message ----- > From: 'Pascal Jasmin' via Programming <[email protected]> > To: "[email protected]" <[email protected]> > Cc: > Sent: Wednesday, December 31, 2014 11:40 AM > Subject: Re: [Jprogramming] variable integer coding (challenge) > > base64 is essentially taking 3 bytes and storing them as 4. The inverse can > take 4 6bit values and pack them into 3 bytes. The J code for base64 is > pretty generic to see how to do it for any base. 7 bit values could be > packed 8 at a time in 7 bytes. > > There is a different way to pack 8 x bit values into x bytes. Each byte > would store the i'th bit of the 8 values. This may be even faster than > base64 type coding. > > #. |: ?. 8 19 $ 2 > 104 214 59 183 174 23 31 225 134 149 66 31 85 19 133 1 51 148 28 > > > ----- Original Message ----- > From: Joe Bogner <[email protected]> > To: [email protected] > Cc: > Sent: Wednesday, December 31, 2014 10:02 AM > Subject: Re: [Jprogramming] variable integer coding (challenge) > > Can 19 bits be stored in memory or a file on modern computers without > taking up 3 bytes (24 bits)? The only idea that came to mind would some bit > packing scheme with multiple items. That seems complicated unless there was > a need to store billions of these numbers > > On Dec 31, 2014 9:28 AM, "Joe Bogner" <[email protected]> wrote: > >> Thanks. I missed the point of the variable breakup point but exploring >> different options below illustrated it for me. >> >> Changing the breakup point can encode the numbers into different bit >> lengths (other than 8) which can then be written to the file/memory. >> >> For example, 1021 encoded in 511 breakups would take 18 bits >> >> $ #: 511 toVint 1021 >> 2 9 >> >> Whereas encoding it as 255 would require 40 bits, which is more than a >> 32-bit int. This is where I was failing to see the point of encoding >> larger numbers >> >> NB. 5 bytes, 40 bits >> $ #: 255 toVint 1021 >> 5 8 >> >> Without trying the binary representation, I was expecting the breakup >> point to take the same space. For example, below I would have thought >> that I'd need to write a >> >> NB. 6 bytes >> 255 toVint 511 toVint 1021 >> 255 255 1 255 255 0 >> >> >> If I understand this correctly, I can see how variable ints could >> shave off a few bits if the range of numbers is known in advance, >> which is often the case. >> >> >> Regarding packed strings I've been thinking about it recently as well. >> I would like to see how it could be used to pack a table of strings >> (rows and columns). >> >> The following string could be a packed representation of a 2x2 table >> >> FrankUSATomGermany >> >> The binary format could be: >> >> [# of rows] [# of columns] [row1-col1 length] [row1-col2 length].... >> >> The header for the string could be >> 2 2 5 3 3 7 >> >> Or maybe if you wanted to support a variable number of columns (unsure >> of the usecase of this) >> >> [# of rows] [row-1 # of columns] [row1-col1 length] [row1-col2 length] >> 2 2 5 3 2 3 7 >> >> I don't know well it'd perform relative to boxed tables or inverted >> tables. I would imagine the memory would be reduced for storage but >> I'd have to test to see how much memory would be used on different >> operations. I'd be interested in the performance of operations like >> searching (e.g. find all the rows with a country of Germany) >> >> It might also be a safer way of storing user-input text without >> worrying about escaping delimiters. >> >> On Tue, Dec 30, 2014 at 1:18 PM, 'Pascal Jasmin' via Programming >> <[email protected]> wrote: >> > I also provided some adverbs >> > >> > toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]' >> > fromVint=: 1 : 'm&~: +/;.2 ]' >> > >> > these let you set the breakup point to what you wish >> > >> > 15 toVint 33 12 17 >> > 15 15 3 12 15 2 >> > 511 toVint 1233 >> > 511 511 211 >> > >> > 511 fromVint 511 toVint 1233 444 >> > 1233 444 >> > It can provide simple compression. If you have a list of many strings >> where almost all of them will fit into a 6bit alphabet, you can still >> encode a full 7 bit or 8bit alphabet. On average, you'd save space using >> 6bit alphabet, depending on your context. >> > >> > The variable string functions I showed, still let you do a substring >> search on the packed data. Part of it is disk and memory compression >> (don't expand until you need to, and a string takes much less memory than a >> list of numbers < 256), but part of it is not being forced in design to >> determine what the impossible cutoff number is. >> > >> > A big advantage related to J is having binary data records. If you >> encode the length of the binary data, you no longer have to box them if a >> trailing 0 or space would be difficult to know whether it is a fill or >> intentional. >> > >> > Beyond binary data, there is dumb stuff such as storing addresses. Most >> people have just 1, Most fit within 255 bytes. Usually no one cares about >> address info except when they do or already have the client record. Do you >> really want a separate table with 1 to many join with a possible 10000 >> character fixed size per address just because you're unsure if it will ever >> be possible for you to store an address to an African village as a set of >> driving instructions and then a description of the house relative to 3 >> other houses and landmarks just in case one of the other houses gets >> painted. >> > >> > Another neat feature about the coding method is that if you have a lot >> of inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least >> compresses well. >> > >> > A minor thing is that the encoding still lets you do search by < or >. >> If you are storing the number of orders per customer. The first byte tells >> you whether it is > 254 or less than any other number below 255 (probably >> enough info to select customers of interest). Another minor thing is that >> you don't have to use the encoding codes until there is an overflow. It >> may be easier to just switch functions to decode variable data, then to >> reshape terrabytes of data. >> > ________________________________ >> > From: Joe Bogner <[email protected]> >> > To: [email protected] >> > Sent: Tuesday, December 30, 2014 10:45 AM >> > Subject: Re: [Jprogramming] variable integer coding (challenge) >> > >> > >> > Pascal, >> > >> > Thanks for sharing. Can you also share a potential use case? >> > >> > I had to remind myself about the size of integers (4 bytes) on 32-bit. >> > >> > Will this only save space on numbers less than 1019? >> > >> > NB. 4 bytes >> > toVint2 1019 >> > 255 255 255 254 >> > >> > NB. 5 bytes >> > toVint2 1020 >> > 255 255 255 255 0 >> > >> > Are you looking for the savings in terms of disk space (saving it to a >> > file) or working memory? >> > >> > >> > >> > >> > >> > >> > On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming >> > <[email protected]> wrote: >> >> A common design decision is how large to make a field. For many >> numbers, we think that a byte will be large enough to hold practical >> values, but one day we will find out that it is not. This is essentially >> the Y2k problem, or year 2032 issue. >> >> >> >> A simple variable integer coding (tied to byte ranges) >> >> >> >> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each >> >> >> >> toVint 566 44 >> >> 255 255 56 44 >> >> >> >> It now takes 3 bytes to store 566, and one byte to store 44. >> >> >> >> the challenge is to write fromVint such that >> >> >> >> >> >> 566 44 -: fromVint 255 255 56 44 >> >> >> >> as an extra challenge, is there a way to write toVint or fromVint such >> that it is faster than using intermediate boxing? >> >> ---------------------------------------------------------------------- >> >> For information about J forums see http://www.jsoftware.com/forums.htm > >> > ---------------------------------------------------------------------- >> > For information about J forums see http://www.jsoftware.com/forums.htm >> > ---------------------------------------------------------------------- >> > For information about J forums see http://www.jsoftware.com/forums.htm >> > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
