Re: [Jprogramming] variable integer coding (challenge)

Joe Bogner Wed, 31 Dec 2014 12:15:40 -0800

Thanks. I couldn't follow the examples so I worked through it on my
own. Sharing it here for anyone else who's following. I switched to 9
bits instead of 19 to simplify the example


   NB. 300 needs 9 bits to be expressed
   $ #: 300
9

   NB. Show 4 9 bit numbers
   $ #: 300 301 302 303
4 9

Writing those 4 numbers would take 16 bytes[1] since J ints are 4
bytes as well as C

We can compress it down to 5 bytes if we pack all the bits together:

   ] packed=. #. _8[\ , #: 300 301 302 303
150 75 101 210 240

   $ packed
5

Those 5 bytes could be written to a file and re-read

If I knew in advance that those numbers were 9 bits, it can be reversed

   #.  _9[\, #: packed
300 301 302 303 0

We could also pack it into 8 bytes (two 32 bit integers), but I don't
see an advantage of that over packing into a single byte (8 bit)

   ] packed=. #. _32[\ , #: 300 301 302 303
2.52152e9 4.02653e9

   #.  _9[\, #: packed
300 301 302 303 0 0 0 0


This may be elementary for some but I haven't worked at this level in
years so I enjoyed playing with it




[1] - http://www.jsoftware.com/help/learning/27.htm

On Wed, Dec 31, 2014 at 11:51 AM, 'Pascal Jasmin' via Programming
<[email protected]> wrote:
> Also, if you have only 4 19 bit values, you can pack them up tighter
>
> _2 (16&#.)\ #. |: ?. 4 19 $ 2
> 109 59 161 30 136 80 64 129 56 1
>
> Non multiples of 8 would just generally leave some trailing 0s on decode, 
> which could simply be understood as invalid.
>
>
> ----- Original Message -----
> From: 'Pascal Jasmin' via Programming <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Sent: Wednesday, December 31, 2014 11:40 AM
> Subject: Re: [Jprogramming] variable integer coding (challenge)
>
> base64 is essentially taking 3 bytes and storing them as 4.  The inverse can 
> take 4 6bit values and pack them into 3 bytes.  The J code for base64 is 
> pretty generic to see how to do it for any base.  7 bit values could be 
> packed 8 at a time in 7 bytes.
>
> There is a different way to pack 8 x bit values into x bytes.  Each byte 
> would store the i'th bit of the 8 values.  This may be even faster than 
> base64 type coding.
>
>   #. |: ?. 8 19 $ 2
> 104 214 59 183 174 23 31 225 134 149 66 31 85 19 133 1 51 148 28
>
>
> ----- Original Message -----
> From: Joe Bogner <[email protected]>
> To: [email protected]
> Cc:
> Sent: Wednesday, December 31, 2014 10:02 AM
> Subject: Re: [Jprogramming] variable integer coding (challenge)
>
> Can 19 bits be stored in memory or a file on modern computers without
> taking up 3 bytes (24 bits)? The only idea that came to mind would some bit
> packing scheme with multiple items. That seems complicated unless there was
> a need to store billions of these numbers
>
> On Dec 31, 2014 9:28 AM, "Joe Bogner" <[email protected]> wrote:
>
>> Thanks. I missed the point of the variable breakup point but exploring
>> different options below illustrated it for me.
>>
>> Changing the breakup point can encode the numbers into different bit
>> lengths (other than 8) which can then be written to the file/memory.
>>
>> For example, 1021 encoded in 511 breakups would take 18 bits
>>
>>    $ #: 511 toVint 1021
>> 2 9
>>
>> Whereas encoding it as 255 would require 40 bits, which is more than a
>> 32-bit int. This is where I was failing to see the point of encoding
>> larger numbers
>>
>> NB. 5 bytes, 40 bits
>>     $ #: 255 toVint 1021
>> 5 8
>>
>> Without trying the binary representation, I was expecting the breakup
>> point to take the same space. For example, below I would have thought
>> that I'd need to write a
>>
>> NB. 6 bytes
>>    255 toVint 511 toVint 1021
>> 255 255 1 255 255 0
>>
>>
>> If I understand this correctly, I can see how variable ints could
>> shave off a few bits if the range of numbers is known in advance,
>> which is often the case.
>>
>>
>> Regarding packed strings I've been thinking about it recently as well.
>> I would like to see how it could be used to pack a table of strings
>> (rows and columns).
>>
>> The following string could be a packed representation of a 2x2 table
>>
>> FrankUSATomGermany
>>
>> The binary format could be:
>>
>> [# of rows] [# of columns] [row1-col1 length] [row1-col2 length]....
>>
>> The header for the string could be
>> 2 2 5 3 3 7
>>
>> Or maybe if you wanted to support a variable number of columns (unsure
>> of the usecase of this)
>>
>> [# of rows] [row-1 # of columns] [row1-col1 length] [row1-col2 length]
>> 2 2 5 3 2 3 7
>>
>> I don't know well it'd perform relative to boxed tables or inverted
>> tables. I would imagine the memory would be reduced for storage but
>> I'd have to test to see how much memory would be used on different
>> operations. I'd be interested in the performance of operations like
>> searching (e.g. find all the rows with a country of Germany)
>>
>> It might also be a safer way of storing user-input text without
>> worrying about escaping delimiters.
>>
>> On Tue, Dec 30, 2014 at 1:18 PM, 'Pascal Jasmin' via Programming
>> <[email protected]> wrote:
>> > I also provided some adverbs
>> >
>> > toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]'
>> > fromVint=: 1 : 'm&~: +/;.2 ]'
>> >
>> > these let you set the breakup point to what you wish
>> >
>> >   15 toVint 33 12 17
>> > 15 15 3 12 15 2
>> >   511 toVint 1233
>> > 511 511 211
>> >
>> >   511 fromVint 511 toVint 1233 444
>> > 1233 444
>> > It can provide simple compression.  If you have a list of many strings
>> where almost all of them will fit into a 6bit alphabet, you can still
>> encode a full 7 bit or 8bit alphabet.  On average, you'd save space using
>> 6bit alphabet, depending on your context.
>> >
>> > The variable string functions I showed, still let you do a substring
>> search on the packed data.  Part of it is disk and memory compression
>> (don't expand until you need to, and a string takes much less memory than a
>> list of numbers < 256), but part of it is not being forced in design to
>> determine what the impossible cutoff number is.
>> >
>> > A big advantage related to J is having binary data records.  If you
>> encode the length of the binary data, you no longer have to box them if a
>> trailing 0 or space would be difficult to know whether it is a fill or
>> intentional.
>> >
>> > Beyond binary data, there is dumb stuff such as storing addresses.  Most
>> people have just 1,  Most fit within 255 bytes.  Usually no one cares about
>> address info except when they do or already have the client record.  Do you
>> really want a separate table with 1 to many join with a possible 10000
>> character fixed size per address just because you're unsure if it will ever
>> be possible for you to store an address to an African village as a set of
>> driving instructions and then a description of the house relative to 3
>> other houses and landmarks just in case one of the other houses gets
>> painted.
>> >
>> > Another neat feature about the coding method is that if you have a lot
>> of inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least
>> compresses well.
>> >
>> > A minor thing is that the encoding still lets you do search by < or >.
>> If you are storing the number of orders per customer.  The first byte tells
>> you whether it is > 254 or less than any other number below 255 (probably
>> enough info to select customers of interest).  Another minor thing is that
>> you don't have to use the encoding codes until there is an overflow.  It
>> may be easier to just switch functions to decode variable data, then to
>> reshape terrabytes of data.
>> > ________________________________
>> > From: Joe Bogner <[email protected]>
>> > To: [email protected]
>> > Sent: Tuesday, December 30, 2014 10:45 AM
>> > Subject: Re: [Jprogramming] variable integer coding (challenge)
>> >
>> >
>> > Pascal,
>> >
>> > Thanks for sharing. Can you also share a potential use case?
>> >
>> > I had to remind myself about the size of integers (4 bytes) on 32-bit.
>> >
>> > Will this only save space on numbers less than 1019?
>> >
>> > NB. 4 bytes
>> > toVint2 1019
>> > 255 255 255 254
>> >
>> > NB. 5 bytes
>> > toVint2 1020
>> > 255 255 255 255 0
>> >
>> > Are you looking for the savings in terms of disk space (saving it to a
>> > file) or working memory?
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming
>> > <[email protected]> wrote:
>> >> A common design decision is how large to make a field.  For many
>> numbers, we think that a byte will be large enough to hold practical
>> values, but one day we will find out that it is not.  This is essentially
>> the Y2k problem, or year 2032 issue.
>> >>
>> >> A simple variable integer coding (tied to byte ranges)
>> >>
>> >> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each
>> >>
>> >>   toVint 566 44
>> >> 255 255 56 44
>> >>
>> >> It now takes 3 bytes to store 566, and one byte to store 44.
>> >>
>> >> the challenge is to write fromVint such that
>> >>
>> >>
>> >> 566 44 -: fromVint 255 255 56 44
>> >>
>> >> as an extra challenge, is there a way to write toVint or fromVint such
>> that it is faster than using intermediate boxing?
>> >> ----------------------------------------------------------------------
>> >> For information about J forums see http://www.jsoftware.com/forums.htm
>
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] variable integer coding (challenge)

Reply via email to