Re: [Jprogramming] variable integer coding (challenge)

Raul Miller Wed, 31 Dec 2014 07:44:03 -0800

The answer to that is contextual. But on intel architectures, it's
often going to be: no.


Thanks,

-- 
Raul

On Wed, Dec 31, 2014 at 10:02 AM, Joe Bogner <[email protected]> wrote:
> Can 19 bits be stored in memory or a file on modern computers without
> taking up 3 bytes (24 bits)? The only idea that came to mind would some bit
> packing scheme with multiple items. That seems complicated unless there was
> a need to store billions of these numbers
> On Dec 31, 2014 9:28 AM, "Joe Bogner" <[email protected]> wrote:
>
>> Thanks. I missed the point of the variable breakup point but exploring
>> different options below illustrated it for me.
>>
>> Changing the breakup point can encode the numbers into different bit
>> lengths (other than 8) which can then be written to the file/memory.
>>
>> For example, 1021 encoded in 511 breakups would take 18 bits
>>
>>    $ #: 511 toVint 1021
>> 2 9
>>
>> Whereas encoding it as 255 would require 40 bits, which is more than a
>> 32-bit int. This is where I was failing to see the point of encoding
>> larger numbers
>>
>> NB. 5 bytes, 40 bits
>>     $ #: 255 toVint 1021
>> 5 8
>>
>> Without trying the binary representation, I was expecting the breakup
>> point to take the same space. For example, below I would have thought
>> that I'd need to write a
>>
>> NB. 6 bytes
>>    255 toVint 511 toVint 1021
>> 255 255 1 255 255 0
>>
>>
>> If I understand this correctly, I can see how variable ints could
>> shave off a few bits if the range of numbers is known in advance,
>> which is often the case.
>>
>>
>> Regarding packed strings I've been thinking about it recently as well.
>> I would like to see how it could be used to pack a table of strings
>> (rows and columns).
>>
>> The following string could be a packed representation of a 2x2 table
>>
>> FrankUSATomGermany
>>
>> The binary format could be:
>>
>> [# of rows] [# of columns] [row1-col1 length] [row1-col2 length]....
>>
>> The header for the string could be
>> 2 2 5 3 3 7
>>
>> Or maybe if you wanted to support a variable number of columns (unsure
>> of the usecase of this)
>>
>> [# of rows] [row-1 # of columns] [row1-col1 length] [row1-col2 length]
>> 2 2 5 3 2 3 7
>>
>> I don't know well it'd perform relative to boxed tables or inverted
>> tables. I would imagine the memory would be reduced for storage but
>> I'd have to test to see how much memory would be used on different
>> operations. I'd be interested in the performance of operations like
>> searching (e.g. find all the rows with a country of Germany)
>>
>> It might also be a safer way of storing user-input text without
>> worrying about escaping delimiters.
>>
>> On Tue, Dec 30, 2014 at 1:18 PM, 'Pascal Jasmin' via Programming
>> <[email protected]> wrote:
>> > I also provided some adverbs
>> >
>> > toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]'
>> > fromVint=: 1 : 'm&~: +/;.2 ]'
>> >
>> > these let you set the breakup point to what you wish
>> >
>> >   15 toVint 33 12 17
>> > 15 15 3 12 15 2
>> >   511 toVint 1233
>> > 511 511 211
>> >
>> >   511 fromVint 511 toVint 1233 444
>> > 1233 444
>> > It can provide simple compression.  If you have a list of many strings
>> where almost all of them will fit into a 6bit alphabet, you can still
>> encode a full 7 bit or 8bit alphabet.  On average, you'd save space using
>> 6bit alphabet, depending on your context.
>> >
>> > The variable string functions I showed, still let you do a substring
>> search on the packed data.  Part of it is disk and memory compression
>> (don't expand until you need to, and a string takes much less memory than a
>> list of numbers < 256), but part of it is not being forced in design to
>> determine what the impossible cutoff number is.
>> >
>> > A big advantage related to J is having binary data records.  If you
>> encode the length of the binary data, you no longer have to box them if a
>> trailing 0 or space would be difficult to know whether it is a fill or
>> intentional.
>> >
>> > Beyond binary data, there is dumb stuff such as storing addresses.  Most
>> people have just 1,  Most fit within 255 bytes.  Usually no one cares about
>> address info except when they do or already have the client record.  Do you
>> really want a separate table with 1 to many join with a possible 10000
>> character fixed size per address just because you're unsure if it will ever
>> be possible for you to store an address to an African village as a set of
>> driving instructions and then a description of the house relative to 3
>> other houses and landmarks just in case one of the other houses gets
>> painted.
>> >
>> > Another neat feature about the coding method is that if you have a lot
>> of inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least
>> compresses well.
>> >
>> > A minor thing is that the encoding still lets you do search by < or >.
>> If you are storing the number of orders per customer.  The first byte tells
>> you whether it is > 254 or less than any other number below 255 (probably
>> enough info to select customers of interest).  Another minor thing is that
>> you don't have to use the encoding codes until there is an overflow.  It
>> may be easier to just switch functions to decode variable data, then to
>> reshape terrabytes of data.
>> > ________________________________
>> > From: Joe Bogner <[email protected]>
>> > To: [email protected]
>> > Sent: Tuesday, December 30, 2014 10:45 AM
>> > Subject: Re: [Jprogramming] variable integer coding (challenge)
>> >
>> >
>> > Pascal,
>> >
>> > Thanks for sharing. Can you also share a potential use case?
>> >
>> > I had to remind myself about the size of integers (4 bytes) on 32-bit.
>> >
>> > Will this only save space on numbers less than 1019?
>> >
>> > NB. 4 bytes
>> > toVint2 1019
>> > 255 255 255 254
>> >
>> > NB. 5 bytes
>> > toVint2 1020
>> > 255 255 255 255 0
>> >
>> > Are you looking for the savings in terms of disk space (saving it to a
>> > file) or working memory?
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming
>> > <[email protected]> wrote:
>> >> A common design decision is how large to make a field.  For many
>> numbers, we think that a byte will be large enough to hold practical
>> values, but one day we will find out that it is not.  This is essentially
>> the Y2k problem, or year 2032 issue.
>> >>
>> >> A simple variable integer coding (tied to byte ranges)
>> >>
>> >> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each
>> >>
>> >>   toVint 566 44
>> >> 255 255 56 44
>> >>
>> >> It now takes 3 bytes to store 566, and one byte to store 44.
>> >>
>> >> the challenge is to write fromVint such that
>> >>
>> >>
>> >> 566 44 -: fromVint 255 255 56 44
>> >>
>> >> as an extra challenge, is there a way to write toVint or fromVint such
>> that it is faster than using intermediate boxing?
>> >> ----------------------------------------------------------------------
>> >> For information about J forums see http://www.jsoftware.com/forums.htm
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] variable integer coding (challenge)

Reply via email to