Re: [Pharo-users] Bitwise operations in ByteArray (slicing, etc.)

Richard O'Keefe Tue, 06 Mar 2018 05:47:19 -0800

Re "Strings are wrong", the influences on my thinking went like this.


(1) As an undergraduate, I did exercises with the Euler compiler
    (written in Burroughs Algol) and the XPL compiler (written in XPL).
    XPL had immutable strings, rather like AWK or Java, and a PL/I-ish
    set of operations on them.  Burroughs Algol didn't have strings,
    but did have pointers, that you could use to stream through arrays.
    It turned out to be notably easier to do tokenising in Algol than
    in XPL.

(2) A friend of mine, while a masters student, made some pocket money
    by writing a "batch editor" (same application domain as sed, but
    different command set) for IBM mainframes under MVS, which he did
    using PL/I.  PL/I fans used to point to its string operations as
    a good collection, but he had to fight PL/I every step of the way
    to get the string operations he needed.  When I met UNIX V7 on a
    PDP-11, it took me 15 pages of C to accomplish what he needed
    more than 50 pages of PL/I to do.  And my editor had undo...
    The trick was being able to make just the right text abstractions
    that I needed.

(3) A company I worked at had a program that generated assembly code
    for several different machines from a common specification.  The
    original version manipulated instructions as *trees*.  A new hire
    was asked to write a new generator because the old one was
    getting harder to maintain and we wanted it faster.  His program
    represented instructions as *strings*.  It turned out to be
    slower and harder to maintain than the original.

(4) I encountered Eugene Myers' AVL DAG data structure (which quite
    efficiently lets you maintain many versions of a large text) and
    the Rope data structure (basically a lazy concatenation tree)
    that accompanies the Boehm garbage collector.  Both of them get
    a lot of efficiency out of NOT doing concatenations.

(5) Learning Java, I found that Java had taken a bullet to the head
    which Smalltalk had dodged, and which had been very clearly
    explained in the coloured books.  Consider the selectors
    #printOn: and #printString.  There are two ways to implement
    them: the smart (Smalltalk) and the spectacularly stupid (Java).
    Smart way:  #printOn: is basic, #printString is derived.
    Stupid way: #printString is basic and #printOn: is derived.
    Why is this smart?  If you are generating several MB of text
    *in order to write it to an external sink*, then #printOn:
    does the job very efficiently.  But #printString not only
    turns over large amounts of memory, giving the GC more work
    than it really needs, it is fatally easy to fall into
    quadratic time behaviour.

(5) There's the insufficiently famous epigram by Alan Perlis:
    "The string is a stark data structure,
     and everywhere it is passed there is much duplication of process.
     It is a perfect vehicle for hiding information."
    One way to understand this is that it's quite easy to put
    information *into* a string, but it can be quite hard to get
    it out again.  I have even run into examples where the
    getting-it-out-reliably part was accidentally impossible.
    What the heck, let's take Dates as an example (not of impossibility).
    What does '180612' mean?  What does dateString copyFrom: 3 to: 4
    give you?  But aData monthNumber is hard to misunderstand.

(6) Then there's the whole scripting attack (Little Bobby Tables)
    thing.

The rule of thumb that I use is that if you are just accepting a
datum, maybe stashing it somewhere, and eventually giving it back,
then a string is fine.  For example, considering the huge variety
in the way people's names may be structured, strings are pretty
much the ideal way to represent them.  You *shouldn't* be extracting
"the middle initial" because some people don't *have* one and some
people have more than one.  (I am sick of my name being rejected as
invalid by American-designed programs that insist that a surname
cannot have an apostrophe or two capital letters.  No Irish need
apply.)  But if you want to process structured information, you
want to get the data out of string form and into some kind of tree
as soon as you can.  In fact, you don't want a string form in the
first place.  I have seen people manipulating XML with string
operations, and it's not a pretty sight.

And when we're talking about something like (REAL) object-oriented
programming in a language like Pharo, we really should be thinking
in terms of objects other than strings. For example, if we take
VMS as an example, [.FOO]BAR.TXT;1 and <.FOO>BAR.TXT.1 are the
same *filename* while they are different *strings*, and resolving
the filename relative to DSK:[ICK.ACK] isn't terribly similar to
string concatenation.  (Alphabetic case?  Let's not go there.)
This is precisely why there are classes like FileLocator and Path.



On 6 March 2018 at 16:33, Esteban A. Maringolo <emaring...@gmail.com> wrote:

> Hi Richard,
>
> Certainly a BitStream is beyond what I initially thought.
>
> But it could be of some use to play with it, is it available somewhere
> with a friendly open source license?
>
> I might give it a try using it for my generator (tests are already
> passing). Your example is almost complete, but if I understand it
> you're adding the size of the checksum instead of the checksum itself.
> Nonetheless I think its readability is superior, and I guess that it
> is better perfomance and memorywise.
>
> Also it could be useful for Base58Encoder I'm expermenting with.
> Encoding is "simple" but requires the use of a really large integer,
> I'm stuck at the decoding part now.
> After that there is a Base32 encoder (to implement a Bech32 encoder as
> well).
>
> So I might use it in my road of experimentation with these matters.
> Unless I diverge from this and abandon it as it normally happens. :/
>
> Regarding this part:
> > This reminds me of a lesson I learned many years
> > ago: STRINGS ARE WRONG.  (Thank you, designers of
> > Burroughs Extended Algol!)  When trees aren't the
> > answer, streams often are.
>
> Can you provide more context to this? I wouldn't mind if it is in a
> separate thread or a blog post of your own.
>
> Thanks in advance.
>
> Esteban A. Maringolo
>
> 2018-03-05 21:55 GMT-03:00 Richard O'Keefe <rao...@gmail.com>:
> > I note that the specification in question does not
> > deal with arbitrary bit strings but with "entropies"
> > that are 128 to 256 bits long and a multiple of 32 bits.
> > 4 to 8 bits are copied from the front to the end.
> > (So selecting *this* bit field can be done by taking
> > the first byte of a ByteArray.)  This makes the sequence
> > 132 to 264 bits.  This is then chopped into 11 bit
> > subsequences.  These are not arbitrary subsequences and
> > they are not taken in arbitrary order.  They are a stream.
> >
> > My own Smalltalk library include BitInputStream and
> > BitOutputStream, wrapping byte streams.  So we could
> > do something like
> >
> >    ent := aByteArray size * 8.  "BIP-39 ENT"
> >    cs  := ent // 32.            "BIP-39 CS"
> >    foo := ByteArray new: (ent + cs) // 8.
> >    o   := BitOutputStream on: foo writeStream.
> >    i   := BitInputStream on: aByteArray readStream.
> >    1 to: ent do: [:x | o nextPut: i next].
> >    i reset.
> >    o nextUnsigned: cs put: (i nextUnsigned: cs).
> >    i close.
> >    o close.
> >    i := BitInputStream on: foo readStream.
> >    ans := (1 to: (ent + cs) // 11) collect: [:x |
> >       WordList at: 1 + (i nextUnsigned: 11)].
> >
> > Stare at this for a bit, and you realise that you don't
> > actually need the working byte array foo.
> > ByteArray
> >   methods for: 'bitcoin'
> >     mnemonic
> >       |ent csn i t|
> >       (ent between: 128 and: 256)
> >         ifFalse: [self error: 'wrong size for BIP-39'].
> >       cs  := ent // 32.
> >       n   := (ent + cs) // 11.
> >       i   := BitInputStream on: (ReadStream on: self).
> >       t   := i nextUnsigned: cs.
> >       i reset.
> >       ^(1 to: n) collect: [:index |
> >         WordList at: 1 + (index = n
> >           ifTrue:  [((i nextUnsigned: 11 - cs) bitShift: cs) bitOr: t]
> >           ifFalse: [i nextUnsigned: 11])]
> >
> > My BitInputStream and BitOutputStream classes are,
> > um, not really mature.  They aren't *completely*
> > naive, but they could be a lot better, and in
> > particular, BitInputStream>>nextUnsigned: and
> > BitOutputStream>>nextUnsigned:put: are definitely
> > suboptimal.  I put this out there just to suggest
> > that there is a completely different way of thinking
> > about the problem.  (Actually, this isn't *entirely*
> > unlike using Erlang bit syntax.)
> >
> > Bit*Streams are useful enough to justify primitive
> > support.  (Which my classes don't have yet.  I did
> > say they are not mature...)
> >
> > This reminds me of a lesson I learned many years
> > ago: STRINGS ARE WRONG.  (Thank you, designers of
> > Burroughs Extended Algol!)  When trees aren't the
> > answer, streams often are.
> >
> >
> >
> >
> > On 6 March 2018 at 07:21, Esteban A. Maringolo <emaring...@gmail.com>
> wrote:
> >>
> >> 2018-03-05 14:02 GMT-03:00 Stephane Ducasse <stepharo.s...@gmail.com>:
> >> > On Sun, Mar 4, 2018 at 9:43 PM, Esteban A. Maringolo
> >> > <emaring...@gmail.com> wrote:
> >> >> 2018-03-04 17:15 GMT-03:00 Sven Van Caekenberghe <s...@stfx.eu>:
> >> >>> Bits are actually numbered from right to left (seen from how they
> are
> >> >>> printed).
> >> >>
> >> >> I understand bit operations, used it extensively with IP address eons
> >> >> ago.
> >> >>
> >> >> But if a spec says: "Take the first n bits from the hash", it means
> >> >> the first significant bits.
> >> >> so in 2r100101111 the first 3 bits are "100" and not "111".
> >> >
> >> > naive question: why?
> >>
> >> Because it says so.
> >> "A checksum is generated by taking the first ENT / 32 bits of its
> >> SHA256 hash. This checksum is appended to the end of the initial
> >> entropy.
> >> Next, these concatenated bits are split into groups of 11 bits, each
> >> encoding a number from 0-2047, serving as an index into a wordlist.
> >> Finally, we convert these numbers into words and use the joined words
> >> as a mnemonic sentence." [1].
> >>
> >> > To me it looks like a lousy specification.
> >>
> >> It might be, I can't really tell.
> >>
> >> But such manipulation could be useful if you are knitting different
> >> parts of a binary packet whose boundaries are not at byte level, but
> >> bit instead. So you can take "these 5 bits, concatenate with this
> >> other 7, add 13 zero bits, then 1 followed by the payload". I'm
> >> assuming a non real case here though, my use case was fulfilled
> >> already.
> >>
> >> Regards!
> >>
> >> --
> >> Esteban A. Maringolo
> >>
> >> [1]
> >> https://github.com/bitcoin/bips/blob/master/bip-0039.
> mediawiki#generating-the-mnemonic
> >>
> >
>
>

Re: [Pharo-users] Bitwise operations in ByteArray (slicing, etc.)

Reply via email to