Philippe, quote the entire section:

 

   In some circumstances, the use of padding ("=") in base-encoded data
   is not required or used.  In the general case, when assumptions about
   the size of transported data cannot be made, padding is required to
   yield correct decoded data.

   Implementations MUST include appropriate pad characters at the end of
   encoded data unless the specification referring to this document
   explicitly states otherwise.

 

The first para clarifies that padding is required when the length is not 
otherwise known. Only if the length is provided or predefined can the padding 
be dropped.

The second para clarifies it must be included unless the higher level protocol 
states otherwise, in which case it is likely using another mechanism to define 
length.
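
To illustrate the point about length: when the embedding protocol already
delimits the encoded text, the number of missing "=" characters can be
recomputed from its length alone. The following is a minimal sketch in Python;
the helper names strip_padding and restore_padding are hypothetical and not
taken from the RFC.

```python
import base64


def strip_padding(encoded: str) -> str:
    """Drop trailing "=" characters, as a padding-free profile would."""
    return encoded.rstrip("=")


def restore_padding(unpadded: str) -> str:
    """Re-add "=" so that the length is again a multiple of 4."""
    return unpadded + "=" * (-len(unpadded) % 4)


for payload in (b"m", b"me", b"meo", b"meow"):
    padded = base64.b64encode(payload).decode("ascii")
    unpadded = strip_padding(padded)
    # The padding carries no information once the total length is known.
    assert restore_padding(unpadded) == padded
    assert base64.b64decode(restore_padding(unpadded)) == payload
```

This only works because the end of the encoded text is known; in a stream of
concatenated pieces with no delimiter, the padding is what marks a short piece.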

 

It doesn’t seem to me to be as open ended as you implied in your initial mails, 
but well-defined depending on whether base64 is being used as spec’d in the 
RFC, or being explicitly modified to suit an embedding protocol.

And certainly the first sentence in this section isn’t intended to be taken 
without the context of the rest of the section.

 

tex

 

 

 

From: Philippe Verdy [mailto:verd...@wanadoo.fr] 
Sent: Monday, October 15, 2018 4:14 AM
To: Tex Texin
Cc: Adam Borowski; unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 

Look at https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1, first 
sentence, where it is explicitly stated:

 

In some circumstances, the use of padding ("=") in base-encoded data is not 
required or used.

 

On Mon, Oct 15, 2018 at 03:56, Tex <texte...@xencraft.com> wrote:

Philippe,

 

Where is the use of whitespace or the idea that 1-byte pieces do not need all 
the equal sign paddings documented?

I read the rfc 3501 you pointed at, I don’t see it there.

 

Are these part of any standards? Or are you claiming these are practices 
despite the standards? If so, are these just tolerated by parsers, or are they 
actually generated by encoders?

 

What would be the rationale for supporting unnecessary whitespace? If 
linebreaks are forced at some line length they can presumably be removed at 
that length and not treated as part of the encoding.

Maybe we differ on where the encoding begins and ends, and on where higher-level 
protocols prescribe how it is embedded within the protocol.

 

Tex

 

 

 

 

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy 
via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields 
different base64 texts ... true or false?

 

Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough 
to indicate the end of an octets-span. The extra = after it does not add any 
other octet. You are also allowed to insert whitespace anywhere in the encoded 
stream (this is what ensures that the Base64-encoded octets-stream will not be 
altered if line breaks are forced anywhere, notably within the body of 
emails).

 

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, 
NEL) in the middle is non-significant and ignorable on decoding: its "encoded" 
bit length is 0 and it does not terminate an octets-span, unlike "=", which 
discards any preceding bits of the encoded stream that do not fill a complete 
octet.
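
As a small illustration of the whitespace point, here is a sketch using
Python's standard base64 module. It assumes the decoder's default lenient mode,
which skips characters outside the Base64 alphabet; other decoders may be
stricter, so this shows one implementation's behaviour rather than a guarantee.

```python
import base64

data = bytes(range(100))

# encodebytes() wraps MIME-style: a newline after every 76 output characters.
wrapped = base64.encodebytes(data)
assert wrapped.count(b"\n") > 1

# Add more whitespace (space, tab, CR) around each line break.
noisy = wrapped.replace(b"\n", b" \t\r\n")

# In its default (non-validating) mode the decoder ignores all of it.
assert base64.b64decode(wrapped) == data
assert base64.b64decode(noisy) == data
```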

 

Also:

- For 1-octet pieces the minimum format is "XX= ", but the 2nd "X" symbol 
before "=" can vary in its 4 lowest bits (which are then ignored/discarded by 
the "=" symbol).

- For 2-octet pieces the minimum format is "XXX= ", but the 3rd "X" symbol 
before "=" can vary in its 2 lowest bits (which are then ignored/discarded by 
the "=" symbol).
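
The ignored low bits can be checked with a short sketch. It uses fully padded
forms and Python's lenient default decoder; the variant lists are written out
by hand for the octets "A" and "AB", and the helper name decodes_to is
hypothetical. A stricter decoder may reject these forms, since RFC 4648
section 3.5 allows decoders to refuse input whose pad bits are not zero.

```python
import base64

# Canonical encodings, with the unused low bits set to zero.
assert base64.b64encode(b"A") == b"QQ=="
assert base64.b64encode(b"AB") == b"QUI="

# Same payload, different values in the 4 unused bits of the 2nd symbol.
one_octet_variants = [b"QQ==", b"QR==", b"QZ==", b"Qf=="]

# Same payload, different values in the 2 unused bits of the 3rd symbol.
two_octet_variants = [b"QUI=", b"QUJ=", b"QUK=", b"QUL="]


def decodes_to(variants, expected):
    """True if every variant decodes to the expected octets."""
    return all(base64.b64decode(v) == expected for v in variants)


assert decodes_to(one_octet_variants, b"A")
assert decodes_to(two_octet_variants, b"AB")
```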

 

So you can use Base64 by encoding each octet as a separate piece, as two Base64 
symbols followed by an "=" symbol, and even insert any amount of whitespace 
between them: there is an infinite number of valid Base64 encodings 
representing the same octets-stream payload.

 

Base64 allows encoding any octets-stream but not directly any bits-stream: it 
assumes that the effective bits-stream has a bit length that is a multiple of 
8. To encode a bits-stream with an exact number of bits (not a multiple of 8), 
you need to encode an extra payload to indicate the effective number of bits to 
keep at the end of the encoded octets-stream (or at the start):

- Base64 does not specify how you convert a bitstream of arbitrary length to an 
octets-stream;

- for that purpose, you may need to pad the bits-stream at start or at end with 
1 to 7 bits (so that the resulting bitstream has a length that is a multiple of 
8 and is then encodable with Base64, which takes only octets on input);

- these extra padding bits are not significant for the original bitstream, but 
are significant for the Base64 encoder/decoder: they will be discarded by the 
bitstream decoder built on top of the Base64 decoder, but not by the Base64 
decoder itself.

 

You need to encode somewhere with the bitstream encoder how many padding bits 
(0 to 7) are present at the start or end of the octets-stream; this can be done:

- as a separate payload (not encoded by Base64), or

- by prepending 3 bits at the start of the bits-stream, which is then padded at 
the end with 1 to 7 random bits to get a bit length that is a multiple of 8, 
suitable for Base64 encoding, or

- by appending 3 bits at the end of the bits-stream, just after the 1 to 7 
random bits needed to get a bit length that is a multiple of 8, suitable for 
Base64 encoding.

Finally, your bits-stream decoder will be able to use this padding count to 
discard these random padding bits (and possibly realign the stream on different 
byte boundaries when the effective bit length of the bits-stream payload is not 
a multiple of 8 and padding bits were added).
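
The "prepend 3 bits" option could be sketched as follows. This is only one
possible framing under stated assumptions: bits are given as a string of "0"
and "1", they are packed most significant bit first, and the padding bits are
zeros rather than random. The names encode_bits and decode_bits are
hypothetical.

```python
import base64


def encode_bits(bits: str) -> str:
    """Frame a bit string as <3-bit pad count><payload><pad bits>, then Base64 it."""
    pad = -(3 + len(bits)) % 8            # 0..7 bits needed to reach an octet boundary
    framed = format(pad, "03b") + bits + "0" * pad
    raw = int(framed, 2).to_bytes(len(framed) // 8, "big")
    return base64.b64encode(raw).decode("ascii")


def decode_bits(text: str) -> str:
    """Undo encode_bits: Base64-decode, read the pad count, drop the pad bits."""
    raw = base64.b64decode(text)
    framed = "".join(format(octet, "08b") for octet in raw)
    pad = int(framed[:3], 2)
    return framed[3:len(framed) - pad]


# Round trip for every length from 0 to 19 bits.
for n in range(20):
    payload = ("10" * n)[:n]
    assert decode_bits(encode_bits(payload)) == payload
```

The Base64 layer never sees the pad count as anything special: it is the
bits-stream layer on top that interprets the first 3 bits and discards the tail.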

 

Base64 also does not specify how bits of the original bits-stream payload are 
packed into the octets-stream input suitable for Base64 encoding; notably, it 
does not specify their order and endianness. The same remark applies to MIME 
and HTTP. So a lot of network protocols and file formats need to encode which 
of the possible options is used for bits-streams of arbitrary length, or need 
to specify which default choice applies if this option is not encoded, or which 
option must be used (with no possible variation). This also adds to the number 
of distinct encodings that are possible but still equivalent for the same 
effective bits-stream payload.

 

All these allowed variations are from the encoder's perspective. For 
interoperability, the decoder has to be flexible and support various options to 
be compatible with different implementations of the encoder, notably when the 
encoder was run on a different system. This is the case for MIME transport by 
mail, for HTTP and FTP transports, and for file/media storage formats, even if 
the file is stored on the same system (it may actually be a local copy of a 
file that was encoded on another system).

 

Now if we come back to the encoding of plain-text payloads, Unicode just 
specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code 
points. It does not actually mandate an exact bit length, because the range 
does not exactly fill 21 bits and an encoder can still pack multiple code 
points together into more compact code units.

 

However Unicode provides and standardizes several encodings (UTF-8/16/32) which 
use code units whose size is directly suitable as input for an octets-stream, 
so that they are directly encodable with Base64, without having to specify an 
extra layer for the bits-stream encoder/decoder.
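
A brief sketch of that point: the same text, serialized with different Unicode 
encoding forms, gives different octets and therefore different Base64 text, 
while each one decodes back to the same string. The sample character and the 
explicit little-endian variants (chosen here to avoid a byte order mark) are 
assumptions for the example.

```python
import base64

text = "é"
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Base64 text for each encoding form of the same string.
encoded = {
    name: base64.b64encode(text.encode(name)).decode("ascii")
    for name in encodings
}

# Every form decodes back to the same text...
for name, b64 in encoded.items():
    assert base64.b64decode(b64).decode(name) == text

# ...but the Base64 texts themselves all differ.
assert len(set(encoded.values())) == len(encodings)
```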

 

But many other encodings are still possible (and can be conforming to Unicode, 
provided they preserve each Unicode scalar value, or at least the code point 
identity because an encoder/decoder is not required to support non-character 
code points such as surrogates or U+FFFE), where Base64 may be used for 
internally generated octets-streams.

 

 

On Sun, Oct 14, 2018 at 03:47, Adam Borowski via Unicode <unicode@unicode.org> 
wrote:

On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> On Sat, Oct 13, 2018 at 18:58, Steffen Nurpmeso via Unicode <
> unicode@unicode.org> wrote:
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken about the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into 3-byte internal, 4-byte external packets.
Any packet may contain less than those 3 bytes, in which cases it is padded
with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow"     is bWVvdw==
"me""ow"   is bWU=b3c=
yet both carry the same payload.
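
The example above can be reproduced with a short Python sketch. The helper
names encode_pieces and decode_groups are hypothetical. Some decoders stop at
the first padded group, so this sketch decodes each 4-character group on its
own instead of handing the concatenated text to a decoder in one call.

```python
import base64


def encode_pieces(pieces):
    """Base64-encode each piece separately and concatenate the results."""
    return "".join(base64.b64encode(p).decode("ascii") for p in pieces)


def decode_groups(text):
    """Decode every 4-character group independently and join the octets."""
    return b"".join(
        base64.b64decode(text[i:i + 4]) for i in range(0, len(text), 4)
    )


whole = encode_pieces([b"meow"])         # one piece
split = encode_pieces([b"me", b"ow"])    # two pieces, each padded

assert whole == "bWVvdw=="
assert split == "bWU=b3c="
assert decode_groups(whole) == decode_groups(split) == b"meow"
```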

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


ᛗᛖᛟᚹ
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄⠀⠀⠀⠀ and 1 who narrowly avoided an off-by-one error.
