I was looking at DAFFODIL-1565 thinking it could be closed with all the
recent streaming additions. But as I thought about it more, we have a
clearly asymmetrical behavior with parse and unparse that relates to
what this bug is talking about, so now I'm not so sure.

Say we have an element that parses a single bit with no alignment, e.g.:

  <xs:element name="onebit" type="xs:int"
    dfdl:representation="binary"
    dfdl:lengthKind="explicit"
    dfdl:lengthUnits="bits"
    dfdl:length="1"
    dfdl:alignmentUnits="bits"
    dfdl:alignment="1" />

Now say we parse a file that has a single 0xFF byte as its contents,
using the --stream option in the CLI, e.g.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin

The result is eight infosets, each containing <onebit>1</onebit>. This
is because with the --stream option, when one parse completes the next
parse continues at the exact bit position where the previous parse left
off.

Now say we pipe this result to a call to daffodil unparse, i.e.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin | \
      daffodil unparse --stream -s onebit.dfdl.xsd -o res.bin

In this case, res.bin unexpectedly contains the following hex:

  80 80 80 80 80 80 80 80

So it contains eight bytes where the first bit of each byte is 1. This
is because at the end of each unparse call, we flush out the fragment
byte if it exists (in this case, it does--the single 1 bit) and in order
to do that we must write out a whole byte.
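
To make that concrete, here is a minimal sketch (assuming
mostSignificantBitFirst bit order, which is what the 0x80 result
implies) of how the single 1-bit fragment gets flushed as a full byte:

  // the single parsed/unparsed bit and the number of bits in the fragment
  val fragmentBits = 1
  val fragmentLength = 1
  // place the fragment in the most significant bits and pad the rest with 0s
  val flushed = (fragmentBits << (8 - fragmentLength)).toByte
  // flushed == 0x80.toByte, and one such byte is written per unparse call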

So the round trip is not symmetrical--parse consumes a single byte, but
unparse produces 8 bytes. This implies we are doing something wrong.

I think the change we need is either 1) starting a new parse should
automatically align to a byte boundary, or 2) the end of unparse should
not write fragment bytes unless we know no more unparses will occur.

My first instinct is that option 2 is the correct behavior, but it has
API implications, which I think are at the heart of DAFFODIL-1565.

For example, we would now need a way to carry state between unparse
calls that keeps track of things like bitPosition, fragment byte,
fragment length, and bitOrder. We also need some way to tell whatever
stores this state that we are actually done and that fragment data
should be flushed to the underlying stream.

For symmetry with the parse API, the logical name for this state carrier
is OutputSourceDataOutputStream. The API would probably look something
like this:

  val os = new FileOutputStream(...) // any underlying OutputStream
  val osdos = new OutputSourceDataOutputStream(os)
  dp.unparse(infoset1, osdos) // leaves state in osdos
  dp.unparse(infoset2, osdos) // uses osdos state for initialization
  osdos.close() // flushes fragment bytes stored in osdos

So the OutputSourceDataOutputStream wraps the underlying
OutputStream/WritableByteChannel/whatever we unparse to, and also
stores the fragment information. This way, if osdos is used in a
subsequent unparse() call, unparsing can continue where the previous
call left off.

The close() method tells the OutputSourceDataOutputStream to write the
fragment byte (if it exists) to the underlying stream. It also means
that this OSDOS cannot be used in any future calls to unparse().
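
To sketch what that might look like (purely hypothetical--the class
shape, field names, and close() behavior below are just illustrations
of the state described above, not an existing Daffodil API):

  import java.io.OutputStream

  class OutputSourceDataOutputStream(out: OutputStream) {
    var bitPosition: Long = 0          // absolute bit position written so far
    var fragmentByte: Int = 0          // partial byte not yet written out
    var fragmentLength: Int = 0        // number of valid bits in fragmentByte
    var bitOrder: String = "MSBFirst"  // placeholder; needed to place fragment bits
    private var closed: Boolean = false

    // flush the fragment (padded out to a whole byte) and forbid further unparses
    def close(): Unit = {
      if (fragmentLength > 0) out.write(fragmentByte) // padding policy: see note 1 below
      out.close()
      closed = true
    }
  }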

Some last thoughts about this approach:

1) Say the OSDOS has a fragment byte and close() is called. We must
write a full byte because the underlying OutputStream can only accept
full bytes, so we need to pad this fragment byte out to a full byte.
What value do we use for the padding? The obvious choice is probably
the dfdl:fillByte property, but the OSDOS isn't tied to a particular
schema with a particular fillByte. For example, you could do this:

  dp1.unparse(infoset1, osdos)
  dp2.unparse(infoset2, osdos)

If dp1 and dp2 have different fillByte values, which do we use, if
either? Do we just use the fillByte from the last data processor that
wrote to this stream (so the fillByte from dp2)? Or is this a special
case where we just always pad with zeros?
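
For illustration, here is a hedged sketch of that padding decision at
the bit level (padFragment is a hypothetical helper, and MSB-first bit
order is assumed):

  // pad an MSB-first fragment of fragLen bits out to a whole byte,
  // filling the uncovered low-order bits from some chosen fill byte
  def padFragment(fragment: Int, fragLen: Int, fillByte: Int): Byte = {
    val fillMask = 0xFF >>> fragLen
    (((fragment << (8 - fragLen)) & 0xFF) | (fillByte & fillMask)).toByte
  }

  padFragment(1, 1, 0x00)  // 0x80 -- pad with zeros
  padFragment(1, 1, 0xFF)  // 0xFF -- pad with a fillByte of 0xFF, say dp2's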

2) This now affects alignment. I believe we optimize alignment with the
assumption that starting a parse/unparse is always byte aligned. If a
parse/unparse can instead start at any bitPosition, based on where the
previous parse/unparse ended, that assumption no longer holds. So this
change would essentially require our alignment optimizations to treat
the root element alignment as unknown rather than known to be at the
beginning of the data.
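
As a small illustration (alignmentSkipBits is just a hypothetical
helper): when the start is known to be byte aligned, the skip to a byte
boundary is statically zero and can be optimized away, but if the start
can be mid-byte the skip has to be computed at runtime:

  // bits of skip needed to reach the next byte boundary from a start offset
  def alignmentSkipBits(startBitOffset: Long): Long =
    (8 - (startBitOffset % 8)) % 8

  alignmentSkipBits(0)  // 0 -- byte-aligned start, skip can be optimized away
  alignmentSkipBits(3)  // 5 -- mid-byte start, must be handled at runtime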

Thoughts?
