Re: Correct behavior when unparse does not ending on a byte boundary

Beckerle, Mike Thu, 17 Dec 2020 07:06:09 -0800

What you call option 2 is definitely the right behavior, so this is a 
significant bug in the unparser streaming API.


There should never be aligning to a byte boundary automatically except when the 
output stream is closed.

The fillByte to use for a close is certainly the fillByte of the last 
unparse call's schema. We should just capture this in the OSDOS (we probably 
already do, because of buffering output). Note that if the OSDOS is positioned 
somewhere in the middle of a byte, and you begin another unparse that begins 
with alignment to a byte boundary, the fill byte used in that case is the NEW 
fill byte of the new unparse call.
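
To make the two fill-byte rules concrete, here is a minimal standalone model 
in Python. It is not Daffodil code: pad_fragment and the two fill byte values 
are made up for illustration, and it assumes most-significant-bit-first order 
with the unwritten bit positions copied from the same positions of the fill 
byte.

```python
def pad_fragment(frag: int, frag_len: int, fill_byte: int) -> int:
    """Complete a partial byte (hypothetical helper, not a Daffodil API).

    `frag` holds `frag_len` valid bits, left-justified (MSB first). The
    remaining low-order bit positions are copied from the same positions
    of `fill_byte` (an assumption of this model).
    """
    if not 0 < frag_len < 8:
        raise ValueError("frag_len must be between 1 and 7")
    fill_mask = 0xFF >> frag_len  # bit positions not yet written
    return (frag & ~fill_mask & 0xFF) | (fill_byte & fill_mask)


# One pending 1 bit, left-justified in the fragment byte.
frag, frag_len = 0x80, 1

# Hypothetical fill bytes of two schemas sharing one output stream.
fill_dp1, fill_dp2 = 0x00, 0xFF

# Case 1: dp1 unparsed, then dp2 unparsed, then close().
# The pad comes from the last unparse call's schema, i.e. dp2.
byte_on_close = pad_fragment(frag, frag_len, fill_dp2)

# Case 2: dp1 left the stream mid-byte and dp2's unparse starts by
# aligning to a byte boundary. The pad comes from the NEW call, dp2.
byte_on_align = pad_fragment(frag, frag_len, fill_dp2)

# For contrast: what the same fragment looks like padded with dp1's fill.
byte_with_dp1 = pad_fragment(frag, frag_len, fill_dp1)

print(hex(byte_on_close), hex(byte_on_align), hex(byte_with_dp1))
```

In both cases the fill byte that matters is dp2's; dp1's fill byte only 
applies to padding that happens inside dp1's own unparse call.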

You are correct that the alignment assumed at the start is only the starting 
bit position in the OSDOS, not zero (perfect alignment).

This is also true for parse. I.e., I think the alignment optimizations are 
wrong there as well, because the root element doesn't necessarily begin at 
bit 1.

So I think this is a second bug, this one in the parser. We've gotten away 
with this because most formats are byte-centric, I guess.

But since DFDL/Daffodil is supposed to be the tool that frees people from this 
byte-centric stuff - I claim these bugs are critical priority.

We should create a group of API unit tests that match your example of just a 
single-bit message being parsed and unparsed as a stream, with closes at 
various points and two different schemas with different fill bytes.


-mikeb


________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Thursday, December 17, 2020 9:07 AM
To: dev@daffodil.apache.org <dev@daffodil.apache.org>
Subject: Correct behavior when unparse does not ending on a byte boundary

I was looking at DAFFODIL-1565 thinking it could be closed with all the
recent streaming additions. But as I thought about it more, we have a
clearly asymmetrical behavior with parse and unparse that relates to
what this bug is talking about, so now I'm not so sure.

Say we have an element that parses a single bit with no alignment, e.g.:

  <xs:element name="onebit" type="xs:int"
    dfdl:representation="binary"
    dfdl:lengthKind="explicit"
    dfdl:lengthUnits="bits"
    dfdl:length="1"
    dfdl:alignmentUnits="bits"
    dfdl:alignment="1" />

Now say we parse a file that has a single 0xFF byte as its contents,
using the --stream option in the CLI, e.g.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin

The result is eight infosets, each with <onebit>1</onebit>. This is
because with the --stream option, when a parse completes the next parse
continues at the exact bit position where the previous parse left off.

Now say we pipe this result to a call to daffodil unparse, i.e.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin | \
      daffodil unparse --stream -s onebit.dfdl.xsd -o res.bin

In this case, res.bin unexpectedly contains the following hex:

  80 80 80 80 80 80 80 80

So it contains eight bytes where the first bit of each byte is 1. This
is because at the end of each unparse call, we flush out the fragment
byte if it exists (in this case, it does--the single 1 bit) and in order
to do that we must write out a whole byte.

So the round trip is not symmetrical--parse a single byte, unparse to 8
bytes. This implies we are doing something wrong.
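
The asymmetry can be reproduced with a small standalone model in Python.
This is not Daffodil code: both functions are hypothetical, and the model
assumes most-significant-bit-first order and zero padding, which is
consistent with the 80 bytes shown above.

```python
from typing import List


def parse_one_bit_stream(data: bytes) -> List[int]:
    """Model of streaming parse: each parse call consumes exactly one bit,
    MSB first, continuing where the previous call stopped."""
    bits = []
    for byte in data:
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits


def unparse_flush_each_call(bits: List[int]) -> bytes:
    """Model of the current unparse behavior: every call flushes its
    one-bit fragment as a whole byte, zero padded on the right."""
    out = bytearray()
    for bit in bits:
        out.append((bit & 1) << 7)  # the bit lands in the MSB; 7 pad bits follow
    return bytes(out)


infosets = parse_one_bit_stream(b"\xff")
result = unparse_flush_each_call(infosets)

print(infosets)         # [1, 1, 1, 1, 1, 1, 1, 1]
print(result.hex(" "))  # 80 80 80 80 80 80 80 80
```

One input byte yields eight infosets, and eight flushing unparse calls yield
eight output bytes, so the model round trip grows from 1 byte to 8.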

I think the change we need is either 1) starting a new parse should
automatically align to a byte boundary, or 2) the end of unparse should
not write fragment bytes unless we know no more unparses will occur.

My first instinct is that option 2 is the correct behavior, but it has
API implications, which I think are at the heart of DAFFODIL-1565.

For example, we would now need a way to carry state between unparse
calls that keeps track of things like bitPosition, fragment byte,
fragment length, and bitOrder. We also need some way to tell whatever
stores this state that we are actually done and that fragment data
should be flushed to the underlying stream.

For symmetry with the parse API, the logical name for this state carrier
is OutputSourceDataOutputStream. The API would probably look something
like this:

  val os = new OutputStream(...)
  val osdos = new OutputSourceDataOutputStream(os)
  dp.unparse(infoset1, osdos) // leaves state in osdos
  dp.unparse(infoset2, osdos) // uses osdos state for initialization
  osdos.close() // flushes fragment bytes stored in osdos

So the OutputSourceDataOutputStream wraps the underlying
OutputStream/WritableByteChannel/whatever we unparse to, and also
stores fragment information. This way, if the osdos is used in a subsequent
unparse() call, it can unparse where the previous one left off.

The close() method tells the OutputSourceDataOutputStream to write the
fragment byte (if it exists) to the underlying stream. It also says that
this OSDOS cannot be used in any future calls to unparse().
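
As a rough illustration of the state carrying and close() semantics, here
is a standalone model in Python. It is only a sketch, not the proposed
Daffodil API: BitStateCarrier, put_bit, and unparse_one_bit are hypothetical
names, bits are written most significant bit first, and close() pads with
zeros because the padding value is still an open question.

```python
import io


class BitStateCarrier:
    """Hypothetical stand-in for the proposed state carrier."""

    def __init__(self, sink):
        self.sink = sink      # underlying byte-oriented stream
        self.frag = 0         # bits of the partial byte, left-justified
        self.frag_len = 0     # number of valid bits in frag (0..7)
        self.closed = False

    def put_bit(self, bit):
        """Append one bit; write a byte only once eight bits are pending."""
        if self.closed:
            raise ValueError("stream is closed")
        self.frag |= (bit & 1) << (7 - self.frag_len)
        self.frag_len += 1
        if self.frag_len == 8:
            self.sink.write(bytes([self.frag]))
            self.frag, self.frag_len = 0, 0

    def close(self):
        """Flush any fragment as a full byte (zero padded) and forbid reuse."""
        if self.frag_len > 0:
            self.sink.write(bytes([self.frag]))
            self.frag, self.frag_len = 0, 0
        self.closed = True


def unparse_one_bit(bit, carrier):
    """Model of a single unparse call for the onebit element."""
    carrier.put_bit(bit)


sink = io.BytesIO()
carrier = BitStateCarrier(sink)
for bit in [1] * 8:
    unparse_one_bit(bit, carrier)  # fragment state carries over between calls
carrier.close()

print(sink.getvalue().hex())  # ff
```

With the fragment kept in the carrier between calls, eight one-bit unparses
followed by close() give back the single 0xFF byte, so the model round trip
is symmetric.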

Some last thoughts about this approach:

1) Say the OSDOS has a fragment byte and close() is called. We must
write a full byte because the underlying OutputStream can only accept
full bytes. So we need to pad this fragment byte to a
full byte. What value do we use for this padding? The obvious choice is
probably the dfdl:fillByte property, but the OSDOS isn't tied to a
particular schema with a particular fillByte. For example, you could do
this:

  dp1.unparse(infoset1, osdos)
  dp2.unparse(infoset2, osdos)

If dp1 and dp2 have different fillByte values, which do we use,
if either? Do we just use the fill byte from the last data processor
that wrote to this stream (so the fillByte from dp2)? Or is this a special
case, and we just always pad with zeros?

2) This now affects alignment. I believe we optimize alignment with the
assumption that starting a parse/unparse is always byte aligned. With this
change, a parse/unparse can start at any bitPosition, depending on where the
previous parse/unparse left off. So this should essentially change our
alignment optimizations to say that the root element alignment is unknown
rather than known to be at the beginning of data.
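
A small standalone Python sketch of why the assumption matters. The function
name is made up for illustration and it uses 0-based bit positions; it is
not how Daffodil implements its alignment optimizations.

```python
def alignment_skip_bits(start_bit0b: int, alignment_bits: int) -> int:
    """Number of bits to skip so that a 0-based bit position becomes a
    multiple of alignment_bits (hypothetical helper)."""
    if alignment_bits <= 0:
        raise ValueError("alignment_bits must be positive")
    return (-start_bit0b) % alignment_bits


# A root element that requires byte (8-bit) alignment.
assumed = alignment_skip_bits(0, 8)  # optimizer assumes the call starts at bit 0
actual = alignment_skip_bits(3, 8)   # previous call actually stopped at bit 3

print(assumed, actual)  # 0 5
```

If the skip is optimized away because the start is assumed to be aligned,
the 5 bits needed in the second case are never skipped. With 1-bit
alignment, as in the onebit example, the skip is always 0, so the issue only
shows up when the root element has a coarser alignment than the position the
previous call left behind.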

Thoughts?
