I was looking at DAFFODIL-1565, thinking it could be closed with all the recent streaming additions. But as I thought about it more, we have clearly asymmetric behavior between parse and unparse that relates to what this bug is talking about, so now I'm not so sure.

Say we have an element that parses a single bit with no alignment, e.g.:

  <xs:element name="onebit" type="xs:int"
    dfdl:representation="binary"
    dfdl:lengthKind="explicit" dfdl:lengthUnits="bits" dfdl:length="1"
    dfdl:alignmentUnits="bits" dfdl:alignment="1" />

Now say we parse a file that has a single 0xFF byte as its contents, using the --stream option in the CLI, e.g.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin

The result is eight infosets, each with <onebit>1</onebit>. This is because with the --stream option, when a parse completes, the next parse continues at the exact bit position where the previous parse left off.

Now say we pipe this result to a call to daffodil unparse, i.e.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin | daffodil unparse --stream -s onebit.dfdl.xsd -o res.bin

In this case, res.bin unexpectedly contains the following hex:

  80 80 80 80 80 80 80 80

So it contains eight bytes where the first bit of each byte is 1. This is because at the end of each unparse call, we flush out the fragment byte if it exists (in this case it does: the single 1 bit), and in order to do that we must write out a whole byte.

So the round trip is not symmetrical: we parse a single byte and unparse to 8 bytes. This implies we are doing something wrong. I think the change we need is either:

1) starting a new parse should automatically align to a byte boundary, or

2) the end of unparse should not write fragment bytes unless we know no more unparses will occur.

My first instinct is that option 2 feels like the correct behavior, but it has API implications, which I think is at the heart of DAFFODIL-1565. For example, we would now need a way to carry state between unparse calls that keeps track of things like bitPosition, fragment byte, fragment length, and bitOrder. We also need some way to tell whatever stores this state that we are actually done and that fragment data should be flushed to the underlying stream.
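
The asymmetry can be reproduced without Daffodil in a few lines. This is only a model of the behavior described above, not Daffodil code: parseBits and unparseFlushingEachCall are made-up names, and most-significant-bit-first bit order is assumed (consistent with the 80 bytes seen in res.bin).

```scala
import java.io.ByteArrayOutputStream

object FragmentFlushDemo {
  // Models "parse --stream" with the onebit schema: one infoset value per bit.
  // Assumption: bits are consumed most-significant-bit first.
  def parseBits(data: Array[Byte]): Seq[Int] =
    for {
      b <- data.toSeq
      i <- 7 to 0 by -1
    } yield (b >> i) & 1

  // Models the current unparse behavior: every call starts on a byte
  // boundary and flushes its fragment byte (padded with zeros) when it ends.
  def unparseFlushingEachCall(bits: Seq[Int]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    bits.foreach(bit => out.write((bit & 1) << 7))
    out.toByteArray
  }

  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    val parsed = parseBits(Array(0xFF.toByte))
    println(parsed.mkString(" "))               // 1 1 1 1 1 1 1 1
    println(hex(unparseFlushingEachCall(parsed))) // 80 80 80 80 80 80 80 80
  }
}
```

One input byte becomes eight output bytes purely because each simulated unparse call pads and flushes on its own.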

For symmetry with the parse API, the logical name for this state carrier is OutputSourceDataOutputStream. The API would probably look something like this:

  val os = new OutputStream(...)
  val osdos = new OutputSourceDataOutputStream(os)
  dp.unparse(infoset1, osdos) // leaves state in osdos
  dp.unparse(infoset2, osdos) // uses osdos state for initialization
  osdos.close()               // flushes fragment bytes stored in osdos

So the OutputSourceDataOutputStream wraps the underlying OutputStream/WritableByteChannel/whatever we unparse to, and also stores the fragment information. This way, if osdos is used in a subsequent unparse() call, it can unparse where the previous call left off. The close() method tells the OutputSourceDataOutputStream to write the fragment byte (if it exists) to the underlying stream. It also means that this OSDOS cannot be used in any future calls to unparse().

Some last thoughts about this approach:

1) Say the OSDOS has a fragment byte and close() is called. We must write a full byte because the underlying OutputStream can only accept full bytes, so we need to pad this fragment byte to a full byte. What value do we use for this padding? The obvious choice is probably the dfdl:fillByte property, but the OSDOS isn't tied to a particular schema with a particular fillByte. For example, you could do this:

  dp1.unparse(infoset1, osdos)
  dp2.unparse(infoset2, osdos)

If dp1 and dp2 have different fillByte values, which do we use, if either? Do we just use the fill byte from the last data processor that wrote to this stream (so the fillByte from dp2)? Or is this a special case, and we just always pad with zeros?

2) This now affects alignment. I believe we optimize alignment with the assumption that starting a parse/unparse is always byte aligned. If a parse/unparse can start at any bitPosition based on where the previous parse/unparse left off, that assumption no longer holds.
So this should essentially change our alignment optimizations to say that the root element alignment is unknown rather than known to be at the beginning of data.

Thoughts?
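
For discussion, here is a rough sketch of how option 2 might behave. FragmentCarryingOutput is a hypothetical stand-in for the proposed OutputSourceDataOutputStream, not an existing Daffodil class, and unparseOneBit stands in for one dp.unparse() call on the onebit element. It assumes most-significant-bit-first bit order and takes the padding value as a constructor parameter (defaulting to zeros), which is only one of the possible answers to the fillByte question.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}

// Hypothetical state carrier: holds the fragment byte between unparse calls.
final class FragmentCarryingOutput(os: OutputStream, padByte: Int = 0) {
  private var fragment = 0    // bits accumulated so far, MSB first
  private var fragmentLen = 0 // number of valid bits in fragment (0..7)
  private var closed = false

  def putBit(bit: Int): Unit = {
    require(!closed, "cannot unparse to a closed stream")
    fragment = (fragment << 1) | (bit & 1)
    fragmentLen += 1
    if (fragmentLen == 8) { // a whole byte is available, write it out
      os.write(fragment)
      fragment = 0
      fragmentLen = 0
    }
  }

  // Pads any fragment with the low bits of padByte, writes it, and
  // prevents further use.
  def close(): Unit = if (!closed) {
    if (fragmentLen > 0) {
      val remaining = 8 - fragmentLen
      os.write((fragment << remaining) | (padByte & ((1 << remaining) - 1)))
      fragment = 0
      fragmentLen = 0
    }
    os.flush()
    closed = true
  }
}

object FragmentCarryDemo {
  // Stand-in for dp.unparse(infoset, osdos) with the onebit schema.
  def unparseOneBit(bit: Int, out: FragmentCarryingOutput): Unit = out.putBit(bit)

  def roundTrip(bits: Seq[Int], padByte: Int = 0): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new FragmentCarryingOutput(bytes, padByte)
    bits.foreach(unparseOneBit(_, out)) // state carries across calls
    out.close()                         // only now is a fragment padded
    bytes.toByteArray
  }

  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    println(hex(roundTrip(Seq.fill(8)(1)))) // FF
    println(hex(roundTrip(Seq(1, 1, 1))))   // E0
  }
}
```

With the state carried across calls, eight one-bit unparses produce the original single FF byte, and padding only happens once, at close().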