On 4/17/21 8:40 AM, Yi-yo Chiang wrote:
> On Sat, Apr 17, 2021 at 7:32 PM Rob Landley <r...@landley.net> wrote:
> > On 4/17/21 4:43 AM, Yi-yo Chiang wrote:
> > > On Sat, Apr 17, 2021 at 2:56 PM Rob Landley <r...@landley.net> wrote:
> > > > On 4/16/21 1:44 PM, Yi-yo Chiang wrote:
> > > > > I'm not sure what Elliot's goal is? I assume he's trying to extract a
> > > > > concatenated ramdisk, and I still see a problem in the current
> > > > > solution.
> > > > >
> > > > > The buffer-format
> > > > > (https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt)
> > > > > says:
> > > > >
> > > > >   initramfs := ("\0" | cpio_archive | cpio_gzip_archive)*
> > > > >
> > > > > In other words, both `cat a.cpio b.cpio >merged.cpio` and
> > > > > `(cat a.cpio && echo -n -e '\0\0\0' && cat b.cpio) >merged.cpio`
> > > > > are valid initramfs.
> > > >
> > > > It also implies that two compressed files can be concatenated and
> > > > separated by arbitrary runs of nulls, or you can have a compressed file
> > > > and a non-compressed file concatenated, or...
> > >
> > > Correct. Upon further inspection, it's actually "arbitrary NULLs could
> > > prepend a GZIP(cpio_archive)",
> >
> > I'm not currently handling that case, and I'm not sure where is the right
> > place to handle it? (Should gzip handle it, or should cpio call out to
> > gzip?)
> >
> > And then you have to care that the _compressor_ stops gracefully at the end
> > of its compressed data isn't reading/discarding extra from its input...
>
> I just read more into the kernel initramfs.c and decompressor_*.c, and seems
> like even the kernel doesn't handle this all that well.
> For example, the gzip decompressor (inflate) stops gracefully at the end of
> compressed data, but lz4 decompressor doesn't and errors when there is data
> past the end of compressed data.
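
For illustration only, the uncompressed half of the buffer-format grammar
quoted above (`("\0" | cpio_archive)*`) can be sketched in Python as below.
This is not toybox's cpio code or the kernel's parser. The names `make_newc`
and `iter_concatenated_newc` are hypothetical. The sketch assumes the "newc"
(070701) header layout and measures 4-byte padding from the start of each
archive, which is a simplification that a real parser may not share.

```python
MAGIC = b"070701"          # "newc" cpio magic
HEADER_LEN = 110           # 6-byte magic + 13 fields of 8 hex digits


def pad4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def make_newc(files):
    """Build a tiny newc archive from {name: bytes} (helper for the demo)."""
    out = bytearray()
    for name, data in list(files.items()) + [("TRAILER!!!", b"")]:
        name_z = name.encode() + b"\0"
        # ino, mode, uid, gid, nlink, mtime, filesize,
        # devmajor, devminor, rdevmajor, rdevminor, namesize, check
        fields = [0, 0o100644, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_z), 0]
        out += MAGIC + b"".join(b"%08X" % f for f in fields)
        out += name_z
        out += b"\0" * (pad4(len(out)) - len(out))
        out += data
        out += b"\0" * (pad4(len(out)) - len(out))
    return bytes(out)


def iter_concatenated_newc(blob):
    """Yield (name, data) for each entry in NUL-separated newc archives."""
    pos = 0
    while pos < len(blob):
        if blob[pos] == 0:             # the "\0" alternative in the grammar
            pos += 1
            continue
        base = pos                     # assumption: padding is relative to here
        while True:
            header = blob[pos:pos + HEADER_LEN]
            if len(header) < HEADER_LEN or header[:6] != MAGIC:
                raise ValueError("bad cpio header at offset %d" % pos)
            fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
            filesize, namesize = fields[6], fields[11]
            name_start = pos + HEADER_LEN
            name = blob[name_start:name_start + namesize - 1].decode()
            data_start = base + pad4(name_start + namesize - base)
            data = blob[data_start:data_start + filesize]
            pos = base + pad4(data_start + filesize - base)
            if name == "TRAILER!!!":   # end of this archive, look for the next
                break
            yield name, data
```

Feeding it `make_newc({"a.txt": b"hello"}) + b"\0\0\0" + make_newc({"b.txt": b"world!"})`
walks both archives in order, which is the `cat a.cpio ... b.cpio` case from
the quoted message.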

It's possible to make this work right, but not _easy_ to do so, because of the
read buffer issues.

> Back to the original question, I think handling concatenated uncompressed
> cpio is good enough.

In theory that's in now.

> > I can teach my cpio to call out to decompressors, but this is new design
> > that needs to be thought through. Does it automatically do it, is there a
> > new flag? Is this decompression side only and the compression side still
> > needs its output piped?
>
> I highly doubt zcat support initramfs-style concatenated .gz.

I wrote my own lib/deflate.c from scratch (keep meaning to finish the
compressor side but my todo list runneth over and toybox is not my day job), so
I'm pretty sure I can make it handle multiple concatenated files. And I vaguely
recall that zlib's version of handling non-compressed data was to send it
through to the output verbatim. (Which means if you have a tarball containing a
gzip file things could get ugly.)

> AFAICT, in order to deal with
> "(cat a.cpio.gz && echo -n -e '\0\0' && cat b.cpio.gz)>initramfs.img",
> right now we need to use tools such as binwalk && dd to slice the
> initramfs.img into its individual components, and then pipe the sliced
> chunks into zcat, lz4cat ... whatever-cat. It sure sounds useful for cpio to
> have an option or flag (like tar) to let it auto detect the compression
> method and call the compression library.

Toybox tar doesn't use compression libraries for this, it forks another process
and feeds data through a pipe. (Which gets us SMP automatically, and means we
can use compression types we don't internally implement.)

That said, I could easily add a --showsize option that prints the number of
bytes of input consumed to the three compressors toybox implements. (Not that
this is useful because I don't _require_ using the toybox
compressors/decompressors, for interoperability reasons, so can't depend on a
feature I'd add.)
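
To show what "number of bytes of input consumed" buys you, here is a minimal
Python sketch that splits NUL-separated gzip members in process. It uses
Python's standard zlib module, not toybox's lib/deflate.c, and the function
name `split_gzip_members` is hypothetical. The decompressor is handed
everything that remains, and whatever it did not need comes back in
`unused_data`, which gives the byte count and keeps the trailing data.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def split_gzip_members(blob):
    """Return [(decompressed_bytes, bytes_consumed), ...] for each member.

    Runs of NUL bytes before or between members are skipped.
    """
    results = []
    pos = 0
    while pos < len(blob):
        if blob[pos] == 0:                    # filler between members
            pos += 1
            continue
        if blob[pos:pos + 2] != GZIP_MAGIC:
            raise ValueError("not a gzip member at offset %d" % pos)
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        out = d.decompress(blob[pos:])        # hand over everything that is left
        if not d.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        # Anything past the end of this member is returned untouched.
        consumed = len(blob) - pos - len(d.unused_data)
        results.append((out, consumed))
        pos += consumed
    return results
```

For `gzip.compress(b"first") + b"\0\0" + gzip.compress(b"second")` this
returns both payloads, each with the length of its own compressed member.
This bookkeeping is not available when the decompressor is a separate
process reading from a pipe.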

The problem isn't figuring out where the data _starts_, it's figuring out where
it _ends_.

Long ago I had a design for a parallel bzip2 decompressor that would search
ahead in the code for the next bzip2 start of block signature and dispatch each
chunk to a thread pool (and then only keep the results when the previous block
said "we ended here" and that was one of the starting points; ones that got
bypassed because of a false positive in the middle of a block would just have
their output discarded). But at this point bzip2 is obsolete enough I'm
unlikely to bother even when I make it that far down on my todo list.

That said, it shouldn't be too hard to do something similar for gzip? (The
start of block has to be byte aligned.) And _specifically_ this would be
"figure out where the next compression start is, keep the last X kilobytes of
data we fed to the decompressor pipe, and start the next one at the last
compressor start signature match near the ending point where the decompressor
gave up". Which still doesn't help you with _decompressed_ data between
compressed runs...

What might be useful is to special case gzip, and handle that with the internal
deflate code, which can accurately measure the number of bytes consumed and
preserve any data after that. And then just say that's the ONLY compression
type that concatenation works for.

I could also just teach gzip that you can concatenate .gz files, and say we
support concatenating file.cpio.gz or file.cpio but don't mix them.

Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net