Hi,

I'm grappling with a design flaw I just uncovered in stream filters, and need some advice on how best to fix it. The problem has existed since the introduction of stream filters, and has 3 parts. Two of them can probably be fixed safely in PHP 5.2+, but I think the third may require an internal redesign of stream filters, and so would probably have to wait for PHP 5.3+, even though it is a clear bugfix (Ilia, your opinion appreciated on this).

The first part of the bug that I encountered is best described here: http://bugs.php.net/bug.php?id=46026. However, the problem goes deeper than that report, as any attempt to cache data is dangerous whenever a stream filter is attached to a stream. I should also note that the patch in this bug contains feature additions that would have to wait for PHP 5.3.

I ran into this problem because I was trying to use stream filters to read a bz2-compressed file within a zip archive in the phar extension. This was failing, and I first tracked the problem down to an attempt by php_stream_filter_append to read in a bunch of data and cache it, which passed more data into the bz2 decompress filter than it could handle, making it barf. After fixing this problem, I ran into the problem described in the bug above, because php_stream_fill_read_buffer does the same thing on read: I requested 176 decompressed bytes, and so php_stream_read passed 176 bytes in to the decompress filter. Only 144 of those bytes were actually bz2-compressed data, and so the filter barfed upon trying to decompress the remaining data (same as bug #46026, found differently).

You can probably tell from my explanation that this is an extraordinarily complex problem. There are 3 inter-related problems here:

1) the bz2 (and zlib) stream filters should stop trying to decompress when they reach the end of the compressed stream, regardless of how many bytes they are told to decompress (easy to fix)
2) it is never safe to cache read data when a read stream filter is appended, as there is no safe way to determine in advance how much of the stream can be safely filtered (would be easy to fix if it weren't for #3)
3) there is no clear way to request that a certain number of filtered bytes be returned from a stream, versus how many unfiltered bytes should be passed into the stream (very hard to fix without design change)

I need some advice on #3 from the original designers of stream filters and streams, as well as any experts who have dealt with this kind of problem in other contexts.

In this situation, should we expect stream filters to always stop filtering if they reach the end of valid input? Even then, the filter may produce a different number of bytes than was requested. A clear example would be if we requested only 170 bytes. 144 of those bytes would be passed in as the complete compressed data, and bzip2.decompress would decompress all of it to 176 bytes. 170 of those bytes would be returned from php_stream_read, and 6 would have to be placed in a cache for future reads. Thus, there would need to be some way of marking the cache as valid because of this logic path:

<?php
$a = fopen('blah.zip', 'r');
fseek($a, 132); // fills read buffer with unfiltered data
stream_filter_append($a, 'bzip2.decompress'); // clears read buffer cache
$b = fread($a, 170); // fills read buffer cache with 6 bytes
fseek($a, 3, SEEK_CUR); // this should seek within the filtered data read buffer cache
stream_filter_append($a, 'zlib.inflate');
?>

The question is: what should happen when we append the second filter, 'zlib.inflate', to filter the already-filtered data? If we clear the read buffer as we did in the first case, it will result in lost data. So, let's assume we preserve the read buffer. Then, if we perform:

<?php
$c = fread($a, 7);
?>

and assume the remaining 3 bytes expand to 8 bytes, how should the read buffer cache be handled? Should the first 3 bytes still be the filtered bzip2 decompressed data, and the last 3 replaced with the 8 bytes of decompressed zlib data?

Basically, I am wondering if perhaps we need to implement a read buffer cache for each stream filter. This could solve our problem, I think.

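To make the per-filter cache idea a bit more concrete, here is a rough userland sketch. It is only a toy model under loud assumptions: the class names (ByteSource, RawSource, FilterLayer) are made up for illustration, a byte-doubling closure stands in for bzip2.decompress and strtoupper stands in for zlib.inflate, and none of it reflects how the C streams layer is actually written. The one point it tries to show is that when each layer owns its read buffer, stacking a second filter on top does not have to discard the bytes the first filter has already produced.

```php
<?php
// Toy model only: one read buffer per layer. Every name here is hypothetical.

interface ByteSource
{
    public function read(int $length): string;
}

// Stands in for the unfiltered stream.
final class RawSource implements ByteSource
{
    private $data;
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function read(int $length): string
    {
        $out = (string) substr($this->data, $this->pos, $length);
        $this->pos += strlen($out);
        return $out;
    }
}

// A filter that keeps its own cache of filtered bytes plus a read position.
final class FilterLayer implements ByteSource
{
    private $source;
    private $transform;
    private $chunkSize;
    private $buffer = '';
    private $pos = 0;

    public function __construct(ByteSource $source, callable $transform, int $chunkSize)
    {
        $this->source = $source;
        $this->transform = $transform;
        $this->chunkSize = $chunkSize;
    }

    public function read(int $length): string
    {
        // Refill from the layer below until enough filtered bytes are cached
        // or the layer below has nothing left.
        while (strlen($this->buffer) - $this->pos < $length) {
            $chunk = $this->source->read($this->chunkSize);
            if ($chunk === '') {
                break;
            }
            $this->buffer .= ($this->transform)($chunk);
        }
        $out = (string) substr($this->buffer, $this->pos, $length);
        $this->pos += strlen($out);
        return $out;
    }

    // Filtered bytes already produced by this layer but not yet read.
    public function pending(): string
    {
        return (string) substr($this->buffer, $this->pos);
    }
}

// Expanding toy filter: "abc" becomes "aabbcc".
$doubleEachByte = function (string $chunk): string {
    $out = '';
    for ($i = 0, $n = strlen($chunk); $i < $n; $i++) {
        $out .= $chunk[$i] . $chunk[$i];
    }
    return $out;
};

$raw    = new RawSource('abcdef');
$double = new FilterLayer($raw, $doubleEachByte, 3);

$first            = $double->read(5);   // "aabbc"
$cachedAfterFirst = $double->pending(); // "c" is left over in this layer's cache

// Stack a second filter on top without clearing the first layer's cache.
$upper  = new FilterLayer($double, 'strtoupper', 3);
$second = $upper->read(4);              // "CDDE"

echo $first, "\n", $cachedAfterFirst, "\n", $second, "\n";
```

Reading 5 filtered bytes forces the doubling layer to produce 6, so one byte stays in its cache, and the second layer picks that byte up first instead of losing it. The sketch ignores seeking, filter removal and buffer trimming entirely.
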
The data would be stored like so:

stream: 170 bytes of unfiltered data, and a pointer to byte 145 as the next byte for php_stream_read()
bzip2.decompress filter: 176 bytes of decompressed bzip2 data, and a pointer to byte 171 as the next byte for php_stream_read()
zlib.inflate filter: 8 bytes of decompressed zlib data, and a pointer to byte 8 as the next byte for php_stream_read()

This way, we would essentially have a stack of stream data. If the zlib filter were then removed, we could "back up" to the bzip2 filter and so on. This would allow proper read cache filling, and remove the weird ambiguities that are apparent in a filtered stream.

I don't think we would need to worry about backwards compatibility here, as the most common use case would be unaffected by this change, and the use case it would fix has never actually worked. I haven't got a patch for this yet, but it would be easy to do if the logic is sound.

Thanks,
Greg

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php