Hi,

I'm grappling with a design flaw I just uncovered in stream filters, and
need some advice on how best to fix it.  The problem has existed since
the introduction of stream filters, and has three parts.  Two of them can
probably be fixed safely in PHP 5.2+, but I think the third may require
an internal redesign of stream filters, and so would probably have to
wait for PHP 5.3+, even though it is a clear bugfix (Ilia, your opinion
on this would be appreciated).

The first part of the bug that I encountered is best described here:
http://bugs.php.net/bug.php?id=46026.  However, the problem runs deeper
than this, as any attempt to cache data is dangerous whenever a stream
filter is attached to a stream.  I should also note that the patch in
this bug contains feature additions that would have to wait for PHP 5.3.

I ran into this problem because I was trying to use stream filters to
read in a bz2-compressed file within a zip archive in the phar
extension.  This was failing, and I first tracked the problem down to
php_stream_filter_append reading in a chunk of data and caching it,
which passed more data into the bz2 decompress filter than it could
handle, making it barf.  After fixing that, I ran into the problem
described in the bug above, because php_stream_fill_read_buffer does the
same thing on read.  I requested 176 decompressed bytes, and so
php_stream_read passed 176 bytes into the decompress filter.  Only 144
of those bytes were actually bz2-compressed data, and so the filter
barfed upon trying to decompress the remaining data (same as bug #46026,
found differently).

You can probably tell from my explanation that this is an
extraordinarily complex problem.  There are three interrelated problems
here:

1) the bz2 (and zlib) stream filters should stop trying to decompress
when they reach the stream end, regardless of how many bytes they are
told to decompress (easy to fix)
2) it is never safe to cache read data when a read stream filter is
appended, as there is no safe way to determine in advance how much of
the stream can be safely filtered (would be easy to fix if it weren't
for #3)
3) there is no clear way to request that a certain number of filtered
bytes be returned from a stream, as opposed to how many unfiltered bytes
should be passed into the filter (very hard to fix without design
change)
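
To make #1 concrete, here is a minimal sketch of the behaviour I have in
mind.  It uses a made-up toy format rather than bz2 or zlib, and
toy_decode() is a hypothetical helper, not anything in the streams API.
The point is only that the decoder stops at its own end-of-stream marker
and reports how much input it consumed, no matter how much it was handed:

```php
<?php
// Toy format (hypothetical, for illustration only): pairs of
// <count byte><value byte>, terminated by a single "\x00" end marker.
// Anything after the marker is NOT part of the compressed stream.

/**
 * Decode as much of $input as belongs to the toy stream.
 * Returns [decoded output, input bytes consumed, end marker reached?].
 */
function toy_decode(string $input): array
{
    $out = '';
    $pos = 0;
    $len = strlen($input);

    while ($pos < $len) {
        $count = ord($input[$pos]);
        if ($count === 0) {
            // End-of-stream marker: consume it and stop, leaving the rest alone.
            return [$out, $pos + 1, true];
        }
        if ($pos + 1 >= $len) {
            // Incomplete pair: wait for more input instead of failing.
            break;
        }
        $out .= str_repeat($input[$pos + 1], $count);
        $pos += 2;
    }

    return [$out, $pos, false];
}

// Three "a", two "b", the end marker, then unrelated trailing bytes.
$chunk = "\x03a\x02b\x00TRAILING";
[$decoded, $consumed, $finished] = toy_decode($chunk);

echo $decoded, "\n";   // aaabb
echo $consumed, "\n";  // 5
var_dump($finished);   // bool(true)
```

The caller can then treat everything from offset $consumed onwards as
unfiltered data that still belongs to the underlying stream.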

I need some advice on #3 from the original designers of stream filters
and streams, as well as any experts who have dealt with this kind of
problem in other contexts.  In this situation, should we expect stream
filters to always stop filtering if they reach the end of valid input?
Even then, the filter may produce more data than was requested.  A clear
example would be if we requested only 170 bytes.  144 of those bytes
would be passed in as the complete compressed data, and
bzip2.decompress would decompress all of it to 176 bytes.  170 of those
bytes would be returned from php_stream_read, and 6 would have to be
placed in a cache for future reads.  Thus, there would need to be some
way of marking the cache as valid because of this logic path:

<?php
$a = fopen('blah.zip', 'rb'); // fopen() requires a mode argument
fseek($a, 132); // fills read buffer with unfiltered data
stream_filter_append($a, 'bzip2.decompress'); // clears read buffer cache
$b = fread($a, 170); // fills read buffer cache with 6 bytes
// this should seek within the filtered data read buffer cache
fseek($a, 3, SEEK_CUR);
stream_filter_append($a, 'zlib.inflate');
?>
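
The byte counts in that path can be checked with a small stand-alone
model.  This is only a sketch: the variable names are mine, the filter
output is a dummy string of the right length, and nothing here touches
the real streams layer.

```php
<?php
// Sizes taken from the example above; everything else is hypothetical.
$compressedSize   = 144; // bytes of real bzip2 data in the stream
$decompressedSize = 176; // bytes the filter produces from them
$requested        = 170; // bytes asked for by fread()

// Stand-in for the filter output: 176 bytes of filtered data.
$filtered = str_repeat('x', $decompressedSize);

// What the caller receives, and what must be kept for the next read.
$returned = substr($filtered, 0, $requested);
$leftover = substr($filtered, $requested);

printf(
    "in=%d returned=%d cached=%d\n",
    $compressedSize,
    strlen($returned),
    strlen($leftover)
); // in=144 returned=170 cached=6

// fseek($a, 3, SEEK_CUR) can only be served from the cache if the cache
// is known to hold *filtered* bytes, hence the need for a validity flag.
$cacheHoldsFilteredData = true;
$seek = 3;
$canSeekInCache = $cacheHoldsFilteredData && $seek <= strlen($leftover);
$remainingAfterSeek = strlen($leftover) - $seek; // 3 bytes left in the cache
```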

The question is what should happen when we append the second filter
'zlib.inflate' to filter the filtered data?  If we clear the read buffer
as we did in the first case, it will result in lost data.  So, let's
assume we preserve the read buffer.  Then, if we perform:

<?php
$c = fread($a, 7);
?>

and assume the remaining 3 bytes expand to 8 bytes, how should the read
buffer cache be handled?  Should the first 3 bytes still be the filtered
bzip2 decompressed data, and the last 3 replaced with the 8 bytes of
decompressed zlib data?

Basically, I am wondering if perhaps we need to implement a read buffer
cache for each stream filter.  This could solve our problem, I think. 
The data would be stored like so:

stream: 170 bytes of unfiltered data, and a pointer to byte 145 as the
next byte for php_stream_read()
bzip2.decompress filter: 176 bytes of decompressed bzip2 data, and a
pointer to byte 171 as the next byte for php_stream_read()
zlib.inflate filter: 8 bytes of decompressed zlib data, and a pointer to
byte 8 as the next byte for php_stream_read()

This way, we would essentially have a stack of stream data.  If the zlib
filter were then removed, we could "back up" to the bzip2 filter and so
on.  This will allow proper read cache filling, and remove the weird
ambiguities that are apparent in a filtered stream.  I don't think we
would need to worry about backwards compatibility here, as the most
common use case would be unaffected by this change, and the use case it
would fix has never actually worked.
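
As a rough illustration of the stack idea, here is a userland toy model.
All of the names (ReadLayer, LayeredStream, push, pop) are hypothetical,
the "filters" are trivial string callbacks rather than real decompressors,
and push() is simply told how many unread bytes of the layer below to
consume, which is an assumption made to keep the sketch short.  The real
change would of course live in the C streams layer.

```php
<?php
// Hypothetical model: one read buffer plus read position per layer.
final class ReadLayer
{
    public string $name;
    public string $buffer;
    public int $pos = 0; // offset of the next unread byte in $buffer

    public function __construct(string $name, string $buffer = '')
    {
        $this->name = $name;
        $this->buffer = $buffer;
    }

    public function read(int $n): string
    {
        $chunk = substr($this->buffer, $this->pos, $n);
        $this->pos += strlen($chunk);
        return $chunk;
    }

    public function remaining(): int
    {
        return strlen($this->buffer) - $this->pos;
    }
}

final class LayeredStream
{
    /** @var ReadLayer[] raw stream first, newest filter last */
    private array $layers;

    public function __construct(string $raw)
    {
        $this->layers = [new ReadLayer('stream', $raw)];
    }

    /** Append a filter: consume $n unread bytes below, buffer the result. */
    public function push(string $name, callable $filter, int $n): void
    {
        $below = end($this->layers);
        $this->layers[] = new ReadLayer($name, $filter($below->read($n)));
    }

    /** Remove the newest filter; the layer below keeps its own position. */
    public function pop(): void
    {
        if (count($this->layers) > 1) {
            array_pop($this->layers);
        }
    }

    public function read(int $n): string
    {
        return end($this->layers)->read($n);
    }

    public function top(): ReadLayer
    {
        return end($this->layers);
    }
}

// Toy "expanding" filter: doubles every byte it is given.
$double = fn(string $in): string =>
    implode('', array_map(fn($c) => $c . $c, str_split($in)));

$s = new LayeredStream('abcdefgh');
$s->push('double', $double, 4);      // consumes "abcd", buffers "aabbccdd"
echo $s->read(5), "\n";              // aabbc
$s->push('upper', 'strtoupper', 3);  // consumes the unread "cdd"
echo $s->read(3), "\n";              // CDD
$s->pop();                           // back to 'double', nothing unread
$s->pop();                           // back to the raw stream
echo $s->read(4), "\n";              // efgh
```

Each layer only ever holds data in one state (raw, once-filtered,
twice-filtered), so appending or removing a filter never has to guess
what the bytes in a shared cache mean.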

I haven't got a patch for this yet, but it would be easy to do if the
logic is sound.

Thanks,
Greg


