[jira] [Commented] (ARROW-16351) [C++][Python] Implement seek() for BufferedInputStream

2022-04-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528589#comment-17528589
 ] 

Antoine Pitrou commented on ARROW-16351:


I think we could quite easily expose {{RandomAccessFile::GetStream}} in Python. 
That would address this use case in a supported manner.
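
To illustrate the idea (not Arrow's actual API), here is a minimal sketch using only the Python standard library. {{SegmentStream}} is a hypothetical stand-in for the kind of stream a GetStream-style call could return: a read-only view of a byte range that does not depend on the file's current position. Wrapping it in {{io.BufferedReader}} then serves many small reads from a single larger read. The in-memory {{backing}} object and all sizes are assumptions chosen for the example.

```python
import io


class SegmentStream(io.RawIOBase):
    """Read-only view of `nbytes` bytes of `raw`, starting at `offset`.

    Hypothetical helper for illustration; not part of pyarrow.
    """

    def __init__(self, raw, offset, nbytes):
        self._raw = raw
        self._pos = offset
        self._end = offset + nbytes
        self.raw_reads = 0  # reads issued against the underlying file

    def readable(self):
        return True

    def readinto(self, b):
        n = min(len(b), self._end - self._pos)
        if n <= 0:
            return 0  # end of segment
        # Reposition explicitly on every read so the segment is
        # independent of the underlying file's current position.
        self._raw.seek(self._pos)
        data = self._raw.read(n)
        self.raw_reads += 1
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)


# In-memory stand-in for a large remote object (16 KiB).
backing = io.BytesIO(bytes(range(256)) * 64)

# "Open with an offset": skip the first 8 KiB, expose the next 4 KiB.
segment = SegmentStream(backing, offset=8192, nbytes=4096)
buffered = io.BufferedReader(segment, buffer_size=1024)

# Many small reads; the buffer absorbs them.
values = [buffered.read(8) for _ in range(100)]
print(len(values), "small reads,", segment.raw_reads, "underlying read(s)")
```

With a remote filesystem, each underlying read would be a network request, so keeping that count low is the point of the buffering.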


> [C++][Python] Implement seek() for BufferedInputStream
> --
>
> Key: ARROW-16351
> URL: https://issues.apache.org/jira/browse/ARROW-16351
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 7.0.0
> Reporter: Frank Luan
> Priority: Major
>
> I would like to use seek() on a buffered input stream for the following usage 
> scenario:
>  * Open an S3 file (e.g. 1GB)
>  * Jump to an offset (e.g. skip 500MB)
>  * Do a bunch of small (8-byte) reads
> This way I get the performance of buffered input, avoiding lots of small 
> reads (which are expensive and slow on S3), while still being able to seek 
> to a position.
> Currently I need to hack it using a mix of RandomAccessFile and 
> BufferedInputStream, like this:
> {{with _fs.open_input_file(url) as f:}}
> {{    f.seek(offset)}}
> {{    f = fs._wrap_input_stream(f, url, None, self._buffer_size)}}
> {{    x = f.read(8)}}
> I'm wondering if there is any fundamental reason why seek is not implemented 
> for the buffered input stream? It looks like .NET implements it: 
> [https://docs.microsoft.com/en-us/dotnet/api/system.io.bufferedstream.seek?view=net-6.0]
> Alternatively, what I actually need is to open an S3 file at an offset. Would 
> this be easier to do, or is it already supported in the current API?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16351) [C++][Python] Implement seek() for BufferedInputStream

2022-04-26 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528489#comment-17528489
 ] 

Yibo Cai commented on ARROW-16351:
--

BufferedInputStream wraps an InputStream, which implements only the Readable 
interface, not Seekable. In general, I think that's reasonable, as 
BufferedInputStream is only suitable for sequential reads, not random access.
cc [~apitrou]
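
As an analogy only (this is Python's standard {{io}} module, not Arrow's classes), the same constraint can be seen with {{io.BufferedReader}}: the buffered wrapper can seek only if the stream it wraps can. {{ReadOnlySource}} below is a hypothetical sequential source that implements reading but not seeking, and the payload and buffer size are assumptions for the example.

```python
import io


class ReadOnlySource(io.RawIOBase):
    """Sequential source: readable but not seekable (hypothetical)."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


payload = bytes(range(256))

# Buffering over a read-only source: sequential reads work, seek does not.
sequential = io.BufferedReader(ReadOnlySource(payload), buffer_size=64)
first = sequential.read(8)
try:
    sequential.seek(128)
    seek_supported = True
except io.UnsupportedOperation:
    seek_supported = False

# Buffering over a seekable source: seek repositions the underlying
# stream and subsequent reads continue from the new position.
random_access = io.BufferedReader(io.BytesIO(payload), buffer_size=64)
random_access.seek(128)
after_seek = random_access.read(8)

print(seek_supported, after_seek == payload[128:136])
```

The design question in the issue is therefore less about the buffering layer itself and more about which interface the wrapped stream exposes.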
