[ https://issues.apache.org/jira/browse/ARROW-16351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528589#comment-17528589 ]
Antoine Pitrou commented on ARROW-16351:
----------------------------------------

I think we could quite easily expose {{RandomAccessFile::GetStream}} in Python; that would allow addressing the use case in a supported manner.

> [C++][Python] Implement seek() for BufferedInputStream
> ------------------------------------------------------
>
>                 Key: ARROW-16351
>                 URL: https://issues.apache.org/jira/browse/ARROW-16351
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>    Affects Versions: 7.0.0
>            Reporter: Frank Luan
>            Priority: Major
>
> I would like to use seek() in a buffered input stream for the following usage
> scenario:
>  * Open an S3 file (e.g. 1GB)
>  * Jump to an offset (e.g. skip 500MB)
>  * Do a bunch of small (8-byte) reads
> That way I get the performance of buffered input by avoiding lots of small
> reads (which are expensive and slow when using S3) and can also seek to a position.
> Currently I need to hack it using a mix of RandomAccessFile and
> BufferedInputStream, like
> {{with _fs.open_input_file(url) as f:}}
> {{  f.seek(offset)}}
> {{  f = fs._wrap_input_stream(f, url, None, self._buffer_size)}}
> {{  x = f.read(8)}}
> I'm wondering if there is any fundamental reason why seek is not implemented
> for the buffered input stream? It looks like .NET implements it:
> [https://docs.microsoft.com/en-us/dotnet/api/system.io.bufferedstream.seek?view=net-6.0]
> Or, what I actually need is to open an S3 file at an offset. Would this be
> easier to do, or is it already supported in the current API?

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
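
For illustration only, the access pattern described in the issue (seek once on the random-access file, then serve many small reads from a buffer) can be sketched with Python's standard {{io}} module. This is not pyarrow's API. {{CountingRaw}} and {{open_buffered_at}} are hypothetical names, and an in-memory byte string stands in for the S3 object.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a remote file.

    Each readinto() call represents one expensive request to the store.
    """

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.raw_reads = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        return self._buf.seek(offset, whence)

    def tell(self):
        return self._buf.tell()

    def readinto(self, b):
        self.raw_reads += 1
        return self._buf.readinto(b)


def open_buffered_at(raw, offset, buffer_size=64 * 1024):
    """Seek the raw file first, then wrap it in a buffered reader."""
    raw.seek(offset)
    return io.BufferedReader(raw, buffer_size=buffer_size)


# Assumed sizes, scaled down from the 1GB / 500MB example in the issue.
data = bytes(range(256)) * 4096          # 1 MiB of sample bytes
offset = 512 * 1024                      # skip the first half
raw = CountingRaw(data)

with open_buffered_at(raw, offset) as stream:
    # 1000 small reads, most of them served from the in-memory buffer.
    records = [stream.read(8) for _ in range(1000)]

print(len(records), "records read using", raw.raw_reads, "raw read(s)")
```

The small reads are answered from the buffer, so the number of raw reads stays far below the number of 8-byte reads. Whether Arrow should get the same effect through {{seek()}} on the buffered stream or through an exposed {{GetStream}} is the open question in this issue.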