samredai commented on a change in pull request #4021: URL: https://github.com/apache/iceberg/pull/4021#discussion_r798082837
########## File path: python/src/iceberg/io/base.py ########## @@ -24,7 +24,40 @@ """ from abc import ABC, abstractmethod -from typing import Union +from typing import Protocol, Union, runtime_checkable + + +@runtime_checkable +class InputStream(Protocol): + def read(self, n: int) -> bytes: + ... Review comment: These are the equivalent of [SeekableInputStream](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/api/io/seekable_input_stream.py#L19) and [PositionOutputStream](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/api/io/position_output_stream.py#L19) in the legacy python client. It's just less strict here in that any class with the same structure is automatically a subclass. I think this highlights the behavior: ```py from typing import Protocol, runtime_checkable @runtime_checkable class A(Protocol): def foo(self): ... def bar(self): ... class B: def foo(self): ... def bar(self): ... class C: def foo(self): ... def bar(self): ... def baz(self): ... class D: def foo(self): ... isinstance(B(), A) # True isinstance(C(), A) # True isinstance(D(), A) # False ``` So an input file or output file class could have more methods, as long as it has the required methods set in the protocol which should be the case for any file-like object that's based on [RawIOBase](https://docs.python.org/3/library/io.html#io.RawIOBase) (inherits from IOBase and adds a `read`, `readall`, `readinto`, and `write` method). The reason I renamed a few of the methods from the legacy client was also to be in-line with `RawIOBase`. Here's an example of how urllibs HTTPResponse object matches this protocol: ```py import urllib.request response = urllib.request.urlopen('http://python.org/') isinstance(response, InputStream) # True print(dir(response)) ``` output: ``` [... 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines'] ``` One of the remaining questions though is where do we set this `isinstance` check, or do we do it at all? The rest of the library will be assuming this interface so do we leave it to every `InputFile` and `OutputFile` implementations to do an `isinstance` check? Or do we do the `isinstance` check in other parts of the library wherever `FileIO.new_input_file(...)` is called, for example. This feels like it might be a tradeoff to this approach in addition to lack of python 3.7 support. (sorry for the super long comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org