samredai commented on a change in pull request #4021:
URL: https://github.com/apache/iceberg/pull/4021#discussion_r798082837
##########
File path: python/src/iceberg/io/base.py
##########
@@ -24,7 +24,40 @@
"""
from abc import ABC, abstractmethod
-from typing import Union
+from typing import Protocol, Union, runtime_checkable
+
+
+@runtime_checkable
+class InputStream(Protocol):
+ def read(self, n: int) -> bytes:
+ ...
Review comment:
These are the equivalent of
[SeekableInputStream](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/api/io/seekable_input_stream.py#L19)
and
[PositionOutputStream](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/api/io/position_output_stream.py#L19)
in the legacy python client. It's just less strict here in that any class with
the same structure is automatically a subclass.
I think this highlights the behavior:
```py
from typing import Protocol, runtime_checkable
@runtime_checkable
class A(Protocol):
def foo(self):
...
def bar(self):
...
class B:
def foo(self):
...
def bar(self):
...
class C:
def foo(self):
...
def bar(self):
...
def baz(self):
...
class D:
def foo(self):
...
isinstance(B(), A) # True
isinstance(C(), A) # True
isinstance(D(), A) # False
```
So an input file or output file class could have more methods, as long as it
has the required methods set in the protocol which should be the case for any
file-like object that's based on
[RawIOBase](https://docs.python.org/3/library/io.html#io.RawIOBase) (inherits
from IOBase and adds a `read`, `readall`, `readinto`, and `write` method).
The reason I renamed a few of the methods from the legacy client was also to
be in-line with `RawIOBase`. Here's an example of how urllibs HTTPResponse
object matches this protocol:
```py
import urllib.request
response = urllib.request.urlopen('http://python.org/')
isinstance(response, InputStream) # True
print(dir(response))
```
output:
```
[...
'begin',
'chunk_left',
'chunked',
'close',
'closed',
'code',
'debuglevel',
'detach',
'fileno',
'flush',
'fp',
'getcode',
'getheader',
'getheaders',
'geturl',
'headers',
'info',
'isatty',
'isclosed',
'length',
'msg',
'peek',
'read',
'read1',
'readable',
'readinto',
'readinto1',
'readline',
'readlines',
'reason',
'seek',
'seekable',
'status',
'tell',
'truncate',
'url',
'version',
'will_close',
'writable',
'write',
'writelines']
```
One of the remaining questions though is where do we set this `isinstance`
check, or do we do it at all? The rest of the library will be assuming this
interface so do we leave it to every `InputFile` and `OutputFile`
implementations to do an `isinstance` check? Or do we do the `isinstance` check
in other parts of the library wherever `FileIO.new_input_file(...)` is called,
for example. This feels like it might be a tradeoff to this approach in
addition to lack of python 3.7 support.
(sorry for the super long comment)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]