On Wed, Aug 7, 2019 at 11:12 PM Brian Hulette <bhule...@google.com> wrote:
>
> Thanks for all the suggestions, I've added responses inline.
>
> On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova <chad...@gmail.com> wrote:
>>
>> There’s a lot of ground to cover here, so I’m going to pull from a few 
>> different responses.
>>
>> ________________________________
>>
>> numpy ints
>>
>> A properly written library should accept any type implementing the __int__
>> (or __index__) methods in place of an int, rather than doing explicit type
>> checks
>>
>> Yes, but the reality is that very, very few actually do this, including Beam
>> itself (check the code for Timestamp and Duration, to name a few).
>>
>> Which brings me to my next topic:
>>
>> I tested this out with mypy and it would not be compatible:
>>
>> def square(x: int):
>>     return x*x
>>
>> square(np.int16(32)) # mypy error
>>
>> The proper way to check this scenario is using typing.SupportsInt. Note that 
>> this only guarantees that __int__ exists, so you still need to cast to int 
>> if you want to do anything with the object:
>>
>> def square(x: typing.SupportsInt) -> int:
>>     if not isinstance(x, int):
>>         x = int(x)
>>     return x*x
>>
>> square('foo')  # error!
>> square(1.2)  # ok
>
>  Yep, I came across this while writing my last reply. I agree, though, that
> it seems unlikely that many libraries actually do this.
>
>> ________________________________
>>
>> Native python ints
>>
>> Agreed on float since it seems to trivially map to a double, but I’m torn on 
>> int still. While I do want int type hints to work, it doesn’t seem 
>> appropriate to map it to AtomicType.INT64, since it has a completely 
>> different range of values.
>>
>> Let’s say we used native int for the runtime field type, not just as a 
>> schema declaration for numpy.int64. What is the real-world fallout from 
>> this? Would there be data loss?
>
> I'm not sure I follow the question exactly: what is the interplay between int 
> and numpy.int64 in this scenario? Are you saying that np.int64 is used in the 
> schema declaration, but we just use native int at runtime, and check the bit 
> width when encoding?
>
> In any case, I don't think the real-world fallout of using int is nearly that 
> dire. I suppose data loss is possible if a poorly designed pipeline overflows 
> an int64 and crashes,

The primary risk is that it *won't* crash when overflowing an int64;
it'll just silently give the wrong answer. That's much less safe than
using a native int, which would actually crash if it's too large at
the point one tries to encode it.
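
To make the failure modes concrete (a minimal sketch; encode_int64 is a
hypothetical helper, not Beam's actual coder API):

    import numpy as np

    x = np.int64(2**62)
    print(x * 4)  # 0 -- wraps modulo 2**64 (numpy may warn, but won't raise)

    y = 2**62
    print(y * 4)  # 18446744073709551616 -- a native int just grows

    INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

    def encode_int64(value):
        # With native ints the overflow surfaces here, loudly:
        if not INT64_MIN <= value <= INT64_MAX:
            raise OverflowError('%d does not fit in an int64' % value)
        # ... the actual byte encoding of value would follow ...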

> but that's possible whether we use int or np.int64 at runtime. I'm just 
> saying that a user could be forgiven for thinking that they're safe from 
> overflows if they declare a schema as NamedTuple('Foo', 
> [('to_infinity_and_beyond', int)]), but they shouldn't make the same mistake 
> when they explicitly call it an int64.

Yes. But for schemas to be maximally useful, we'll want to be able to
infer them from all sorts of things that aren't written with Beam in
mind (e.g. external data classes, function annotations), and rejecting
the builtin int type would be a poor user experience here.
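
For example, inference over a type like this should just work (a sketch;
Purchase is a made-up class and schema_from_type a hypothetical entry
point, not an existing Beam function):

    from typing import NamedTuple

    # A data class written with no knowledge of Beam; `int` is the only
    # integer annotation its author would ever have reached for:
    class Purchase(NamedTuple):
        item_id: int
        amount_cents: int

    # schema = schema_from_type(Purchase)  # should accept builtin int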

>> ________________________________
>>
>> Python3-only
>>
>> No need to worry about 2/3 compatibility for strings, we could just use str
>>
>> This is already widely handled throughout the Beam Python SDK using the 
>> future/past library, so it seems silly to give up on this solution for 
>> schemas.
>>
>> On this topic, I added some comments to the PR about using 
>> past.builtins.unicode instead of numpy.unicode. They’re the same type, but 
>> there’s no reason to get this via numpy, when everywhere else in the code 
>> gets it from past.
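>>
>> A quick sanity check (assuming numpy 1.x, where numpy.unicode is still an
>> alias for the builtin type; this snippet is illustrative, not from the PR):
>>
>>     from past.builtins import unicode
>>     import numpy as np
>>
>>     # Holds on both Python 2 (builtin unicode) and Python 3 (str):
>>     assert unicode is np.unicode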
>>
>> We could just use bytes for byte arrays (as a shorthand for 
>> typing.ByteString [1])
>>
>> Neat, but in my obviously very biased opinion it is not worth cutting off 
>> Python 2 users over this.
>
> OK, I won't do this :) I wasn't aware of typing.Sequence; that does seem
> like a good fit. The other two items are just nice-to-haves; I'm happy to
> work around those and use Sequence for arrays instead.
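
For array fields, typing.Sequence reads naturally. A sketch of how such a
declaration might look (not the final schema API):

    from typing import NamedTuple, Sequence
    import numpy as np

    class Scores(NamedTuple):
        user: str                   # STRING
        points: Sequence[np.int32]  # an ARRAY of INT32 values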

I would imagine that we could accept bytes or typing.ByteString for
BYTES, with only Python 2 users having to do the latter. (In both
Python 2 and Python 3 one would use str for STRING; it would decode to
past.builtins.unicode. This seems to capture the intent better than
mapping str to BYTES in Python 2 only.)
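
Under that proposal a declaration might look like this (field names are
illustrative; Python 2 users would need the functional NamedTuple form,
since py2 has no variable annotations):

    from typing import ByteString, NamedTuple

    class Record(NamedTuple):
        payload: bytes  # BYTES; on Python 2 one would write ByteString,
                        # since py2 bytes is just an alias for str
        name: str       # STRING; decodes to past.builtins.unicode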
