>
> If it is not a big deal supporting both sounds good. I was actually
> referring to your comment about typing.Collection not being available on
> python 2.
>
Oh, of course, sorry, somehow that completely slipped my mind. Now that I'm
actually thinking it through, you're right: there are several things that
could be cleaner if we make it a python 3 only feature:
- We could use `typing.Collection`
- No need to worry about 2/3 compatibility for strings, we could just use
`str`
- We could just use `bytes` for byte arrays (as a shorthand for
`typing.ByteString` [1])
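To illustrate, here's a rough sketch of what a schema could look like under
those three points if we assume python 3 (the field names are just made up
for the example):

```python
from typing import Collection, NamedTuple, Optional

import numpy as np

class Movie(NamedTuple):
    name: str                 # just `str`, no 2/3 compatibility concern
    tags: Collection[str]     # typing.Collection, available on python 3
    cover_art: bytes          # shorthand for typing.ByteString
    year: Optional[np.int16]  # numpy type for a sized int

m = Movie(name='Metropolis', tags=['sci-fi'],
          cover_art=b'...', year=np.int16(1927))
print(m.year)  # 1927
```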
If we do make it a python 3 only feature, how would it be enforced? Just by
documentation, or could I add a version check? I can't find any precedent
for entire features that reject py2 currently.
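For the version check option, I imagine something like this at module
import time (just a sketch, I'm not sure where it would actually live):

```python
import sys

# Hypothetical guard at import time of a python 3 only feature.
if sys.version_info[0] < 3:
    raise RuntimeError(
        'This feature requires Python 3 (running %d.%d)'
        % sys.version_info[:2])
```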
> What happens when one does np.int16(107) + 1, is the numpy type preserved?
Yes, it looks like once you're in numpy land you stay there, unless you
explicitly leave with a call to int(). Python ints in an arithmetic
expression like np.int16(107) + 1 are implicitly converted to a reasonable
numpy type (np.int64 if bit_length() < 64, else np.float64).
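A quick illustration (I'll only show that the result stays a numpy type,
since the exact result dtype can vary between numpy versions):

```python
import numpy as np

x = np.int16(107) + 1
print(isinstance(x, np.integer))  # True - still a numpy scalar
print(type(int(x)))               # <class 'int'> - int() leaves numpy land
```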
> Another concern I have is, if there is a user function or a library that
> user does not control, that uses typing to indicate that a function accepts
> a type of int, would it be compatible with numpy types?
That's a good point. I tested this out with mypy and it would not be
compatible:
def square(x: int):
    return x * x

square(np.int16(32))  # mypy error
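For what it's worth, the call works fine at runtime - the error comes only
from the static checker - and an explicit cast silences it:

```python
import numpy as np

def square(x: int):
    return x * x

# mypy flags square(np.int16(32)), but casting satisfies the checker:
result = square(int(np.int16(32)))
print(result)  # 1024
```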
Users would have to cast to int to make this work as Chad pointed out. I
was curious about the actual performance cost of these conversions, so I
wrote up a script to benchmark them [2]. The results from running on python
2.7.16 on my desktop:
pass: 6.117 ns/op
int to int: 71.524 ns/op
np.int8 to int: 89.784 ns/op
int to np.int8: 89.784 ns/op
np.int8 to np.int8: 89.784 ns/op
np.int16 to int: 86.715 ns/op
int to np.int16: 86.715 ns/op
np.int16 to np.int16: 86.715 ns/op
np.int32 to int: 89.172 ns/op
int to np.int32: 89.172 ns/op
np.int32 to np.int32: 89.172 ns/op
np.int64 to int: 88.072 ns/op
int to np.int64: 88.072 ns/op
np.int64 to np.int64: 88.072 ns/op
It's interesting to note you don't pay much of a penalty for converting
to/from numpy types over just casting something that's already an int.
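For reference, the core of the benchmark script is just timeit over each
conversion, roughly like this (a simplified sketch of [2]; absolute
numbers will of course vary by machine):

```python
import timeit

N = 100000
CASES = [
    ('pass',            'x',           'x = 107'),
    ('int to int',      'int(x)',      'x = 107'),
    ('np.int16 to int', 'int(x)',      'import numpy as np; x = np.int16(107)'),
    ('int to np.int16', 'np.int16(x)', 'import numpy as np; x = 107'),
]

for label, stmt, setup in CASES:
    # Average nanoseconds per operation over N runs.
    ns_per_op = timeit.timeit(stmt, setup=setup, number=N) / N * 1e9
    print('%s: %.3f ns/op' % (label, ns_per_op))
```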
[1] https://docs.python.org/3/library/typing.html#typing.ByteString
[2] https://gist.github.com/TheNeuralBit/158dff3fa90dc46a369bb014e913650d
On Fri, Aug 2, 2019 at 6:43 PM Ahmet Altay <[email protected]> wrote:
> To clarify, I am happy to start with implementation and iterating on it. I
> do not want to block this late into the discussion.
>
> On Fri, Aug 2, 2019 at 6:03 PM Brian Hulette <[email protected]> wrote:
>
>> I meant "or sub-class it and define fields with type annotations" not
>> "with attributes". I believe that version doesn't work in python 2 since it
>> doesn't support the `name: type` syntax.
>>
>
> If it is not a big deal supporting both sounds good. I was actually
> referring to your comment about typing.Collection not being available on
> python 2.
>
>
>>
>> On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <[email protected]> wrote:
>>
>>> > Do we need to support python 2? If supporting python 2 will complicate
>>> things, we could make this a python3 only feature.
>>> I don't think supporting python 2 complicates things. It's just that
>>> there are two different ways to use typing.NamedTuple in python 3 - you can
>>> either instantiate it and provide a list of (name, type) pairs, or
>>> sub-class it and define fields with attributes. But in python 2 only the
>>> former works.
>>>
>>> > Why are we mapping to numpy types? Design document suggests mapping to
>>> python native types as the plan.
>>> We did discuss using numpy types in a comment [1], but you're right we
>>> never resolved it and the doc still lists native types. My biggest concern
>>> with just using native int/float types is I think we definitely need *some*
>>> way to distinguish between the schema proto's various int/float sizes in
>>> the python representation. If we don't we would need to either a) reject
>>> schemas that contain any size other than the one that we support, or b) no
>>> longer have a bijective mapping between proto and python (i.e. any integer
>>> type that passes through the Python SDK would get converted to an int64).
>>> And if we do need some way to distinguish between the integer types, I
>>> thought a de facto standard was better than creating our own - as Robert
>>> noted in that comment thread "The only strong opinion I have is that we
>>> shouldn't invent our own."
>>>
>>> As I was experimenting with different approaches I also discovered the
>>> numpy numeric types are very nice because you can instantiate them and they
>>> look just like ints, for example `np.int16(107) == 107` evaluates to true
>>> even though `type(np.int16(107)) == type(107)` does not.
>>>
>>> Another concern with python's int type is that it supports unlimited
>>> precision [2], so it's really not a good type to use for any of the schema
>>> ints. My PR as it is right now actually doesn't even support int. I
>>> probably should at least make a change to accept int as a type
>>> specification for int64 but throw an error when encoding if an int is too
>>> big.
>>>
>>
> I agree with Robert's position of not inventing our own. I assume we could
> make a decision between python native types, arrow types, and numpy types.
>
> What happens when one does np.int16(107) + 1, is the numpy type preserved?
>
> Another concern I have is, if there is a user function or a library that
> user does not control, that uses typing to indicate that a function accepts
> a type of int, would it be compatible with numpy types?
>
>
>>
>>> [1]
>>> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
>>> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric
>>>
>>> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <[email protected]> wrote:
>>> >
>>> > Thank you Brian.
>>> >
>>> > I did not spend enough time yet to review. Some early questions, I
>>> apologize if I missed an earlier discussion.
>>> > - Do we need to support python 2? If supporting python 2 will
>>> complicate things, we could make this a python3 only feature.
>>> > - Why are we mapping to numpy types? Design document suggests mapping
>>> to python native types as the plan.
>>> >
>>> > On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <[email protected]>
>>> wrote:
>>> >>
>>> >> tl;dr: I have a PR at [1] that defines an initial Schema API in
>>> python based on the typing module, and uses typing.NamedTuple to represent
>>> a Schema. There are some risks with that approach but I propose we move
>>> forward with it as a first draft and iterate.
>>> >>
>>> >>
>>> >> I've opened up a PR [1] that implements RowCoder in the Python SDK
>>> and verifies its compatibility with the Java implementation via tests in
>>> standard_coders.yaml. A lot of miscellaneous changes are required to get
>>> to that point, including a pretty significant one: providing some native
>>> python representation for schemas.
>>> >>
>>> >> As discussed in the PR description I opted to fully embrace the
>>> typing module for the native representation of schema types:
>>> >> - Primitive types all map to numpy types (e.g. np.int16, np.unicode).
>>> >> - Arrays map to typing.List. In https://s.apache.org/beam-schemas we
>>> settled on typing.Collection, but unfortunately this doesn't seem to be
>>> supported in python 2; I'm open to other suggestions here.
>>> >> - Map maps to typing.Mapping.
>>> >> - Rows map to typing.NamedTuple.
>>> >> - nullability is indicated with typing.Optional. Note there's no
>>> distinction between Optional[Optional[T]] and Optional[T] in typing, both
>>> map to Union[T, None] - so this is actually a good analog for the nullable
>>> flag on FieldType in schema.proto.
>>> >>
>>> >> With this approach a schema in Python might look like:
>>> >> ```
>>> >> class Movie(NamedTuple):
>>> >>   name: np.unicode
>>> >>   year: Optional[np.int16]
>>> >>
>>> >> # The class/type annotation syntax doesn't work in Python 2. Instead
>>> you can use:
>>> >> # Movie = NamedTuple('Movie', [('name', np.unicode), ('year',
>>> Optional[np.int16])])
>>> >>
>>> >> # DoFns annotated with_output_types(Movie) will use RowCoder
>>> >> coders.registry.register_coder(Movie, coders.RowCoder)
>>> >> ```
>>> >>
>>> >> I think the choice to use typing.NamedTuple as a row type is
>>> potentially controversial - Udi, Robert Bradshaw and I were already
>>> discussing it a bit in a comment on the portable schemas doc [2], but I
>>> wanted to bring that discussion to the ML.
>>> >>
>>> >> On the pro side:
>>> >> + NamedTuple is a pretty great analog for Java's Row type [3]. Both
>>> store attributes internally as an ordered collection (List<Object> in Row,
>>> a tuple in NamedTuple) and provide shortcuts for accessing those attributes
>>> by field name based on the schema.
>>> >> + NamedTuple is a native type, and we're trying to get out of the
>>> business of defining our own type hints (I think).
>>> >>
>>> >> On the con side:
>>> >> - When using the class-based version of NamedTuple in python 3 a user
>>> might be tempted to add more functionality to their class (for example,
>>> define a method) rather than just defining a schema - but I'm not sure
>>> we're prepared to guarantee that we will always produce an instance of
>>> their class, just something that has the defined attributes. This concern
>>> can potentially be alleviated once we have support for logical types.
>>> >>
>>> >> Unless there are any objections I think it would make sense to start
>>> with this implementation (documenting the limitations), and then iterate on
>>> it. Please take a look at the PR [1] and let me know what you think about
>>> this proposal.
>>> >>
>>> >> Thanks,
>>> >> Brian
>>> >>
>>> >> [1] https://github.com/apache/beam/pull/9188
>>> >> [2]
>>> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
>>> >> [3]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
>>>
>>