@Niels - Would you be willing to maintain the provider in your own repos?

And if it is accepted into the Apache Airflow repo, would you or someone
you know be open to maintaining it with newer APIs?

There are pros and cons to maintaining it under any of the three options
you have listed.

Some points for maintaining it separately (in your own repos):

   - Release at your cadence
   - Better tracking for issues and milestones
   - Isolated testing with your own CI/CD, including integration tests

One major con is that it won't be part of the Airflow constraints file.

Regards,
Kaxil

On Tue, 6 Dec 2022 at 15:26, Niels Bantilan <[email protected]>
wrote:

> Thanks for the feedback!
>
> We're still developing the hooks, so that prototype repo isn't
> representative of the end goal. Here's a WIP design doc for this operator:
> https://slime-hunter-536.notion.site/Design-Doc-Airflow-Pandera-Provider-a352cc3c49844a0dbacff16ba40ff079
>
> In summary, we're planning to have an operator that connects to SQL-like
> databases, validates the output of the query, and uploads the result to
> some target location (e.g. cloud blob store).
>
> Since pandera is a parsing (coercing dtypes) and validation (checking
> constraints) library, the additional value of the provider would be to
> handle fetching data from some source (blob store, SQL db) and uploading
> the data to some destination (also a blob store or SQL db).
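>
> As a very rough, hypothetical sketch (PanderaSQLOperator, OrdersSchema,
> and the parameters below are illustrative placeholders from the design
> stage, not a released API), usage could look like:
>
> #####################################
> # Hypothetical usage sketch; the operator name and its parameters are
> # placeholders, not a final API.
> validate_and_upload = PanderaSQLOperator(
>     task_id="validate_orders",
>     conn_id="my_postgres",                 # source SQL connection
>     sql="SELECT * FROM orders",            # query whose output is validated
>     schema=OrdersSchema,                   # a pandera SchemaModel
>     destination="s3://my-bucket/orders/",  # target blob store location
> )
> #####################################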
>
> Additionally, in the future, pandera will support a reporting layer that
> writes human-readable artifacts providing a granular view of validation
> results, which the Airflow pandera provider can use to create data
> documentation (e.g. with task documentation
> https://airflow.apache.org/docs/apache-airflow/stable/tutorial/fundamentals.html#adding-dag-and-tasks-documentation
> ).
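>
> As a sketch of that last point (assuming a future pandera reporting API;
> validation_report_md below is just a placeholder for such a report), the
> provider could attach it through Airflow's existing task documentation
> attribute:
>
> #####################################
> # doc_md renders markdown for the task in the Airflow UI;
> # validation_report_md stands in for a future pandera report artifact.
> validate_and_upload.doc_md = validation_report_md
> #####################################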
>
> Happy to discuss further!
>
> -NB
>
> On Mon, Dec 5, 2022 at 7:32 PM Jarek Potiuk <[email protected]> wrote:
>
>> With tools like Pandera, which are (as I understand) pretty standalone
>> and easy to test fully automatically (I believe it can be fully tested
>> with Pandas dataframes), I think personally (but that's my personal
>> opinion) we should have no problem with approving a new provider as
>> another community provider. There is no need for external services to
>> run the tests on; the dependencies are "typical" data science ones - and
>> they are nicely "lower-bounded" only (so no limiting of other providers
>> by your provider). The license is good (MIT). You have nice, comprehensive
>> documentation, and I see the value you might provide with dataframe
>> validations. I understand this is a "complete" solution - no external
>> services needed to make use of Pandera, right?
>>
>> But we are generally rather cautious about accepting new providers -
>> mainly because of the potential overhead it can cause for maintainers,
>> especially where external "services" are involved and there is no
>> "organisation" that could afford to maintain the provider - those were
>> all the points raised in the past, most recently in the vote on the
>> Cloudera provider
>> https://lists.apache.org/thread/8b1jvld3npgzz2z0o3gv14lvtornbdrm. But as
>> I understand it, you do not fall into the same camp.
>>
>> However, Pandera is really a Python package that can be installed and
>> used in a Python operator via its API - so I wonder what the value of
>> the operators/integration is, since you can simply install Pandera
>> locally and run:
>>
>> #####################################
>> import pandera as pa
>> from pandera.typing import Series
>> from airflow.decorators import task
>>
>> class Schema(pa.SchemaModel):
>>
>>     column1: Series[int] = pa.Field(le=10)
>>     column2: Series[float] = pa.Field(lt=-1.2)
>>     column3: Series[str] = pa.Field(str_startswith="value_")
>>
>>     @pa.check("column3")
>>     def column_3_check(cls, series: Series[str]) -> Series[bool]:
>>         """Check that values have two elements after being split with
>> '_'"""
>>         return series.str.split("_", expand=True).shape[1] == 2
>>
>> @task
>> def my_task():
>>     df = read_dataframe_somehow()  # however you load the dataframe
>>     Schema.validate(df)
>> #####################################
>>
>> And similarly with the classic PythonOperator.
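>>
>> For instance, a minimal sketch of the same check as a classic
>> PythonOperator task could look like this (reusing Schema and
>> read_dataframe_somehow from the snippet above):
>>
>> #####################################
>> from airflow.operators.python import PythonOperator
>>
>> def validate_df():
>>     df = read_dataframe_somehow()  # same placeholder as above
>>     Schema.validate(df)
>>
>> validate = PythonOperator(
>>     task_id="validate_df",
>>     python_callable=validate_df,
>> )
>> #####################################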
>>
>> Do you really need a Pandera provider for that case or do you think of
>> more operators?
>>
>> In the prototype I see there are no Hooks, which means there are no
>> connections, and suddenly the operator for Pandera is not that useful
>> - because in Airflow you can add Python code like that very easily.
>>
>> I also wonder what others think,
>>
>> J.
>>
>> On Mon, Dec 5, 2022 at 6:51 PM Niels Bantilan <[email protected]>
>> wrote:
>> >
>> > Hello,
>> >
>> > I'm writing to propose a new pandera provider:
>> https://pandera.readthedocs.io/en/stable/
>> >
>> > Pandera is a Python dataframe validation and testing library similar to
>> Great Expectations and Soda, with a focus on simple syntax not too
>> dissimilar from dataclasses/pydantic.
>> >
>> > I wanted to start a discussion here on the different options of where
>> the repo would be hosted:
>> >
>> >    - as a third party provider
>> >    - as an official Astronomer provider, e.g.
>> > https://github.com/astronomer/airflow-provider-great-expectations
>> >    - other options?
>> >
>> > I'm working with a collaborator on this prototype repo:
>> https://github.com/erichamers/airflow-provider-pandera.
>> >
>> > This'll be my first time contributing to the Airflow ecosystem.
>> >
>> > Best,
>> > Niels
>>
>
