Thanks for the feedback!

We're still developing the hooks, so that prototype repo isn't
representative of the end goal. Here's a WIP design doc for this operator:
https://slime-hunter-536.notion.site/Design-Doc-Airflow-Pandera-Provider-a352cc3c49844a0dbacff16ba40ff079

In summary, we're planning to have an operator that connects to SQL-like
databases, validates the output of the query, and uploads the result to
some target location (e.g. cloud blob store).

Since pandera is a parsing (coercing dtypes) and validation (checking
constraints) library, the additional value of the provider would be to
handle fetching data from some source (blob store, SQL db) and uploading
the data to some destination (also blob store or SQL db).
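
To make the fetch / validate / upload flow a bit more concrete, here's a
rough sketch in plain Python. It is not the provider's actual design: the
names `run_validated_query` and `check_column1_le_10` are made up for
illustration, sqlite3 stands in for "some SQL-like database", a local CSV
file stands in for the blob store, and the `validate` callable is only a
placeholder for where something like `Schema.validate` would plug in.

```python
import csv
import sqlite3
from pathlib import Path
from typing import Callable, List, Sequence


def run_validated_query(
    conn: sqlite3.Connection,
    sql: str,
    validate: Callable[[Sequence[str], List[tuple]], None],
    target: Path,
) -> int:
    """Fetch rows, validate them, write them to `target`; return the row count."""
    # 1. Fetch: run the query against the source database.
    cursor = conn.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()

    # 2. Validate: runs before anything is written, so a failed check
    #    (signalled by an exception) leaves no output behind.
    validate(columns, rows)

    # 3. Upload: a local CSV file is an assumed stand-in for a blob store.
    with target.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    return len(rows)


def check_column1_le_10(columns: Sequence[str], rows: List[tuple]) -> None:
    """Hypothetical toy check, loosely mirroring a `le=10` constraint."""
    idx = list(columns).index("column1")
    bad = [row for row in rows if row[idx] > 10]
    if bad:
        raise ValueError(f"column1 must be <= 10, offending rows: {bad}")
```

In a real operator the connection would presumably come from an Airflow
hook, and the validate step would hand a dataframe to a pandera schema
instead of looping over tuples.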

Additionally, in the future, pandera will support a reporting layer that
writes human-readable artifacts providing a granular view of validation
results, which the airflow pandera provider can use to create data
documentation (e.g. with tasks documentation:
https://airflow.apache.org/docs/apache-airflow/stable/tutorial/fundamentals.html#adding-dag-and-tasks-documentation
).

Happy to discuss further!

-NB

On Mon, Dec 5, 2022 at 7:32 PM Jarek Potiuk <[email protected]> wrote:

> With tools like Pandera, which are (as I understand) pretty standalone
> and easy to test fully automatically (I believe it can be fully
> tested with Pandas dataframes), I personally think (but that's just
> my opinion) we should have no problem with approving a new provider
> as another community provider. No need for external services to run
> the tests on,
> the dependencies are "typical" data science ones - and they are nicely
> "lower-bounded" only (so no limiting of other providers by your
> provider). License is good (MIT). You have nice, comprehensive
> documentation, and I see the value you might provide with dataframe
> validations. I understand this is a "complete" solution - no external
> services needed to make use of Pandera, right?
>
> But we are generally rather cautious about accepting new providers,
> mainly because of the potential overhead it can cause for maintainers,
> and especially where external "services" are involved and there is no
> "organisation" that could afford maintenance of the provider. Those
> were all points raised in the past, most recently in this vote on the
> Cloudera provider:
> https://lists.apache.org/thread/8b1jvld3npgzz2z0o3gv14lvtornbdrm. But
> you do not fall in the same camp, as I understand.
>
> However, Pandera is really a Python package that can be installed and
> used in a Python operator via its API, so I wonder what the value of
> the operators/integration is, since you can simply install Pandera
> locally and run:
>
> #####################################
> import pandera as pa
> from airflow.decorators import task
> from pandera.typing import Series
>
> class Schema(pa.SchemaModel):
>
>     column1: Series[int] = pa.Field(le=10)
>     column2: Series[float] = pa.Field(lt=-1.2)
>     column3: Series[str] = pa.Field(str_startswith="value_")
>
>     @pa.check("column3")
>     def column_3_check(cls, series: Series[str]) -> Series[bool]:
>         """Check that values have two elements after being split with '_'"""
>         return series.str.split("_", expand=True).shape[1] == 2
>
> @task
> def my_task():
>     # read_dataframe_somehow() is a placeholder for your own loading code
>     df = read_dataframe_somehow()
>     Schema.validate(df)
> #####################################
>
> And similarly with the classic PythonOperator.
>
> Do you really need a Pandera provider for that case, or are you
> thinking of more operators?
>
> In the prototype I see there are no Hooks, which means there are no
> connections, and suddenly the operator for Pandera is not that useful
> - because in Airflow you can add Python code like that very easily.
>
> I also wonder what others think,
>
> J.
>
> On Mon, Dec 5, 2022 at 6:51 PM Niels Bantilan <[email protected]>
> wrote:
> >
> > Hello,
> >
> > I'm writing to propose a new pandera provider:
> > https://pandera.readthedocs.io/en/stable/
> >
> > Pandera is a Python dataframe validation and testing library similar to
> > Great Expectations and Soda, with a focus on simple syntax not too
> > dissimilar from dataclasses/pydantic.
> >
> > I wanted to start a discussion here on the different options of where
> > the repo would be hosted:
> >
> > - as a third party provider
> > - as an official Astronomer provider, e.g.
> >   https://github.com/astronomer/airflow-provider-great-expectations
> > - other options?
> >
> > I'm working with a collaborator on this prototype repo:
> > https://github.com/erichamers/airflow-provider-pandera.
> >
> > This'll be my first time contributing to the Airflow ecosystem.
> >
> > Best,
> > Niels
>
