With tools like Pandera, which are (as I understand) pretty standalone and easy to test fully automatically (I believe it can be fully tested with just Pandas dataframes), I personally think (but that's only my opinion) we should have no problem with approving it as another community provider. No external services are needed to run the tests, and the dependencies are the "typical" data science ones - and they are nicely "lower-bounded" only (so your provider does not limit other providers). The license is good (MIT). You have nice, comprehensive documentation, and I see the value you might provide with dataframe validations. I understand this is a "complete" solution - no external services are needed to make use of Pandera, right?

But we are generally rather cautious about accepting new providers - mainly because of the potential overhead it can cause for maintainers, especially where external "services" are involved and there is no "organisation" that could afford maintenance of the provider. Those were all points raised in the past, most recently in the vote on the Cloudera provider: https://lists.apache.org/thread/8b1jvld3npgzz2z0o3gv14lvtornbdrm. But you do not fall in the same camp, as I understand.

However, Pandera is really a Python package that can be installed and used via its API in a Python operator, so I wonder what the value of the operators/integration is, since you can simply install Pandera locally and run:

#####################################
import pandera as pa
from airflow.decorators import task
from pandera.typing import Series


class Schema(pa.SchemaModel):
    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2


@task
def my_task():
    # read_dataframe_somehow() is a placeholder for however you load your data
    df = read_dataframe_somehow()
    # Raises a Pandera schema error (and so fails the task) on invalid data
    Schema.validate(df)
#####################################

And similarly with the classic PythonOperator. Do you really need a Pandera provider for that case, or are you thinking of more operators?
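To illustrate the classic PythonOperator variant, here is a rough sketch and not a reference implementation. It assumes Airflow 2.4+ (for the `schedule` argument) and a recent Pandera. The `dag_id`, the `validate_dataframe` and `build_dag` names, and the sample data are all made up for illustration. It uses Pandera's object-based `DataFrameSchema` instead of the class-based schema, and leaves out the custom `column3` check for brevity. The heavy imports live inside the functions so that parsing the DAG file stays cheap.

```python
from datetime import datetime


def read_dataframe_somehow():
    """Hypothetical loader; replace with however you obtain your dataframe."""
    import pandas as pd  # imported lazily to keep DAG parsing light

    return pd.DataFrame(
        {
            "column1": [1, 4, 10],
            "column2": [-1.3, -1.4, -2.9],
            "column3": ["value_1", "value_2", "value_3"],
        }
    )


def validate_dataframe():
    """Task callable: Pandera raises if a check fails, which fails the task."""
    import pandera as pa

    schema = pa.DataFrameSchema(
        {
            "column1": pa.Column(int, checks=pa.Check.le(10)),
            "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
            "column3": pa.Column(str, checks=pa.Check.str_startswith("value_")),
        }
    )
    # Nothing is returned, so no dataframe ends up in XCom.
    schema.validate(read_dataframe_somehow())


def build_dag():
    """Wire the callable into a DAG; call this at module level in a DAG file."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="pandera_validation_example",  # hypothetical name
        start_date=datetime(2022, 12, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="validate_dataframe",
            python_callable=validate_dataframe,
        )
    return dag
```

In a real DAG file you would assign `dag = build_dag()` at module level so the scheduler picks it up. Either way, this is just plain Python plus an installed package, with no provider needed.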
In the prototype I see there are no Hooks, which means there are no connections, and suddenly the operator for Pandera is not that useful - because in Airflow you can add Python code like that very easily.

I also wonder what others think,

J.

On Mon, Dec 5, 2022 at 6:51 PM Niels Bantilan <[email protected]> wrote:
>
> Hello,
>
> I'm writing to propose a new pandera provider:
> https://pandera.readthedocs.io/en/stable/
>
> Pandera is a Python dataframe validation and testing library similar to Great
> Expectations and Soda, with a focus on simple syntax not too dissimilar from
> dataclasses/pydantic.
>
> I wanted to start a discussion here on the different options of where the
> repo would be hosted:
>
> - as a third party provider
> - as an official Astronomer provider, e.g.
>   https://github.com/astronomer/airflow-provider-great-expectations
> - other options?
>
> I'm working with a collaborator on this prototype repo:
> https://github.com/erichamers/airflow-provider-pandera.
>
> This'll be my first time contributing to the Airflow ecosystem
>
> Best,
> Niels
