Tools like Pandera are (as I understand) pretty standalone and easy to
test fully automatically (I believe everything can be tested with plain
Pandas dataframes).

In general I think (and this is just my personal opinion) we
should have no problem with approving this as another community
provider. There is no need for external services to run the tests on,
the dependencies are "typical" data science ones - and they are nicely
"lower-bounded" only (so your provider does not limit other
providers). The license is good (MIT). You have nice, comprehensive
documentation, and I see the value you might provide with dataframe
validations. I understand this is a "complete" solution - no external
services are needed to make use of Pandera, right?

But we are generally rather cautious about accepting new providers -
mainly because of the potential overhead they can cause for maintainers,
especially where external "services" are involved and there is no
"organisation" that could afford maintenance of the provider. Those
were the points raised in the past, most recently in the vote on the
Cloudera provider:
https://lists.apache.org/thread/8b1jvld3npgzz2z0o3gv14lvtornbdrm
But as I understand it, you do not fall into the same camp.

However, Pandera is really a Python package that can be installed and
used via its API from any Python task, so I wonder what the value of
dedicated operators/integration is, since you can simply install
Pandera locally and run:

#####################################
import pandera as pa
from airflow.decorators import task
from pandera.typing import Series

class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

@task
def my_task():
    # read_dataframe_somehow() is a placeholder for your own loading code
    df = read_dataframe_somehow()
    # Raises a pandera SchemaError (and so fails the task) on invalid data
    Schema.validate(df)
#####################################

And similarly with the classic PythonOperator.

Do you really need a Pandera provider for that case, or do you have
more operators in mind?

In the prototype I see there are no Hooks, which means there are no
connections, and suddenly the operator for Pandera is not that useful
- because in Airflow you can add Python code like that very easily.

I also wonder what others think,

J.

On Mon, Dec 5, 2022 at 6:51 PM Niels Bantilan <[email protected]> wrote:
>
> Hello,
>
> I'm writing to propose a new pandera provider: 
> https://pandera.readthedocs.io/en/stable/
>
> Pandera is a Python dataframe validation and testing library similar to Great 
> Expectations and Soda, with a focus on simple syntax not too dissimilar from 
> dataclasses/pydantic.
>
> I wanted to start a discussion here on the different options of where the 
> repo would be hosted:
>
> - as a third party provider
> - as an official Astronomer provider, e.g. 
>   https://github.com/astronomer/airflow-provider-great-expectations
> - other options?
>
> I'm working with a collaborator on this prototype repo: 
> https://github.com/erichamers/airflow-provider-pandera.
>
> This'll be my first time contributing to the Airflow ecosystem
>
> Best,
> Niels
