If testing, wouldn't you actually want to execute things, even if at a
small scale, on a sample of data?

On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com> wrote:

> Hey everyone,
>
>
> By dry run I mean the ability to validate the execution plan without
> actually executing it. I was wondering whether this already exists in
> Spark; I couldn't find it anywhere.
>
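> For context, the closest approximation I know of today is that plan
> analysis in Spark is eager, so just building a DataFrame and calling
> explain() validates the plan without launching a job. A minimal sketch in
> Scala (assuming a SparkSession named `spark` and a hypothetical input
> path):
>
>     val df = spark.read.parquet("/data/events")  // reads only the schema, no job yet
>     df.select("user_id").explain()               // prints the plan; no job is run
>
> The catch is that the read above still needs live access to the source to
> infer the schema, and any intermediate action in a pipeline still executes
> for real.
>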
> If it doesn't exist, I want to propose adding such a feature to Spark.
>
> Why is it useful?
> 1. Faster testing: When using PySpark, or Spark on Scala/Java without the
> typed Dataset API, we are prone to typos in column names and other logical
> mistakes. Unfortunately, IDEs won't help much there, and when dealing with
> big data, testing by actually running the code takes a long time. A dry
> run would surface such typos very quickly (see the sketch after this
> list).
>
> 2. (Continuous) integrity checks: When there are upstream and downstream
> pipelines, we could detect breaking changes much faster by running the
> downstream pipelines in "dry run" mode.
>
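> A quick illustration of the kind of mistake I mean (hypothetical column
> name; the analyzer rejects it before any job runs):
>
>     val users = spark.read.parquet("/data/users")
>     users.select("user_idd")  // throws AnalysisException: cannot resolve 'user_idd'
>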
> I believe it is not so hard to implement, and I volunteer to work on it if
> the community approves this feature request.
>
> It can be tackled in different ways; I have two ideas for the
> implementation (rough sketches below):
> 1. A noop (no-op) execution engine that stops after planning
> 2. On reads, only infer the schema and substitute an empty table with the
> same schema
>
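> Rough, untested sketches of what I mean, in Scala. For idea 1: Spark 3
> already ships a `noop` data source for benchmarking, but it still executes
> the whole plan and merely discards the output; the proposed engine would
> stop right after analysis/planning:
>
>     df.write.format("noop").mode("overwrite").save()  // runs the plan, discards rows
>
> For idea 2 (assuming a SparkSession named `spark` and a hypothetical
> path):
>
>     import org.apache.spark.sql.Row
>     val schema = spark.read.parquet("/data/events").schema  // schema inference only
>     val emptyEvents =
>       spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
>     // run the downstream transformations against emptyEvents; they
>     // analyze (and even execute) almost instantly since there is no data
>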
> Thanks,
> Ali
>
