If you are testing, wouldn't you actually want to execute things, even if at a small scale, on a sample of the data?
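For example (a rough sketch of what I mean; the path and column names below are made up for illustration), running the same transformations on a small sample keeps the feedback loop short while still executing real code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sample-test").getOrCreate()

# Hypothetical input; only a tiny fraction of rows feeds the job.
events = spark.read.parquet("/data/events")
sample = events.sample(fraction=0.001, seed=42)

result = (
    sample
    .filter(F.col("event_type") == "purchase")
    .groupBy("user_id")
    .agg(F.sum("amount").alias("total_spent"))
)

result.show(10)  # executes for real, but only over the sampled rows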
On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com> wrote:

> Hey everyone,
>
> By "dry run" I mean the ability to validate the execution plan without
> actually executing it within the code. I was wondering whether this exists
> in Spark or not; I couldn't find it anywhere.
>
> If it doesn't exist, I want to propose adding such a feature to Spark.
>
> Why is it useful?
> 1. Faster testing: When using PySpark, or Spark on Scala/Java without
> Dataset, we are prone to typos and mistakes in column names and other
> logical problems. Unfortunately IDEs won't help much, and when dealing
> with Big Data, testing by running the code takes a lot of time. A dry run
> would let us catch typos very quickly.
>
> 2. (Continuous) integrity checks: When there are upstream and downstream
> pipelines, we can detect breaking changes much faster by running the
> downstream pipelines in "dry run" mode.
>
> I believe it is not so hard to implement, and I volunteer to work on it
> if the community approves this feature request.
>
> It can be tackled in different ways. I have two ideas for implementation:
> 1. A noop (no-op) executor engine
> 2. On reads, just infer the schema and replace the source with an empty
> table with the same schema
>
> Thanks,
> Ali
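For what it's worth, the second implementation idea above can already be approximated by hand today (a rough sketch; the read path is hypothetical and only its schema matters): build the query against an empty DataFrame with the same schema and ask Spark for the plan, which triggers analysis without submitting any job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dry-run-sketch").getOrCreate()

# Hypothetical source; we only use it to obtain the schema.
real_df = spark.read.parquet("/data/events")
empty_df = spark.createDataFrame([], real_df.schema)

# A typo in a column name raises an AnalysisException right here,
# with no real data being processed.
plan_check = (
    empty_df
    .select("user_id", "amount")
    .groupBy("user_id")
    .sum("amount")
)

# Prints the analyzed and optimized plans; nothing is executed.
plan_check.explain(True)

A built-in dry-run mode would presumably do something like this automatically for every read, which is what makes the proposal appealing for fast typo checks.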