Yeah, it doesn't remove the need for testing on sample data. It would be more of a syntax check than a test. I have seen that syntax errors occur a lot.
Maybe after having dry run we will also be able to build some automation around basic syntax checking for IDEs.

On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <sro...@gmail.com> wrote:

> If testing, wouldn't you actually want to execute things, even if at a
> small scale, on a sample of data?
>
> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com> wrote:
>
>> Hey everyone,
>>
>> By dry run I mean the ability to validate the execution plan without
>> executing it within the code. I was wondering whether this exists in
>> Spark or not; I couldn't find it anywhere.
>>
>> If it doesn't exist, I want to propose adding such a feature to Spark.
>>
>> Why is it useful?
>> 1. Faster testing: When using PySpark, or Spark on Scala/Java without
>> Dataset, we are prone to typos and mistakes in column names and other
>> logical problems. Unfortunately IDEs don't help much here, and when
>> dealing with big data, testing by running the code takes a long time.
>> A dry run would surface such typos very quickly.
>>
>> 2. (Continuous) integrity checks: When there are upstream and
>> downstream pipelines, we can detect breaking changes much faster by
>> running downstream pipelines in "dry run" mode.
>>
>> I believe it is not so hard to implement, and I volunteer to work on it
>> if the community approves this feature request.
>>
>> It can be tackled in different ways. I have two ideas for an
>> implementation:
>> 1. A noop (no-op) executor engine
>> 2. On reads, just infer the schema and replace the source with an empty
>> table with the same schema
>>
>> Thanks,
>> Ali
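To make the second implementation idea concrete, here is a minimal sketch in plain Python (deliberately independent of Spark) of what "infer the schema on reads and validate references without touching data" could look like. All names here (`DryRunFrame`, `AnalysisError`) are hypothetical illustrations, not Spark APIs:

```python
# Sketch of implementation idea 2: on reads, keep only the inferred
# schema and resolve column references against it, without ever
# producing or scanning any rows.

class AnalysisError(Exception):
    """Raised when a plan references a column that does not exist."""

class DryRunFrame:
    def __init__(self, schema):
        # schema: mapping of column name -> type name, e.g. {"id": "long"}
        self.schema = dict(schema)

    def select(self, *cols):
        # Resolve each referenced column against the schema only;
        # typos are caught here, with no job ever running.
        missing = [c for c in cols if c not in self.schema]
        if missing:
            raise AnalysisError(f"cannot resolve columns: {missing}")
        return DryRunFrame({c: self.schema[c] for c in cols})

# In dry-run mode, a "read" would return such a frame built from the
# inferred schema of the source:
df = DryRunFrame({"user_id": "long", "event": "string", "ts": "timestamp"})
df.select("user_id", "event")       # resolves fine
try:
    df.select("user_id", "evnt")    # typo caught instantly, no data read
except AnalysisError as e:
    print(e)
```

The point of the sketch is that analysis-time validation needs only schemas, so a dry-run pipeline could run in seconds regardless of data volume.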