Yeah, it doesn't remove the need for testing on sample data. It would be
more of a syntax check than a test. I have seen that syntax errors occur
a lot.

Maybe once we have a dry-run mode we will also be able to build some
automation around basic syntax checking for IDEs.
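
To give a rough idea of what that automation could hook into, here is a
minimal PySpark sketch, assuming the check is approximated today by
analyzing the pipeline against an empty DataFrame with the expected
schema. The schema and column names are invented for the example; real
tooling would pull the schema from the catalog or file metadata rather
than hard-coding it:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType
    from pyspark.sql.utils import AnalysisException

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Invented stand-in for the production input schema.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("amount", LongType()),
    ])

    def build_pipeline(df):
        # A typo such as "amout" here would raise AnalysisException
        # during analysis, before any data is ever read.
        return df.groupBy("user_id").sum("amount")

    try:
        build_pipeline(spark.createDataFrame([], schema))
        print("plan OK")
    except AnalysisException as err:
        print(f"plan error: {err}")

An IDE plugin or pre-commit hook could run exactly this kind of check.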

On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <sro...@gmail.com> wrote:

> If testing, wouldn't you actually want to execute things, even if at a
> small scale, on a sample of data?
>
> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com> wrote:
>
>> Hey everyone,
>>
>>
>> By dry run I mean the ability to validate the execution plan, without
>> executing it, from within the code. I was wondering whether this exists
>> in Spark or not; I couldn't find it anywhere.
>>
>> If it doesn't exist, I want to propose adding such a feature to Spark.
>>
>> Why is it useful?
>> 1. Faster testing: When using PySpark, or Spark on Scala/Java without
>> Dataset, we are prone to typos in column names and other logical
>> mistakes. Unfortunately IDEs don't help much here, and when dealing with
>> big data, testing by running the code takes a lot of time. A dry run
>> would surface such typos very fast.
>>
>> 2. (Continuous) integrity checks: When there are upstream and
>> downstream pipelines, we can detect breaking changes much faster by
>> running the downstream pipelines in "dry run" mode (see the sketch
>> after this list).
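>>
>> To make (2) concrete, here is a rough sketch (all table and column
>> names are invented): the downstream job is analyzed against the schema
>> the upstream pipeline claims to produce, so a renamed or dropped column
>> fails fast instead of failing in the nightly run.
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>
>> spark = SparkSession.builder.master("local[1]").getOrCreate()
>>
>> # Schema the upstream pipeline publishes (imagine it comes from a
>> # schema registry or the table catalog).
>> upstream_schema = StructType([
>>     StructField("order_id", StringType()),
>>     StructField("total", DoubleType()),
>> ])
>>
>> # Downstream logic, analyzed against an empty stand-in table.
>> orders = spark.createDataFrame([], upstream_schema)
>> report = orders.groupBy("order_id").sum("total")
>>
>> # If upstream renamed "total", the aggregation above would raise
>> # AnalysisException immediately; no cluster time is spent discovering
>> # the break.
>> report.explain()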
>>
>> I believe it is not too hard to implement, and I volunteer to work on
>> it if the community approves this feature request.
>>
>> It can be tackled in different ways. I have two ideas for the
>> implementation:
>> 1. A no-op executor engine
>> 2. On reads, just infer the schema and replace the source with an empty
>> table of the same schema (sketched below)
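>>
>> A very rough sketch of idea 2 (dry_read_parquet is a hypothetical
>> helper to illustrate the approach, not an existing Spark API):
>>
>> from pyspark.sql import SparkSession, DataFrame
>>
>> spark = SparkSession.builder.master("local[1]").getOrCreate()
>>
>> def dry_read_parquet(path: str) -> DataFrame:
>>     # Only schema inference touches storage (file listing plus Parquet
>>     # footers); the result has the real schema but zero rows, so the
>>     # rest of the pipeline is analyzed without running real jobs.
>>     schema = spark.read.parquet(path).schema
>>     return spark.createDataFrame([], schema)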
>>
>> Thanks,
>> Ali
>>
>
