Nothing specific in mind. Any IDE that is open to plugins (e.g. VS Code and
JetBrains) could use it to validate execution plans in the background and
mark syntax errors based on the result.
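To illustrate the kind of check such a plugin could run, here is a toy, Spark-free Python sketch of the "empty table with the same schema" idea from the proposal quoted below: transformations are applied to a zero-row table that only knows its column names, so column-name typos surface without touching real data. All class and method names here are hypothetical, not Spark's API.

```python
# Toy dry-run sketch: validate column references against a schema
# without reading any data. Hypothetical, framework-free illustration;
# in Spark this would happen during plan analysis.

class DryRunFrame:
    """A zero-row 'table' that only knows its schema (column names)."""

    def __init__(self, columns):
        self.columns = list(columns)

    def select(self, *cols):
        # Fail fast on any column the schema does not contain.
        missing = [c for c in cols if c not in self.columns]
        if missing:
            raise ValueError(f"unresolved column(s): {missing}")
        return DryRunFrame(cols)

    def with_column(self, name, source_col):
        # New derived column is only valid if its source column resolves.
        if source_col not in self.columns:
            raise ValueError(f"unresolved column: {source_col}")
        return DryRunFrame(self.columns + [name])


# Pretend this schema was inferred from the upstream source.
df = DryRunFrame(["user_id", "event_time", "payload"])

df.select("user_id", "event_time")      # passes validation, no data touched
try:
    df.select("user_id", "event_tiem")  # typo caught without running a job
except ValueError as e:
    print(e)
```

This is only a schema-resolution check; it says nothing about data-dependent failures, which is why it complements rather than replaces testing on sample data.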

On Thu, Sep 30, 2021 at 4:40 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> What IDEs do you have in mind?
>
> On Thu, 30 Sept 2021 at 15:20, Ali Behjati <bahja...@gmail.com> wrote:
>
>> Yeah, it wouldn't remove the need for testing on sample data. It would be
>> more of a syntax check than a test. In my experience, syntax errors occur
>> a lot.
>>
>> Maybe once a dry-run mode exists, we will be able to build some automation
>> around basic syntax checking for IDEs too.
>>
>> On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> If testing, wouldn't you actually want to execute things? Even if at a
>>> small scale, on a sample of data?
>>>
>>> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>>
>>>> By "dry run" I mean the ability to validate the execution plan without
>>>> executing it. I was wondering whether this exists in Spark or not; I
>>>> couldn't find it anywhere.
>>>>
>>>> If it doesn't exist, I want to propose adding such a feature to Spark.
>>>>
>>>> Why is it useful?
>>>> 1. Faster testing: When using PySpark, or Spark with Scala/Java without
>>>> the Dataset API, we are prone to typos in column names and other logical
>>>> mistakes. Unfortunately, IDEs don't help much here, and when dealing with
>>>> big data, testing by running the code takes a lot of time. A dry run
>>>> would catch such typos very quickly.
>>>>
>>>> 2. (Continuous) integrity checks: When there are upstream and downstream
>>>> pipelines, we can detect breaking changes much faster by running the
>>>> downstream pipelines in "dry run" mode.
>>>>
>>>> I believe it is not so hard to implement, and I volunteer to work on it
>>>> if the community approves this feature request.
>>>>
>>>> It could be tackled in different ways. I have two ideas for the
>>>> implementation:
>>>> 1. A no-op execution engine
>>>> 2. On reads, just infer the schema and substitute an empty table with
>>>> the same schema
>>>>
>>>> Thanks,
>>>> Ali
>>>>
>>>
