What IDEs do you have in mind?


On Thu, 30 Sept 2021 at 15:20, Ali Behjati <bahja...@gmail.com> wrote:

> Yeah, it doesn't remove the need for testing on sample data. It would
> be more of a syntax check than a test. I have seen syntax errors come
> up a lot in practice.
>
> Maybe once a dry-run mode exists we will be able to build some
> automation around basic syntax checking for IDEs too.
>
> On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <sro...@gmail.com> wrote:
>
>> If testing, wouldn't you actually want to execute things, even at a
>> small scale, on a sample of the data?
>>
>> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com> wrote:
>>
>>> Hey everyone,
>>>
>>>
>>> By dry run I mean the ability to validate the execution plan without
>>> actually executing it. I was wondering whether this already exists in
>>> Spark, but I couldn't find it anywhere.
>>>
>>> If it doesn't exist, I want to propose adding such a feature to Spark.
>>>
>>> Why is it useful?
>>> 1. Faster testing: when using PySpark, or Spark on Scala/Java without
>>> the typed Dataset API, we are prone to typos in column names and other
>>> logical mistakes. Unfortunately IDEs don't help much here, and when
>>> dealing with big data, testing by actually running the code takes a
>>> long time. A dry run would surface such typos quickly (the sketch
>>> above shows how far eager analysis already gets us).
>>>
>>> 2. (Continuous) integrity checks: when there are upstream and
>>> downstream pipelines, we can catch breaking changes much faster by
>>> running the downstream pipelines in "dry run" mode, as sketched below.
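>>>
>>> A minimal sketch of such a check (schema and names are invented for
>>> illustration): build the upstream's contracted output schema as an
>>> empty DataFrame and make sure the downstream logic still analyzes
>>> against it:
>>>
>>>     from pyspark.sql import SparkSession
>>>     from pyspark.sql import types as T
>>>
>>>     spark = SparkSession.builder.getOrCreate()
>>>
>>>     # Schema the upstream pipeline is expected to produce
>>>     # (hypothetical).
>>>     upstream_schema = T.StructType([
>>>         T.StructField("user_id", T.LongType()),
>>>         T.StructField("amount", T.DoubleType()),
>>>     ])
>>>     empty_upstream = spark.createDataFrame([], upstream_schema)
>>>
>>>     # If upstream renames or drops a column, this fails with an
>>>     # AnalysisException here rather than mid-run in production.
>>>     downstream = empty_upstream.groupBy("user_id").sum("amount")
>>>     downstream.collect()  # completes instantly on zero rows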
>>>
>>> I believe it is not so hard to implement, and I volunteer to work on
>>> it if the community approves this feature request.
>>>
>>> It can be tackled in different ways. I have two ideas for the
>>> implementation:
>>> 1. A no-op (noop) execution engine
>>> 2. On reads, only infer the schema and replace the source with an
>>> empty table of the same schema
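>>>
>>> On idea 1: Spark 3.0+ already ships a built-in "noop" sink that is
>>> close in spirit, e.g.
>>>
>>>     # Runs the whole plan but discards all output; useful for
>>>     # benchmarking, though not a true dry run since reads and
>>>     # shuffles still execute.
>>>     df.write.format("noop").mode("overwrite").save()
>>>
>>> so a real dry-run engine would need to stop before task execution
>>> rather than just swallow the output.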
>>>
>>> Thanks,
>>> Ali
>>>
>>
