Hello Ali! I've implemented a dry run in my data pipeline using a schema
repository. My pipeline takes a "dataset descriptor", which is a JSON
document describing the dataset you want to build, loads some "entities",
applies some transformations and then writes the final dataset.
The "dataset descriptor" is where users can make mistakes, or where they
may have reimplemented some steps inside the pipeline. So, to perform a
dry run, we first separated the actions from the transformations. Each
step inside the pipeline has "input", "transform" and "write" methods.
When we want to "dry run" a pipeline, we obtain the schemas of the
entities and build "empty RDDs" that we use as input to the pipeline.
Finally, we just trigger an action to test that all the selected columns
and queries in the "dataset descriptor" are OK.
This is how you can create an empty dataset:

from pyspark import RDD

emp_rdd: RDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emp_rdd, schema)  # schema fetched from the schema repository
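
For context, here is a minimal, self-contained sketch of the whole dry-run
flow. The schema, the path and the function names below are made up for
illustration; they are not the real pipeline code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Schema that would normally be fetched from the schema repository
# (hardcoded here for illustration).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("country", StringType()),
])

def input_step(dry_run: bool):
    if dry_run:
        # Empty DataFrame with the real schema: no data is read.
        return spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
    return spark.read.parquet("s3://some-bucket/entities/users")  # made-up path

def transform_step(df):
    # Column selections/queries taken from the "dataset descriptor";
    # a typo such as "contry" fails right here during the dry run.
    return df.select("user_id", "country")

# Dry run: skip the "write" step and trigger a cheap action instead.
transform_step(input_step(dry_run=True)).collect()

Because the input is empty, the action finishes in seconds but still runs
Spark's analysis over every column and query the descriptor selects.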

Ramiro.

On Thu, Sep 30, 2021 at 11:48 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok thanks.
>
> What is your experience of VS Code (in terms of capabilities), as it is
> becoming a standard tool available in cloud workspaces like Amazon
> WorkSpaces?
>
> Mich
>
> On Thu, 30 Sept 2021 at 15:43, Ali Behjati <bahja...@gmail.com> wrote:
>
>> Nothing specific in mind. Any IDE that is open to plugins (e.g. VS Code
>> and the JetBrains IDEs) could use it to validate execution plans in the
>> background and mark syntax errors based on the result.
>>
>> On Thu, Sep 30, 2021 at 4:40 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> What IDEs do you have in mind?
>>>
>>> On Thu, 30 Sept 2021 at 15:20, Ali Behjati <bahja...@gmail.com> wrote:
>>>
>>>> Yeah, it doesn't remove the need for testing on sample data. It would
>>>> be more of a syntax check than a test. I have seen that such syntax
>>>> errors occur a lot.
>>>>
>>>> Maybe once a dry-run mode exists we will also be able to build some
>>>> automation around basic syntax checking for IDEs.
>>>>
>>>> On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> If testing, wouldn't you actually want to execute things, even if at
>>>>> a small scale, on a sample of data?
>>>>>
>>>>> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <bahja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>>
>>>>>> By dry run I mean the ability to validate the execution plan without
>>>>>> actually executing it. I was wondering whether this exists in Spark
>>>>>> or not; I couldn't find it anywhere.
>>>>>>
>>>>>> If it doesn't exist, I want to propose adding such a feature to Spark.
>>>>>>
>>>>>> Why is it useful?
>>>>>> 1. Faster testing: When using PySpark, or Spark on Scala/Java without
>>>>>> the Dataset API, we are prone to typos in column names and other
>>>>>> logical mistakes. Unfortunately IDEs won't help much, and when dealing
>>>>>> with big data, testing by running the code takes a lot of time. A dry
>>>>>> run would surface such typos very fast.
>>>>>>
>>>>>> 2. (Continuous) integrity checks: When there are upstream and
>>>>>> downstream pipelines, we can detect breaking changes much faster by
>>>>>> running the downstream pipelines in "dry run" mode.
>>>>>>
>>>>>> I believe it is not so hard to implement and I volunteer to work on
>>>>>> it if the community approves this feature request.
>>>>>>
>>>>>> It can be tackled in different ways. I have two ideas for the
>>>>>> implementation:
>>>>>> 1. A noop (no-op) executor engine (see the sketch below)
>>>>>> 2. On reads, just infer the schema and replace the source with an
>>>>>> empty table with the same schema
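>>>>>>
>>>>>> For reference on idea 1: if I'm not mistaken, Spark 3.0+ already
>>>>>> ships a "noop" batch sink (added for benchmarking). It executes the
>>>>>> full query plan but discards the output, so it is close to, though
>>>>>> not exactly, a dry run:
>>>>>>
>>>>>> # Runs the whole plan but writes nothing (Spark 3.0+ "noop" source).
>>>>>> df.write.format("noop").mode("overwrite").save()
>>>>>>
>>>>>> A true dry run would stop after analysis and planning, without
>>>>>> launching any jobs.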
>>>>>>
>>>>>> Thanks,
>>>>>> Ali
>>>>>>
>>>>>
