Re: [Spark-Core] Spark Dry Run

2021-10-04 Thread Ali Behjati
Hey Ramiro,

Thank you for your detailed answer.
We also have a similar framework that does the same thing, and I have seen
very good results from it. However, pipelines written as plain Spark apps need
to be changed to adopt such a framework, and that takes a lot of effort. This
is why I'm suggesting adding it to Spark core, to make it available to everyone
out of the box.

-
Ali


Re: [Spark-Core] Spark Dry Run

2021-10-04 Thread Ramiro Laso
Hello Ali! I've implemented a dry run in my data pipeline using a schema
repository. My pipeline takes a "dataset descriptor", which is a JSON document
describing the dataset you want to build, loads some "entities", applies some
transformations and then writes the final dataset.
It is in the "dataset descriptor" that users can make mistakes, or when they
reimplement some steps inside the pipeline. So, to perform a dry run, we first
separated the actions from the transformations. Each step inside the pipeline
has "input", "transform" and "write" methods. When we want to "dry run" a
pipeline, we obtain the schemas of the entities and build "empty RDDs" that we
use as the input of the pipeline. Finally, we just trigger an action to test
that all selected columns and queries in the "dataset descriptor" are OK.
This is how you can create an empty DataFrame with a given schema:

emp_RDD = spark.sparkContext.emptyRDD()      # RDD with zero rows
df = spark.createDataFrame(emp_RDD, schema)  # empty DataFrame that carries the real schema
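
As a minimal, self-contained sketch of the whole trick (the schema and the
column names here are only illustrative stand-ins for what the schema
repository would return), a typo in a selected column fails at analysis time,
without reading any real data:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Stand-in for a schema fetched from the schema repository.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("order_total", DoubleType()),
])

empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

# Selecting a misspelled column raises AnalysisException right here,
# even though the DataFrame holds no rows.
checked = empty_df.select("customer_id", "order_total")
checked.count()  # cheap action over zero rows; confirms the plan runs end to end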

Ramiro.



Re: [Spark-Core] Spark Dry Run

2021-09-30 Thread Mich Talebzadeh
Ok thanks.

What is your experience of VS Code (in terms of capabilities), as it is
becoming a standard tool available in cloud workspaces such as Amazon
WorkSpaces?

Mich



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






Re: [Spark-Core] Spark Dry Run

2021-09-30 Thread Ali Behjati
Nothing specific in mind. Any IDE that is open to plugins could use it (e.g.
VS Code and the JetBrains IDEs) to validate execution plans in the background
and mark syntax errors based on the result.



Re: [Spark-Core] Spark Dry Run

2021-09-30 Thread Mich Talebzadeh
What IDEs do you have in mind?



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






Re: [Spark-Core] Spark Dry Run

2021-09-30 Thread Ali Behjati
Yeah, it doesn't remove the need for testing on sample data. It would be more
of a syntax check than a test; I have seen such syntax errors come up a lot.

Maybe once we have a dry run we will also be able to build some automation
around basic syntax checking for IDEs.



Re: [Spark-Core] Spark Dry Run

2021-09-30 Thread Sean Owen
If testing, wouldn't you actually want to execute things, even if at a small
scale, on a sample of data?



[Spark-Core] Spark Dry Run

2021-09-30 Thread Ali Behjati
Hey everyone,


By dry run I mean the ability to validate the execution plan without actually
executing it. I was wondering whether this exists in Spark or not; I couldn't
find it anywhere.

If it doesn't exist, I want to propose adding such a feature to Spark.

Why is it useful?
1. Faster testing: When using PySpark, or Spark on Scala/Java without the
Dataset API, we are prone to typos in column names and other logical mistakes.
Unfortunately, IDEs won't help much, and when dealing with big data, testing by
running the code takes a lot of time. A dry run would surface such typos very
quickly.

2. (Continuous) integrity checks: When there are upstream and downstream
pipelines, we can detect breaking changes much faster by running the
downstream pipelines in "dry run" mode.

I believe it is not so hard to implement, and I volunteer to work on it if the
community approves this feature request.

It can be tackled in different ways. I have two ideas for the implementation
(a rough sketch of both follows below):
1. A noop (no-op) execution engine
2. On reads, just infer the schema and replace the source with an empty table
that has the same schema
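
For the first idea, note that Spark 3.x already ships a "noop" write format: it
discards the output but still executes the whole upstream plan, so a true
dry-run engine would have to go further and skip execution entirely. For the
second idea, here is a rough sketch of what user code could do today; the input
path and the build_pipeline function are placeholders, not real APIs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch of idea 2: infer only the schema, substitute an empty input,
# and run the pipeline's transformations against it.
input_schema = spark.read.parquet("s3://bucket/events/").schema  # reads file footers, not the data
empty_input = spark.createDataFrame([], input_schema)
result = build_pipeline(empty_input)  # placeholder for the real transformation code
result.count()  # cheap action over zero rows; bad column references fail at analysis time

# For comparison, the existing no-op sink executes the full plan and discards the output:
result.write.format("noop").mode("overwrite").save()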

Thanks,
Ali