Yes, transformations are executed on the worker nodes, but only when they
are needed, which usually means when an action is called. This lazy
evaluation lets Spark optimize the execution plan before running it, for
example by pipelining transformations and removing unnecessary
computations.
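
As a rough illustration of the idea, here is a plain-Python analogy using
generators. This is not Spark code; the function names and the `executed`
list are hypothetical and exist only to show that the chained steps do no
work until something consumes the result.

```python
# Plain-Python analogy only: generators defer work in a way that loosely
# resembles Spark transformations waiting for an action.
executed = []  # records every input row that was actually processed


def double(rows):
    """Lazily double each row, noting when the work really happens."""
    for r in rows:
        executed.append(r)
        yield r * 2


def keep_multiples_of_four(rows):
    """Lazily keep only the rows divisible by four."""
    return (r for r in rows if r % 4 == 0)


# "Transformations": the pipeline is only described here, nothing runs.
pipeline = keep_multiples_of_four(double(range(10)))
work_before_action = len(executed)

# "Action": consuming the pipeline triggers both steps, row by row.
result = list(pipeline)
```

In PySpark the counterpart would be chaining calls such as `filter` and
`select` on a DataFrame and then calling an action such as `count()`.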

"I may need something like that for synthetic data for testing. Any way to
do that ?"

Have a look at this.

https://github.com/joke2k/faker
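
If a library is not an option, a minimal sketch using only the Python
standard library could look like the following. The column names, the city
list and the `synthetic_rows` helper are assumptions made up for this
example; Faker would supply more realistic values for the same shape of
data.

```python
import random

# Hypothetical schema and value pool, chosen only for illustration.
COLUMNS = ["id", "city", "amount"]
CITIES = ["London", "Paris", "Berlin", "Madrid"]


def synthetic_rows(n, seed=42):
    """Yield n synthetic (id, city, amount) tuples, reproducible per seed."""
    rng = random.Random(seed)  # private generator, so runs are repeatable
    for i in range(n):
        amount = round(rng.uniform(1.0, 1000.0), 2)
        yield (i, rng.choice(CITIES), amount)


# Materialise a small sample on the driver side.
rows = list(synthetic_rows(1000))
```

With PySpark available, `spark.createDataFrame(rows, COLUMNS)` should turn
this list into a DataFrame. For very large volumes, building the list on
the driver will not scale; starting from `spark.range(n)` and deriving
columns from it keeps the generation on the executors.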

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Mon, 18 Mar 2024 at 07:16, Sreyan Chakravarty <sreya...@gmail.com> wrote:

>
> On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>>
>> No Data Transfer During Creation: --> Data transfer occurs only when an
>> action is triggered.
>> Distributed Processing: --> DataFrames are distributed for parallel
>> execution, not stored entirely on the driver node.
>> Lazy Evaluation Optimization: --> Delaying data transfer until necessary
>> enhances performance.
>> Shuffle vs. Partitioning: --> Data movement during partitioning is not
>> considered a shuffle in Spark terminology.
>> Shuffles involve more complex data rearrangement.
>>
>
> So just to be clear the transformations are always executed on the worker
> node but it is just transferred until an action on the dataframe is
> triggered.
>
> Am I correct ?
>
> If so, then how do I generate a large dataset ?
>
> I may need something like that for synthetic data for testing. Any way to
> do that ?
>
>
> --
> Regards,
> Sreyan Chakravarty
>
