If you’re using EMR and Spark, you need to choose nodes with enough RAM to
accommodate any given partition in your data, or you can get an OOM error.
I'm not sure if this job involves a reduce, but I would choose a single 128GB+
memory-optimized instance and then adjust parallelism per the Spark docs,
using pyspark.sql.DataFrame.repartition(n) at the start of your job.
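
For what it's worth, here is a minimal PySpark sketch of that pattern: read over
JDBC, repartition up front so no single partition has to hold too much of the
~100 GB, then dump straight to S3. The connection URL, table name, partition
column, partition count, and bucket path are placeholders you'd swap for your
own, and the JDBC driver jar has to be on the classpath.

    # Minimal sketch, assuming a JDBC source; all URLs, names, and bounds
    # below are placeholders, not values from the original thread.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("db-to-s3-dump").getOrCreate()

    # Read the source table over JDBC. partitionColumn/bounds/numPartitions
    # control how many parallel reads Spark issues against the database.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder
        .option("dbtable", "public.my_table")                  # placeholder
        .option("user", "user")                                # placeholder
        .option("password", "password")                        # placeholder
        .option("partitionColumn", "id")   # assumes a numeric key column
        .option("lowerBound", 1)
        .option("upperBound", 100000000)
        .option("numPartitions", 200)
        .load()
    )

    # Repartition at the start of the job so each partition stays small
    # enough to fit comfortably in executor memory (~500 MB each here).
    df = df.repartition(200)

    # No transformations yet; write straight out to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")  # placeholder path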

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com


On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I want to extract the data from DB and just dump it into S3. I
> don't have to perform any transformations on the data yet. My data size
> would be ~100 GB (historical load).
>
> Choosing the right DPUs (Glue jobs) should solve this problem, right? Or
> should I move to EMR?
>
> I don't feel the need to move to EMR, but I wanted expert suggestions.
>
> TIA.
>
