Your mileage may vary, as usual. Glue with DPUs seems like a strong contender for your data transfer needs given its simplicity, scalability, and managed-service nature. However, if data transfer speed is critical or costs become a concern after testing, consider EMR as an alternative.
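For illustration, a minimal Glue (PySpark) script for this kind of straight DB-to-S3 dump could look like the sketch below. The JDBC URL, credentials, table, partition column, and bucket are all placeholders, and the DPU/worker count is set in the Glue job configuration rather than in the script:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from the source database over JDBC; the partitioning options
# split the read across executors so no single task holds the whole table.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sourcedb")  # placeholder endpoint
    .option("dbtable", "public.big_table")                     # placeholder table
    .option("user", "loader")                                  # placeholder user
    .option("password", "...")          # fetch from Secrets Manager in practice
    .option("fetchsize", "10000")       # stream rows rather than buffering them
    .option("partitionColumn", "id")    # numeric column to split reads on
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "16")
    .load()
)

# No transformations yet: dump straight to S3 as Parquet.
df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")  # placeholder bucket

job.commit()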
HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy>, Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Tue, 28 May 2024 at 09:04, Perez <flinkbyhe...@gmail.com> wrote:

> Thank you everyone for your response.
>
> I am not getting any errors as of now. I am just trying to choose the
> right tool for my task, which is loading data from an external source into
> S3 via Glue/EMR.
>
> I think a Glue job would be the best fit for me because I can calculate the
> DPUs needed (perhaps keeping some extra buffer), so I just wanted to check
> whether there are any edge cases I need to consider.
>
>
> On Tue, May 28, 2024 at 5:39 AM Russell Jurney <russell.jur...@gmail.com>
> wrote:
>
>> If you're using EMR and Spark, you need to choose nodes with enough RAM
>> to accommodate any given partition in your data, or you can get an OOM
>> error. Not sure if this job involves a reduce, but I would choose a single
>> 128GB+ memory-optimized instance and then adjust parallelism per the
>> Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start of
>> your job.
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>>
>> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I want to extract the data from a DB and just dump it into S3. I don't
>>> have to perform any transformations on the data yet. My data size would
>>> be ~100 GB (historical load).
>>>
>>> Choosing the right DPUs (Glue jobs) should solve this problem, right?
>>> Or should I move to EMR?
>>>
>>> I don't feel the need to move to EMR but wanted expert suggestions.
>>>
>>> TIA.
>>>
>>
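For Russell's repartition suggestion above, a minimal sketch follows; the staging input path and the 800-partition figure are illustrative assumptions, not numbers from the thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db-to-s3-load").getOrCreate()

# However the source DataFrame is obtained (e.g. a JDBC read); a Parquet
# staging path stands in here as a placeholder.
df = spark.read.parquet("s3://my-bucket/staging/")

# Rule of thumb: keep each partition well below executor memory. For
# ~100 GB of input, ~128 MB per partition works out to roughly 800
# partitions, so no single task has to hold an oversized partition.
df = df.repartition(800)

df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")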