Thanks, Mich, for the detailed explanation.

On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Russell mentioned some of these issues before. So in short, your mileage
> varies. For a 100 GB data transfer, the speed difference between Glue and
> EMR might not be significant, especially considering the benefits of Glue's
> managed service aspects. However, for much larger datasets or scenarios
> where speed is critical, EMR's customization options might provide a slight
> edge.
>
> My recommendations:
>
> - Test and compare: If speed is a concern, consider running a test job with
>   both Glue and EMR (if feasible) on a smaller subset of your data to
>   compare transfer times and costs in your specific environment.
> - Focus on benefits: If the speed difference with Glue is minimal but it
>   offers significant benefits in terms of management and cost for your use
>   case, Glue might still be the preferable option.
> - Bandwidth: Ensure your network bandwidth between the database and S3 is
>   sufficient to handle the data transfer rate, regardless of the service
>   you choose.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
> London <https://en.wikipedia.org/wiki/Imperial_College_London>
> London, United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Tue, 28 May 2024 at 16:40, Perez <flinkbyhe...@gmail.com> wrote:
>
>> Thanks Mich.
>>
>> Yes, I agree on the costing part, but how is the data transfer speed
>> impacted?
>> Is it because Glue takes some time to initialize the underlying
>> resources and then process the data?
>>
>> On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Your mileage varies, as usual.
>>>
>>> Glue with DPUs seems like a strong contender for your data transfer
>>> needs based on the simplicity, scalability, and managed service aspects.
>>> However, if data transfer speed is critical or costs become a concern after
>>> testing, consider EMR as an alternative.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> On Tue, 28 May 2024 at 09:04, Perez <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Thank you everyone for your responses.
>>>>
>>>> I am not getting any errors as of now. I am just trying to choose the
>>>> right tool for my task, which is loading data from an external source
>>>> into S3 via Glue/EMR.
>>>>
>>>> I think a Glue job would be the best fit for me because I can calculate
>>>> the DPUs needed (maybe keeping some extra buffer), so I just wanted to
>>>> check if there are any edge cases I need to consider.
>>>>
>>>> On Tue, May 28, 2024 at 5:39 AM Russell Jurney <
>>>> russell.jur...@gmail.com> wrote:
>>>>
>>>>> If you’re using EMR and Spark, you need to choose nodes with enough
>>>>> RAM to accommodate any given partition in your data or you can get an OOM
>>>>> error. Not sure if this job involves a reduce, but I would choose a single
>>>>> 128GB+ memory-optimized instance and then adjust parallelism, as per the
>>>>> Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start of
>>>>> your job.
>>>>>
>>>>> Thanks,
>>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>>> <http://facebook.com/jurney> datasyndrome.com
>>>>>
>>>>> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> I want to extract the data from a DB and just dump it into S3. I
>>>>>> don't have to perform any transformations on the data yet. My data size
>>>>>> would be ~100 GB (historical load).
>>>>>>
>>>>>> Choosing the right DPUs (Glue jobs) should solve this problem, right?
>>>>>> Or should I move to EMR?
>>>>>>
>>>>>> I don't feel the need to move to EMR but wanted expert
>>>>>> suggestions.
>>>>>>
>>>>>> TIA.
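
For anyone sizing a similar job, here is a rough back-of-envelope sketch in Python (the thread already points at PySpark). It estimates a partition count that could be passed to pyspark.sql.DataFrame.repartition(n), as Russell suggests, and a lower bound on transfer time for a given link speed, per Mich's bandwidth point. The 128 MB target partition size and the 1 Gbit/s bandwidth are assumptions picked for illustration, not figures from the thread, and the helper names partition_count and min_transfer_seconds are hypothetical.

```python
GB = 10**9  # decimal gigabyte, in bytes
MB = 10**6  # decimal megabyte, in bytes


def partition_count(total_bytes: int, target_partition_bytes: int) -> int:
    """Smallest n such that n equal-sized partitions each fit the target size."""
    if total_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return (total_bytes + target_partition_bytes - 1) // target_partition_bytes


def min_transfer_seconds(total_bytes: int, bandwidth_bits_per_sec: float) -> float:
    """Lower bound on transfer time if the network link is the only bottleneck."""
    if bandwidth_bits_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return total_bytes * 8 / bandwidth_bits_per_sec


total = 100 * GB                         # ~100 GB historical load (from the thread)
n = partition_count(total, 128 * MB)     # 128 MB per partition: an assumption
secs = min_transfer_seconds(total, 1e9)  # 1 Gbit/s link: an assumption

print(f"partitions: {n}")                # could be used as df.repartition(n)
print(f"network lower bound: {secs / 60:.1f} minutes")
```

These numbers are only a floor: database read throughput, serialization, and job start-up time will all add to the real figure, so a test run on a subset of the data is still the better guide.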