Thanks Mich.

Yes, I agree on the costing part, but how would the data transfer speed be
impacted? Is it because Glue takes some time to initialize the underlying
resources before processing the data?


On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Your mileage may vary, as usual.
>
> Glue with DPUs seems like a strong contender for your data transfer needs
> based on the simplicity, scalability, and managed service aspects. However,
> if data transfer speed is critical or costs become a concern after testing,
> consider EMR as an alternative.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy>, Imperial College
> London <https://en.wikipedia.org/wiki/Imperial_College_London>
> London, United Kingdom
>
>
>    view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Tue, 28 May 2024 at 09:04, Perez <flinkbyhe...@gmail.com> wrote:
>
>> Thank you, everyone, for your responses.
>>
>> I am not getting any errors as of now. I am just trying to choose the
>> right tool for my task, which is loading data from an external source into
>> S3 via Glue/EMR.
>>
>> I think a Glue job would be the best fit for me because I can calculate
>> the DPUs needed (perhaps keeping some extra buffer), so I just wanted to
>> check whether there are any edge cases I need to consider.
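>>
>> As a back-of-the-envelope check on that sizing (assuming the standard
>> Glue figure of 1 DPU = 4 vCPU / 16 GB, i.e. one G.1X worker per DPU;
>> the 20% buffer factor is just an illustration):
>>
>> import math
>>
>> data_gb = 100        # historical load size from this thread
>> gb_per_worker = 16   # one G.1X worker ~ 1 DPU = 4 vCPU / 16 GB
>> buffer = 1.2         # ~20% extra headroom
>> workers = math.ceil(data_gb * buffer / gb_per_worker)  # -> 8 workers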
>>
>>
>> On Tue, May 28, 2024 at 5:39 AM Russell Jurney <russell.jur...@gmail.com>
>> wrote:
>>
>>> If you're using EMR and Spark, you need to choose nodes with enough RAM
>>> to accommodate any given partition in your data, or you can get an OOM
>>> error. Not sure if this job involves a reduce, but I would choose a single
>>> 128GB+ memory-optimized instance and then adjust parallelism, per the
>>> Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start of
>>> your job.
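>>>
>>> A minimal sketch of that pattern (plain PySpark; the paths and the
>>> partition count are placeholders, not from this thread):
>>>
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.appName("historical-load").getOrCreate()
>>>
>>> # repartition up front so no single partition outgrows executor RAM
>>> df = spark.read.parquet("s3://my-bucket/input/")  # placeholder source
>>> df = df.repartition(200)  # tune n to your data size and cluster
>>> df.write.mode("overwrite").parquet("s3://my-bucket/output/")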
>>>
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com
>>>
>>>
>>> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> I want to extract the data from a DB and just dump it into S3. I
>>>> don't have to perform any transformations on the data yet. My data size
>>>> would be ~100 GB (historical load).
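>>>>
>>>> For context, roughly what I mean, as a PySpark sketch (assuming a
>>>> SparkSession named spark, as in a Glue or EMR job; the JDBC URL,
>>>> table, credentials, and bounds are made-up placeholders):
>>>>
>>>> df = (spark.read.format("jdbc")
>>>>       .option("url", "jdbc:postgresql://host:5432/db")  # placeholder
>>>>       .option("dbtable", "schema.big_table")            # placeholder
>>>>       .option("user", "user")
>>>>       .option("password", "***")
>>>>       .option("numPartitions", 16)      # parallel JDBC reads
>>>>       .option("partitionColumn", "id")  # needs a numeric column
>>>>       .option("lowerBound", 1)
>>>>       .option("upperBound", 10000000)
>>>>       .load())
>>>> df.write.mode("overwrite").parquet("s3://my-bucket/historical/")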
>>>>
>>>> Choosing the right DPUs (Glue jobs) should solve this problem, right?
>>>> Or should I move to EMR?
>>>>
>>>> I don't feel the need to move to EMR but wanted expert
>>>> suggestions.
>>>>
>>>> TIA.
>>>>
>>>
