Your mileage may vary, as usual.

Glue with DPUs looks like a strong contender for your data-transfer needs,
given its simplicity, scalability, and managed-service model. However, if
data-transfer speed is critical, or costs become a concern after testing,
consider EMR as an alternative.
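
For reference, the Glue side of such a lift-and-dump job stays small. A
minimal sketch, assuming a JDBC-reachable source (the connection URL,
credentials, table, and bucket below are placeholders, not your real
details):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Straight JDBC read -- URL, table, and credentials are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://myhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("fetchsize", "10000")  # larger fetches help bulk extracts
      .load())

# No transformations yet: land it in S3 as Parquet, as-is.
df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")

job.commit()

One edge case worth testing up front: a plain JDBC read runs over a single
connection unless you also set partitionColumn, lowerBound, upperBound and
numPartitions, so without those options most of your DPUs will sit idle
during the extract.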

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom


   view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand expert
opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Tue, 28 May 2024 at 09:04, Perez <flinkbyhe...@gmail.com> wrote:

> Thank you everyone for your response.
>
> I am not getting any errors as of now. I am just trying to choose the
> right tool for my task, which is loading data from an external source
> into S3 via Glue/EMR.
>
> I think a Glue job would be the best fit for me because I can calculate
> the DPUs needed (perhaps keeping some extra buffer), so I just wanted to
> check whether there are any edge cases I need to consider.
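>
> As a rough sizing sketch (assuming G.1X workers, which AWS documents as 1
> DPU = 4 vCPU and 16 GB of memory each): a ~100 GB extract split into ~128
> MB partitions is roughly 800 tasks, and 10 G.1X workers give ~160 GB of
> aggregate memory. A plain extract-and-write streams through the data
> rather than holding it all at once, so something like 10 workers plus a
> small buffer seems a reasonable starting point to benchmark.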
>
>
> On Tue, May 28, 2024 at 5:39 AM Russell Jurney <russell.jur...@gmail.com>
> wrote:
>
>> If you're using EMR and Spark, you need to choose nodes with enough RAM
>> to accommodate any given partition in your data, or you can get an OOM
>> error. Not sure if this job involves a reduce, but I would choose a
>> single 128 GB+ memory-optimized instance and then adjust parallelism per
>> the Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start
>> of your job.
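>>
>> A minimal sketch of that pattern (the paths and partition count are
>> placeholders; aim for partitions of a couple of hundred MB):
>>
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder.appName("historical-load").getOrCreate()
>>
>> # Source read -- the real JDBC/extract details are elided here.
>> df = spark.read.parquet("s3://my-bucket/raw/")
>>
>> # Repartition up front so no single partition outgrows executor memory;
>> # 400 is illustrative for ~100 GB (roughly 256 MB per partition).
>> df = df.repartition(400)
>>
>> df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")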
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>>
>> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I want to extract the data from a DB and just dump it into S3. I
>>> don't have to perform any transformations on the data yet. My data size
>>> would be ~100 GB (historical load).
>>>
>>> Choosing the right number of DPUs (Glue jobs) should solve this
>>> problem, right? Or should I move to EMR?
>>>
>>> I don't feel the need to move to EMR but wanted expert
>>> suggestions.
>>>
>>> TIA.
>>>
>>
