Re: OOM concern

Perez Tue, 28 May 2024 23:34:19 -0700

Thanks Mich for the detailed explanation.

On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:


> Russell mentioned some of these issues before. In short, your mileage may
> vary. For a 100 GB data transfer, the speed difference between Glue and
> EMR might not be significant, especially considering the benefits of Glue's
> managed service aspects. However, for much larger datasets or scenarios
> where speed is critical, EMR's customization options might provide a slight
> edge.
>
> My recommendations:
>
> - Test and compare: if speed is a concern, consider running a test job
>   with both Glue and EMR (if feasible) on a smaller subset of your data to
>   compare transfer times and costs in your specific environment.
> - Focus on benefits: if the speed difference with Glue is minimal but it
>   offers significant benefits in terms of management and cost for your use
>   case, Glue might still be the preferable option.
> - Bandwidth: ensure your network bandwidth between the database and S3 is
>   sufficient to handle the data transfer rate, regardless of the service
>   you choose.
>
>
> HTH
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
> London <https://en.wikipedia.org/wiki/Imperial_College_London>
> London, United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Tue, 28 May 2024 at 16:40, Perez <flinkbyhe...@gmail.com> wrote:
>
>> Thanks Mich.
>>
>> Yes, I agree on the costing part, but how would the data transfer speed be
>> impacted? Is it because Glue takes some time to initialize the underlying
>> resources before it processes the data?
>>
>>
>> On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Your mileage may vary, as usual.
>>>
>>> Glue with DPUs seems like a strong contender for your data transfer
>>> needs based on the simplicity, scalability, and managed service aspects.
>>> However, if data transfer speed is critical or costs become a concern after
>>> testing, consider EMR as an alternative.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial
>>> College London <https://en.wikipedia.org/wiki/Imperial_College_London>
>>> London, United Kingdom
>>>
>>>
>>> On Tue, 28 May 2024 at 09:04, Perez <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Thank you everyone for your response.
>>>>
>>>> I am not getting any errors as of now. I am just trying to choose the
>>>> right tool for my task, which is loading data from an external source
>>>> into S3 via Glue/EMR.
>>>>
>>>> I think a Glue job would be the best fit for me because I can calculate
>>>> the DPUs needed (maybe keeping some extra buffer), so I just wanted to
>>>> check if there are any edge cases I need to consider.
>>>>
>>>>
>>>> On Tue, May 28, 2024 at 5:39 AM Russell Jurney <
>>>> russell.jur...@gmail.com> wrote:
>>>>
>>>>> If you’re using EMR and Spark, you need to choose nodes with enough
>>>>> RAM to accommodate any given partition in your data or you can get an OOM
>>>>> error. Not sure if this job involves a reduce, but I would choose a single
>>>>> 128GB+ memory-optimized instance and then adjust parallelism as per the
>>>>> Spark docs using pyspark.sql.DataFrame.repartition(n) at the start of your
>>>>> job.
>>>>>
>>>>> Thanks,
>>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>>> <http://facebook.com/jurney> datasyndrome.com
>>>>>
>>>>>
>>>>> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> I want to extract the data from DB and just dump it into S3. I
>>>>>> don't have to perform any transformations on the data yet. My data size
>>>>>> would be ~100 GB (historical load).
>>>>>>
>>>>>> Choosing the right number of DPUs (Glue jobs) should solve this
>>>>>> problem, right? Or should I move to EMR?
>>>>>>
>>>>>> I don't feel the need to move to EMR, but I wanted expert
>>>>>> suggestions.
>>>>>>
>>>>>> TIA.
>>>>>>
>>>>>
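
Russell's point above about keeping each partition small enough to fit in memory can be sketched roughly as follows. This is only an illustration, not something anyone in the thread posted: the helper names, the 128 MB target partition size, and the table/column names in the usage note are all assumptions. The JDBC option names (partitionColumn, lowerBound, upperBound, numPartitions) are the standard Spark JDBC data source options.

```python
import math


def estimate_num_partitions(dataset_gb: float, target_partition_mb: int = 128) -> int:
    """Rough partition count so each partition holds about target_partition_mb.

    The 128 MB default is an assumed rule of thumb, not a value from the thread.
    """
    if dataset_gb <= 0 or target_partition_mb <= 0:
        raise ValueError("dataset_gb and target_partition_mb must be positive")
    return max(1, math.ceil(dataset_gb * 1024 / target_partition_mb))


def jdbc_partition_options(column: str, lower: int, upper: int, num_partitions: int) -> dict:
    """Build Spark JDBC options that split the read into ranges of `column`.

    `column` should be a numeric, date, or timestamp column; the bounds only
    decide the partition stride, they do not filter rows.
    """
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def dump_table_to_s3(spark, jdbc_url: str, table: str, s3_path: str, options: dict) -> None:
    """Read a table over JDBC in parallel and write it to S3 as Parquet.

    `spark` is an existing SparkSession (for example the one a Glue or EMR
    job already provides); nothing here is Glue- or EMR-specific.
    """
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", table)
        .options(**options)
        .load()
    )
    df.write.mode("overwrite").parquet(s3_path)


# For the ~100 GB historical load discussed in the thread:
n = estimate_num_partitions(100)  # 800 partitions of roughly 128 MB each
opts = jdbc_partition_options("id", 1, 1_000_000, n)  # "id" and bounds are made up
```

With a partitioned JDBC read like this, a separate df.repartition(n) call may not be needed, since the data already arrives split into n ranges. If the source table has no suitable numeric column, calling repartition(n) right after the read, as Russell suggests, is the fallback, although the initial read then goes through a single connection. A high numPartitions value also means many JDBC connections, so check what the database can tolerate.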

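
Mich's advice earlier in the thread (check the bandwidth, and time a run on a smaller subset before committing) comes down to simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the function names, the 0.7 efficiency factor, and the linear extrapolation are assumptions, and it ignores fixed costs such as Glue or EMR startup time.

```python
def transfer_time_hours(dataset_gb: float, bandwidth_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to move dataset_gb over a link of bandwidth_mbps.

    Uses decimal units (1 GB = 8 gigabits, 1000 Mbps = 1 Gbps). `efficiency`
    is an assumed fraction of the nominal bandwidth actually achieved.
    """
    if dataset_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    gigabits = dataset_gb * 8
    effective_gbps = bandwidth_mbps / 1000 * efficiency
    return gigabits / effective_gbps / 3600


def extrapolate_from_sample(sample_gb: float, sample_seconds: float, full_gb: float) -> float:
    """Scale a measured subset run linearly to the full dataset, in seconds.

    Linear scaling is a simplification: startup overhead does not grow with
    data size, and throughput may change at larger volumes.
    """
    if sample_gb <= 0 or sample_seconds < 0 or full_gb < 0:
        raise ValueError("invalid input")
    return sample_seconds * (full_gb / sample_gb)


# 100 GB over a nominal 1 Gbps link at the assumed 70% efficiency
hours = transfer_time_hours(100, 1000)

# A 5 GB test run that took 4 minutes, scaled to the full 100 GB load
full_seconds = extrapolate_from_sample(5, 240, 100)
```

If the estimate from the measured subset is far worse than the bandwidth-only figure, the bottleneck is more likely the source database or the read parallelism than the network.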