Tox and Pyspark

2024-05-28 Thread Perez
Hi Team, I need help with this: https://stackoverflow.com/questions/78547676/tox-with-pyspark

Re: OOM concern

2024-05-28 Thread Perez
Thanks Mich for the detailed explanation. On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage > varies. For a 100 GB data transfer, the speed difference between Glue and > EMR might not be significant, especially consid

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-28 Thread Gera Shegalov
I agree with the previous answers that (if requirements allow it) it is much easier to just orchestrate a copy, either in the same app or as an external sync. A long time ago, and not for a Spark app, we solved a similar use case via https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-h

Re: OOM concern

2024-05-28 Thread Russell Jurney
If Glue lets you take a configuration based approach, and you don't have to operate any servers as with EMR... use Glue. Try EMR if that is troublesome. Russ On Tue, May 28, 2024 at 9:23 AM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage > varies

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Russell mentioned some of these issues before. So in short your mileage varies. For a 100 GB data transfer, the speed difference between Glue and EMR might not be significant, especially considering the benefits of Glue's managed service aspects. However, for much larger datasets or scenarios where

Re: OOM concern

2024-05-28 Thread Perez
Thanks Mich. Yes, I agree on the costing part, but how is the data transfer speed impacted? Is it because Glue takes some time to initialize the underlying resources before it processes the data? On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh wrote: > Your mileage varies as usual > > Glue with D

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Your mileage varies, as usual. Glue with DPUs seems like a strong contender for your data transfer needs based on the simplicity, scalability, and managed service aspects. However, if data transfer speed is critical or costs become a concern after testing, consider EMR as an alternative. HTH Mich