RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
Hi, For clarification, are those 12 / 14 minutes cumulative CPU time or wall-clock time? How many executors executed those 1 / 375 tasks? Cheers, Enrico -------- Original message -------- From: Shashank Rao Date: 16.05.23 19:48 (GMT+01:00) To: user@spark.apache.org Subject:
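
To illustrate why the question above matters, here is a minimal sketch of how cumulative task time and wall-clock time can diverge when tasks run in parallel. The per-task duration (2 seconds) and the slot count (8) are assumptions chosen for illustration; they are not figures from the thread, and `wall_clock_estimate` is a hypothetical helper, not a Spark API.

```python
import heapq


def wall_clock_estimate(task_seconds, slots):
    """Estimate wall-clock time when tasks are assigned greedily to `slots` parallel cores.

    Each task goes to whichever slot becomes free first; the job ends when the
    busiest slot finishes.
    """
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for duration in task_seconds:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)


# Hypothetical workload: 375 tasks of 2 seconds each (durations are assumed).
tasks = [2.0] * 375

cumulative = sum(tasks)                      # total time summed over all tasks
wall = wall_clock_estimate(tasks, slots=8)   # elapsed time with 8 assumed task slots

print(f"cumulative task time: {cumulative:.0f} s")
print(f"estimated wall-clock: {wall:.0f} s")
```

Under these assumptions the cumulative figure is roughly eight times the elapsed one, so a duration reported for a 375-task read is hard to compare with a single-task read until it is clear which of the two is being quoted and how many task slots were available.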

Understanding Spark S3 Read Performance

2023-05-16 Thread Shashank Rao
Hi, I'm trying to set up a Spark pipeline that reads data from S3 and writes it into Google BigQuery. Environment details: Java 8, AWS EMR-6.10.0, Spark v3.3.1, 2 m5.xlarge executor nodes. S3 directory structure: bucket-name: |---folder1: |---folder2:

Spark shuffle and inevitability of writing to Disk

2023-05-16 Thread Mich Talebzadeh
Hi, On the issue of Spark shuffle, it is accepted that a shuffle *often involves* some, if not all, of the following: - Disk I/O - Data serialization and deserialization - Network I/O Excluding the external shuffle service and without relying on the configuration options provided by Spark for