Dear Spark Community,

I'm writing to seek your expertise in optimizing the performance of our Spark History Server (SHS) deployed on Amazon EKS. We're encountering timeouts (HTTP 504) when loading large event logs exceeding 5 GB.
*Our Setup:*
- Deployment: SHS on EKS behind an Nginx ingress (idle connection timeout: 60 seconds)
- Instance: memory-optimized, with sufficient RAM and CPU
- Spark daemon memory: 30 GB
- Kubernetes namespace memory limit: 128 GB
- The backing S3 bucket has a lifecycle policy that deletes objects older than *7 days*
- Spark History Server options:

  sparkHistoryOpts: "-Dspark.history.fs.logDirectory=s3a://<bucket-name>/eks-infra-use1/
    -Dspark.history.retainedApplications=1
    -Dspark.history.ui.maxApplications=20
    -Dspark.history.store.serializer=PROTOBUF
    -Dspark.hadoop.fs.s3a.threads.max=25
    -Dspark.hadoop.fs.s3a.connection.maximum=650
    -Dspark.hadoop.fs.s3a.readahead.range=512K
    -Dspark.history.fs.endEventReparseChunkSize=2m
    -Dspark.history.store.maxDiskUsage=30g"

*Problem:*
- SHS times out (HTTP 504) when loading large event logs (8 GB or more).

*Request:*
We would greatly appreciate any insights or suggestions to improve the performance of our SHS and prevent these timeouts. We are particularly interested in:
- Additional configuration options we should consider for handling large event logs (a rough sketch of the options we are weighing is in the P.S. below)
- Nginx ingress adjustments that could help with the timeouts (also sketched in the P.S. below)
- Best practices for optimizing SHS performance on EKS

We appreciate any assistance you can provide. Thank you for your time and support.

Sincerely,
--
Vikas Tharyani
Associate Manager, DevOps
Nielsen
www.nielsen.com <https://global.nielsen.com/>
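
P.S. For concreteness, below are the two changes we are currently sketching out. Both are untested drafts: the hostname, service name, ingress class, thread count, and timeout/memory values are placeholders rather than our actual settings, and we would welcome corrections if we have picked the wrong levers.

1) Ingress side: raising the ingress-nginx proxy timeouts via annotations, so the proxy stops returning 504 while SHS is still replaying a large event log. This assumes the standard ingress-nginx annotations apply to our controller:

   # Hypothetical Ingress for the SHS UI; timeout values are in seconds.
   apiVersion: networking.k8s.io/v1
   kind: Ingress
   metadata:
     name: spark-history-server
     annotations:
       nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
       nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"   # default is 60s
       nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
   spec:
     ingressClassName: nginx                 # placeholder class name
     rules:
       - host: shs.example.internal          # placeholder hostname
         http:
           paths:
             - path: /
               pathType: Prefix
               backend:
                 service:
                   name: spark-history-server   # placeholder service name
                   port:
                     number: 18080              # default SHS UI port

2) SHS side: options we are considering appending to the existing sparkHistoryOpts string above, based on our reading of the Spark monitoring docs (values are guesses):

     -Dspark.history.store.hybridStore.enabled=true
     -Dspark.history.store.hybridStore.maxMemoryUsage=4g
     -Dspark.history.fs.numReplayThreads=8

   Our understanding is that the hybrid store keeps parsed application data in memory before flushing it to the disk store, and numReplayThreads controls how many threads replay event logs from S3. On the application side, we are also considering rolling event logs (spark.eventLog.rolling.enabled with spark.eventLog.rolling.maxFileSize, plus spark.history.fs.eventLog.rolling.maxFilesToRetain on SHS) so individual logs stop growing to 8 GB, if that is the recommended approach.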