Dear Spark Community,

I'm writing to seek your expertise in optimizing the performance of our
Spark History Server (SHS) deployed on Amazon EKS. We're encountering
timeouts (HTTP 504) when loading large event logs exceeding 5 GB.

*Our Setup:*

   - Deployment: SHS on EKS with Nginx ingress (idle connection timeout: 60
   seconds)
   - Instance type: memory-optimized, with ample RAM and CPU
   - Spark daemon memory: 30 GB
   - Kubernetes namespace limit: 128 GB
   - Backend S3 bucket: a lifecycle policy deletes objects older than *7
   days*
   - Spark History Server options: the sparkHistoryOpts value below (a
   deployment sketch follows the block)

sparkHistoryOpts:
"-Dspark.history.fs.logDirectory=s3a://<bucket-name>/eks-infra-use1/
-Dspark.history.retainedApplications=1
-Dspark.history.ui.maxApplications=20
-Dspark.history.store.serializer=PROTOBUF
-Dspark.hadoop.fs.s3a.threads.max=25
-Dspark.hadoop.fs.s3a.connection.maximum=650
-Dspark.hadoop.fs.s3a.readahead.range=512K
-Dspark.history.fs.endEventReparseChunkSize=2m
-Dspark.history.store.maxDiskUsage=30g"
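
For completeness, this is roughly how those options are wired into the
server in our deployment. The snippet below is an illustrative sketch, not
our exact manifest: the image name and resource figures are placeholders,
but SPARK_DAEMON_MEMORY and SPARK_HISTORY_OPTS are the standard environment
variables the history server start script reads.

containers:
  - name: spark-history-server
    image: <spark-image>   # placeholder image
    env:
      - name: SPARK_NO_DAEMONIZE
        value: "true"
      - name: SPARK_DAEMON_MEMORY
        value: "30g"
      - name: SPARK_HISTORY_OPTS
        # the sparkHistoryOpts string above is injected here (abbreviated)
        value: >-
          -Dspark.history.fs.logDirectory=s3a://<bucket-name>/eks-infra-use1/
          -Dspark.history.store.serializer=PROTOBUF
          -Dspark.history.store.maxDiskUsage=30g
    resources:
      limits:
        memory: 64Gi   # placeholder; the namespace limit is 128 GB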

*Problem:*

   - The SHS UI times out (HTTP 504) when loading large event logs (8 GB or
   more).

*Request:*

We would greatly appreciate any insights or suggestions you may have to
improve the performance of our SHS and prevent these timeouts. Here are
some areas we're particularly interested in exploring:

   - Are there additional configuration options we should consider for
   handling large event logs?
   - Could Nginx ingress configuration adjustments help with the timeouts?
   (A sketch of the change we have in mind follows this list.)
   - Are there best practices for optimizing SHS performance on EKS?
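
To make the Nginx question concrete, the kind of adjustment we have been
considering is raising the proxy timeouts on the ingress-nginx resource,
along the lines below. The host name is a placeholder and the timeout
values are examples we have not yet validated; 18080 is the default SHS
port.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: spark-history-server
  annotations:
    # ingress-nginx proxy timeouts, in seconds (example values)
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
    - host: shs.example.com   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: spark-history-server
                port:
                  number: 18080

We are unsure whether longer timeouts alone address the root cause, which
is why we are also asking about SHS-side options above.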

We appreciate any assistance you can provide.

Thank you for your time and support.

Sincerely,
-- 

Vikas Tharyani

Associate Manager, DevOps

Nielsen

www.nielsen.com <https://global.nielsen.com/>

