Hi Team,

We are currently working on a POC based on Spark and Scala. We have to read 18 million records from a Parquet file and perform 25 user-defined aggregations based on grouping keys. We have used the Spark high-level DataFrame API for the aggregation. On a cluster of two nodes we could finish the end-to-end job (Read + Aggregation + Write) in 2 minutes.
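For context, the job follows the pattern sketched below. The paths, grouping keys, and the two sample aggregation expressions are placeholders rather than our actual schema, and the context is built with the tuning parameters listed further down:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object AggregationPOC {
  def main(args: Array[String]): Unit = {
    // Two of the tuning parameters from the list below, applied
    // programmatically; the rest are set the same way (or passed
    // via --conf on spark-submit).
    val conf = new SparkConf()
      .setAppName("ParquetAggregationPOC")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.sql.shuffle.partitions", "24")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read the ~18 million input records (path is a placeholder).
    val input = sqlContext.read.parquet("/data/input/records.parquet")

    // Group by the keys and compute the aggregations; the real job
    // has 25 such expressions (the two below are placeholders).
    val result = input
      .groupBy("key1", "key2")
      .agg(
        sum("amount").as("total_amount"),
        avg("latency").as("avg_latency")
      )

    // Write the aggregated result back out (path is a placeholder).
    result.write.mode("overwrite").parquet("/data/output/aggregated")

    sc.stop()
  }
}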
Cluster Information:
Number of Nodes: 2
Total Cores: 28
Total RAM: 128 GB

Component: Spark Core
Scenario: How-to

Tuning Parameters:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.default.parallelism 24
spark.sql.shuffle.partitions 24
spark.executor.extraJavaOptions -XX:+UseG1GC
spark.speculation true
spark.executor.memory 16G
spark.driver.memory 8G
spark.sql.codegen true
spark.sql.inMemoryColumnarStorage.batchSize 100000
spark.locality.wait 1s
spark.ui.showConsoleProgress false
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec

Please let us know if you have any ideas or tuning parameters we could use to finish the job in less than one minute.

Regards,
Pallavi