To Akhil's point, see the "Tuning Data Structures" section of that guide: avoid the standard Java collection classes such as java.util.HashMap inside your POJOs, since they add a lot of per-entry object overhead.
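To make that concrete, here is a purely illustrative sketch; I don't know what your CustomPOJO actually holds, so the field names below are made up. This is the kind of change that section is pointing at:

// Hypothetical "before": one java.util.HashMap per record. Boxed keys and
// values plus Entry objects inflate the cached RDD considerably.
class MetricsAsMap implements java.io.Serializable {
    java.util.HashMap<String, Double> metrics = new java.util.HashMap<String, Double>();
}

// Leaner "after": parallel primitive arrays (or a primitive-keyed map from
// fastutil/Trove if you really need map semantics). Field names are placeholders.
class MetricsAsArrays implements java.io.Serializable {
    String[] metricNames;   // or int codes into a shared dictionary of names
    double[] metricValues;  // unboxed, compact, cheap to serialize
}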
With fewer machines, try running 4 or 5 cores per executor and only 3-4 executors (1 per node): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. That ought to reduce the shuffle performance hit (can someone else confirm?).

Re #7: that 200 looks like a default shuffle-partition setting kicking in. Check spark.sql.shuffle.partitions (default: 200) and spark.default.parallelism, or pass an explicit numPartitions to your group-bys. Rough sketches of the executor sizing and the partition settings are at the bottom of this mail, below the quoted thread.

On Sun, Mar 29, 2015 at 7:57 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> Go through this once, if you haven't read it already.
> https://spark.apache.org/docs/latest/tuning.html
>
> Thanks
> Best Regards
>
> On Sat, Mar 28, 2015 at 7:33 PM, nsareen <nsar...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I'm facing performance issues with my Spark implementation, and while
>> briefly investigating the WebUI logs, I noticed that my RDD size is 55 GB,
>> the Shuffle Write is 10 GB and the Input Size is 200 GB. The application
>> is a web application which does predictive analytics, so we keep most of
>> our data in memory. This observation covers only 30 minutes of usage of
>> the application by a single user. We anticipate at least 10-15 users of
>> the application sending requests in parallel, which makes me a bit
>> nervous.
>>
>> One constraint we have is that we do not have too many nodes in the
>> cluster; we may end up with 3-4 machines at best, but they can be scaled
>> up vertically, each having 24 cores / 512 GB RAM etc., which can allow us
>> to make a virtual 10-15 node cluster.
>>
>> Even then, the input size and shuffle write are too high for my liking.
>> Any suggestions in this regard will be greatly appreciated, as there
>> aren't many resources on the net for handling performance issues such as
>> these.
>>
>> Some pointers on my application's data structures and design:
>>
>> 1) The RDD is a JavaPairRDD with the key being a CustomPOJO containing
>> 3-4 HashMaps and the value containing 1 HashMap.
>> 2) Data is loaded via JDBCRDD during application startup, which also
>> tends to take a lot of time, since we massage the data once it is fetched
>> from the DB and then save it as a JavaPairRDD.
>> 3) Most of the data is structured, but we are still using JavaPairRDD and
>> have not explored the option of Spark SQL yet.
>> 4) We have only one SparkContext, which caters to all the requests coming
>> into the application from various users.
>> 5) During a single user session the user can send 3-4 parallel stages
>> consisting of Map / Group By / Join / Reduce etc.
>> 6) We have to change the RDD structure using different types of group-by
>> operations, since the user can drill down / drill up through the data
>> (aggregation at a higher / lower level). This is where we make use of
>> groupBys, but there is a cost associated with them.
>> 7) We have observed that the initial RDDs we create have 40-odd
>> partitions, but after some stage executions like groupBys the partition
>> count increases to 200 or so. This was odd, and we haven't figured out
>> why it happens.
>>
>> In summary, we want to use Spark to give us the capability to process our
>> in-memory data structure very fast, as well as to scale to a larger
>> volume when required in the future.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
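As promised above, a minimal sketch of the executor sizing, assuming YARN and set through SparkConf (the same knobs are available as --num-executors / --executor-cores / --executor-memory on spark-submit). The concrete numbers are assumptions for 3-4 nodes with 24 cores / 512 GB each; tune them against your own workload:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SizingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("predictive-analytics")        // placeholder name
                .set("spark.executor.instances", "4")      // roughly 1 executor per node (YARN)
                .set("spark.executor.cores", "5")          // 4-5 cores per executor
                .set("spark.executor.memory", "48g")       // assumption: leave plenty of headroom per node
                .set("spark.default.parallelism", "40");   // e.g. roughly 2x total executor cores
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and cache your JavaPairRDD here ...
        sc.stop();
    }
}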
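And for the partition count itself, a sketch of passing numPartitions explicitly rather than relying on whatever default is in effect. The key/value types are stand-ins, since I don't know your real classes:

import org.apache.spark.api.java.JavaPairRDD;

public class GroupBySketch {
    // Stand-in types; substitute your CustomPOJO key and HashMap value.
    static JavaPairRDD<String, Iterable<Double>> regroup(JavaPairRDD<String, Double> metrics) {
        // An explicit numPartitions pins the post-shuffle partition count
        // instead of letting it jump to 200 (or any other default).
        return metrics.groupByKey(40);
    }
}

If the drill-up is really an aggregation, reduceByKey or aggregateByKey with the same numPartitions argument would also avoid materializing whole groups, but whether that fits depends on what the drill-down needs.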
--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd