If you look at the recurrent issues in datacentre-scale computing systems, two stand out:

- resilience to failure: that's algorithms and the layers underneath (storage, work allocation & tracking ...)
- scheduling: maximising resource utilisation while prioritising high-SLA work (interactive things, HBase) on a mixed-workload cluster
Scheduling is where you get to use phrases like "APX-hard" in papers attached to JIRAs and not only not scare people, you may actually get feedback. In large clusters (and we are talking 10K+ node clusters these days), fractional improvements in cluster utilisation are big enough to show up in the spreadsheets of the people who sign off on cluster purchases. But for the same reason, the maintainers of the big schedulers (e.g. the YARN ones) are usually pretty reluctant to let patches in. Focusing on scheduling within an app is likely to be more tractable.

HA is one of those really-hard-to-get-right problems. If you like your Lamport papers and enjoy the TLA+ toolchain, it's the one to go for. The best tactic here is to start with the work of others, which outside of Google means "Zookeeper"; understanding its proof would be a good way into that work.

Other topics:

- work optimisation (e.g. partitioning, placement, ordering of operations)
- self-tuning & cluster operations optimisation: can logs & live monitoring be used to improve system efficiency? The fact that logs are themselves large datasets means you get to use the analysis layer to introspect on past work. Cluster ops isn't viewed as a CS problem, more as one of those implementation details.

Finally, there's something which isn't in the software stack itself, but would be used to test everything, and, again, uses the tooling: radically improving how we test distributed systems and how we understand the test results
http://steveloughran.blogspot.co.uk/2015/05/distributed-system-testing-where-now.html

A challenge there would actually be getting your supervisors to recognise the problem and accept that it's worth you putting in the effort. Fault injection & failure simulation is something to consider here; it hooks up to HA nicely.
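As a minimal sketch of the fault-injection idea (plain Python; all names here are made up for illustration, not taken from any real injection framework), the trick is to wrap the operation under test so failures happen on a reproducible, seeded schedule:

```python
import random


class FaultInjector:
    """Calls an operation, injecting a failure with a fixed probability.

    A seeded RNG makes the failure schedule reproducible, so a test run
    that trips a bug can be replayed exactly.
    """

    def __init__(self, failure_rate, seed=42):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)
        self.injected = 0  # count of faults injected so far

    def call(self, op, *args):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return op(*args)


def call_with_retries(injector, op, attempts=5):
    """The failover logic under test: retry past transient (injected) faults."""
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(op)
        except ConnectionError as err:
            last_error = err
    raise last_error
```

An HA test then asserts the system still meets its invariants for every failure schedule it cares about; Jepsen applies the same idea at the level of network partitions between real processes.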
Look at the Jepsen work as an example: https://aphyr.com/posts

-Steve

On 30 Aug 2015, at 02:42, Reynold Xin <r...@databricks.com> wrote:

Both 2 and 3 are pretty good topics for a master's project, I think. You can also look into how one can improve Spark's scheduler throughput. A couple of years ago Kay measured it, but things have changed. It would be great to start with measurement, then look at where the bottlenecks are, and see how we can improve it.

On Sat, Aug 29, 2015 at 10:52 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:

Hi guys!

I am going to make a contribution to Spark, but I don't have much experience using it under high load and would appreciate any help in pointing out scalability or performance issues that could be researched and resolved. I have several ideas:

1. Node HA (seems like this is resolved in Spark, but maybe someone knows of existing problems..)
2. Improve data distribution between nodes (analyse queries and automatically suggest data distribution to improve performance).
3. Think about geo-distribution, though is that actually relevant?

It will be a master's degree project. Please help me to select the right improvement. Thanks in advance!
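On the "start with measurement" point: a toy harness like the following (plain Python against a thread pool, not Spark's scheduler; the function name is invented) shows the shape of the experiment. Submit a large batch of no-op tasks so that nearly all the elapsed time is submission, queuing and dispatch, then report tasks per second; the same measurement, pointed at a real scheduler, tells you where to start profiling.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_task_throughput(n_tasks=20000, workers=4):
    """Submit n_tasks no-op tasks and return tasks completed per second.

    Because the task bodies do nothing, the elapsed time is dominated by
    scheduling overhead rather than useful work.
    """
    def noop():
        pass

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(noop) for _ in range(n_tasks)]
        for f in futures:
            f.result()  # wait for every task to be dispatched and run
    elapsed = time.perf_counter() - start
    return n_tasks / elapsed
```

Re-running this while varying n_tasks and workers gives the baseline curve; the interesting research starts when the curve flattens and you go looking for the bottleneck.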