If you look at the recurrent issues in datacentre-scale computing systems, two stand out:
- resilience to failure: that's algorithms and the layers underneath (storage, work allocation & tracking ...)
- scheduling: maximising resource utilisation while prioritising high-SLA work (interactive things, HBase) on a mixed-workload cluster

Scheduling is where you get to use phrases like "APX-hard" in papers attached 
to JIRAs and not only not scare people, you may actually get feedback. In large 
clusters (and we are talking 10K+ node clusters these days), fractional 
improvements in cluster utilisation are big enough to show up in the 
spreadsheets of the people who sign off on cluster purchases. But for the same 
reason, the maintainers of the big schedulers (e.g. the YARN ones) are usually 
pretty reluctant to take patches. Focusing on scheduling within an app is 
likely to be more tractable.
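For a concrete flavour of the "within an app" angle: Spark already exposes a 
fair-scheduler mode with per-job pools, so experiments at this level don't need 
a single YARN patch. A minimal sketch follows; the pool names, job sizes and 
local master are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of within-app scheduling: submit jobs from different threads,
// tagged with fair-scheduler pools, so high-SLA work isn't starved by a big
// batch job. Pool names ("batch", "interactive") are illustrative only.
object PoolSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")                       // local mode, just for the sketch
      .setAppName("pool-scheduling-sketch")
      .set("spark.scheduler.mode", "FAIR")         // enable fair scheduling within the app
    val sc = new SparkContext(conf)

    // Large batch job, submitted from its own thread and tagged "batch".
    val batch = new Thread(new Runnable {
      override def run(): Unit = {
        sc.setLocalProperty("spark.scheduler.pool", "batch")
        sc.parallelize(1 to 1000000, 200).map(_ * 2).count()
      }
    })

    // Small interactive-style query from another thread, tagged "interactive";
    // in FAIR mode its tasks share the executors rather than queue behind the batch job.
    val interactive = new Thread(new Runnable {
      override def run(): Unit = {
        sc.setLocalProperty("spark.scheduler.pool", "interactive")
        println(sc.parallelize(1 to 1000, 4).sum())
      }
    })

    batch.start(); interactive.start()
    batch.join(); interactive.join()
    sc.stop()
  }
}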


HA is one of those really-hard-to-get-right problems. If you like your Lamport 
papers and enjoy the TLA+ toolchain, it's the one to go for. The best tactic 
here is to start with the work of others, which outside of Google means 
"ZooKeeper"; understanding its proof would be a good way into that work.

Other topics:
- work optimisation (e.g. partitioning, placement, ordering of operations)
- self-tuning & cluster operations optimisation: can logs & live monitoring be 
used to improve system efficiency? The fact that logs are themselves large 
datasets means you get to use the analysis layer to introspect on past work 
(see the sketch below). Cluster ops isn't usually viewed as a CS problem, more 
as one of those implementation details.
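To make the "logs are themselves a dataset" point concrete, here is a minimal 
sketch that loads Spark's own JSON event logs back into Spark and summarises 
them; the log directory is a placeholder and the event-log field names vary 
across Spark versions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch of introspecting on past work: Spark's event logs (written
// when spark.eventLog.enabled=true) are one JSON event per line, so the
// analysis layer can read them straight back.
object EventLogIntrospection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("event-log-introspection"))
    val sqlContext = new SQLContext(sc)

    // Placeholder path to an event-log directory.
    val events = sqlContext.read.json("hdfs:///spark-history/app-2015*")
    events.registerTempTable("events")

    // How much of the application was task churn vs. job/stage bookkeeping?
    sqlContext.sql(
      "SELECT `Event`, COUNT(*) AS n FROM events GROUP BY `Event` ORDER BY n DESC"
    ).show()

    sc.stop()
  }
}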

Finally, and this isn't in the software stack itself, but something that'd be 
used to test everything (and, again, uses the tooling): a way to radically 
improve how we test these systems and understand the test results.
http://steveloughran.blogspot.co.uk/2015/05/distributed-system-testing-where-now.html

A challenge there would actually be getting your supervisors to recognise the 
problem and accept that it's worth you putting in the effort. Fault injection & 
failure simulation is something to consider here; it hooks up to HA nicely. 
Look at the Jepsen work as an example: https://aphyr.com/posts
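As a small flavour of what application-level fault injection can look like, 
here is a minimal sketch that forces the first attempt of every task to fail 
and checks the job still gets the right answer; the master URL and failure mode 
are purely illustrative (real fault injection would also kill executors, delay 
RPCs, partition the network, and so on, which is the territory Jepsen covers).

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Minimal fault-injection sketch: every task throws on its first attempt, so
// the job only succeeds if Spark's task-retry machinery works correctly.
object FaultInjectionSketch {
  def main(args: Array[String]): Unit = {
    // "local[4,4]": four threads, and up to four failures per task, so local
    // mode will actually retry the injected failures.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4,4]").setAppName("fault-injection-sketch"))

    val sum = sc.parallelize(1 to 100, 10).map { i =>
      if (TaskContext.get().attemptNumber() == 0) {
        throw new RuntimeException("injected failure on first attempt")
      }
      i
    }.sum()

    assert(sum == 5050, s"expected 5050, got $sum")   // correct answer despite failures
    sc.stop()
  }
}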

-Steve


On 30 Aug 2015, at 02:42, Reynold Xin <r...@databricks.com> wrote:

Both 2 and 3 are pretty good topics for a master's project, I think.

You can also look into how one can improve Spark's scheduler throughput. A 
couple of years ago Kay measured it, but things have changed since. It would be 
great to start with measurement, then look at where the bottlenecks are, and 
see how we can improve it.
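For what it's worth, a crude version of that measurement can be done from user 
code: run a job made of many near-empty tasks, so that almost all of the 
elapsed time is scheduler overhead rather than useful work. A sketch follows; 
the task count is illustrative, and a serious study would also vary cluster 
size, locality and task size.

import org.apache.spark.{SparkConf, SparkContext}

// Rough scheduler-throughput measurement: time a job of many empty tasks and
// report tasks scheduled per second.
object SchedulerThroughputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scheduler-throughput-sketch"))

    val numTasks = 100000
    val start = System.nanoTime()
    sc.parallelize(1 to numTasks, numTasks).count()   // one near-empty task per partition
    val elapsedSec = (System.nanoTime() - start) / 1e9

    println(f"$numTasks tasks in $elapsedSec%.1f s = ${numTasks / elapsedSec}%.0f tasks/s")
    sc.stop()
  }
}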


On Sat, Aug 29, 2015 at 10:52 AM, Сергей Лихоман <sergliho...@gmail.com> wrote:
Hi guys!

I am going to make a contribution to Spark, but I don't have much experience 
using it under high load, and I would be very grateful for any help pointing 
out scalability or performance issues that could be researched and resolved.

I have several ideas:
1. Node HA (seems like this is resolved in Spark, but maybe someone knows of 
existing problems...)
2. Improve data distribution between nodes (analyze queries and automatically 
suggest a data distribution to improve performance).
3. Think about geo-distribution, but is that relevant?

It will be a master's degree project. Please help me select the right improvement.

Thanks in advance!

