Both 2 and 3 are pretty good topics for a master's project, I think.

You can also look into improving Spark's scheduler throughput. A couple of
years ago Kay measured it, but things have changed since then. It would be
great to start with measurement, then look at where the bottlenecks are, and
see how we can improve them.
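
If you go that route, the measurement could start with a micro-benchmark
along these lines (a minimal sketch in Scala; the local[4] master URL and
the task count are illustrative, and the no-op task body is chosen so that
elapsed time approximates pure scheduling overhead rather than real work):

  import org.apache.spark.{SparkConf, SparkContext}

  object SchedulerThroughput {
    def main(args: Array[String]): Unit = {
      // Local mode keeps the benchmark self-contained; point the master
      // URL at a real cluster to measure the standalone/YARN path instead.
      val conf = new SparkConf()
        .setAppName("scheduler-throughput")
        .setMaster("local[4]")
      val sc = new SparkContext(conf)

      val numTasks = 100000  // many tiny tasks so scheduling dominates
      val start = System.nanoTime()
      // One partition per task; the task body does nothing, so the
      // elapsed time is mostly scheduler overhead.
      sc.parallelize(1 to numTasks, numTasks).foreach(_ => ())
      val elapsed = (System.nanoTime() - start) / 1e9

      println(f"$numTasks tasks in $elapsed%.1f s " +
        f"(${numTasks / elapsed}%.0f tasks/s)")
      sc.stop()
    }
  }

Running that at increasing task counts, and then profiling the driver while
it runs, should show where the time goes.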


On Sat, Aug 29, 2015 at 10:52 AM, Сергей Лихоман <sergliho...@gmail.com>
wrote:

> Hi guys!
>
> I am going to make a contribution to Spark, but I don't have much
> experience using it under high load, so I would appreciate any help
> pointing out scalability or performance issues that could be researched
> and resolved.
>
> I have several ideas:
> 1. Node HA (this seems to be resolved in Spark, but maybe someone knows of
> existing problems)
> 2. Improve data distribution between nodes (analyze queries and
> automatically suggest a data distribution that improves performance)
> 3. Geo-distribution, but is it relevant?
>
> This will be my master's degree project, so please help me select the
> right improvement.
>
> Thanks in advance!
>
