Adding more tests to spark-perf is a good idea. It would be great if it covered some of the ML algorithms, for example. In addition, for correctness, the test suites in core can also be enhanced. In particular, I'd like to make sure we're testing all methods in the RDD API in all of Java, Python, and Scala -- we recently found some methods that don't quite work in Java, for example.
Our JIRA is currently still at https://spark-project.atlassian.net/secure/MyJiraHome.jspa, but it's hopefully going to be imported into Apache really soon, so I'd recommend holding off on creating new issues for a bit to see if the import succeeds. This is the task to import it into Apache: https://issues.apache.org/jira/browse/INFRA-6419.

Matei

On Oct 12, 2013, at 2:44 PM, Christopher Nguyen <[email protected]> wrote:

> Perfect. This is a great start of what I'm looking for.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
> On Sat, Oct 12, 2013 at 2:31 PM, Mark Hamstra <[email protected]> wrote:
>
>> There is also spark-perf <https://github.com/amplab/spark-perf>.
>>
>> On Sat, Oct 12, 2013 at 2:22 PM, Christopher Nguyen <[email protected]> wrote:
>>
>>> Roman, an area I think would (a) have high impact, and (b) is relatively
>>> not well covered is performance analysis. I'm sure most teams are doing
>>> this internally at their respective companies, but there is no shared code
>>> base and shared wisdom about what we're finding/improving.
>>>
>>> For example, consider the task of loading a table from disk into memory by
>>> Shark. We're getting conflicting data about how much of this is cpu-bound
>>> vs I/O-bound. Our effort to track this down should be sharable somehow, and
>>> would benefit from others' findings. Of course this is dependent on the
>>> particular configuration, but there is a lot of test harness code/scripts
>>> that can be shared. And individual findings, even if/especially if they are
>>> conflicting, are very valuable if well documented.
>>>
>>> There is a Benchmark effort covered here
>>> https://amplab.cs.berkeley.edu/benchmark/, but it addresses a slightly
>>> different goal. You could consider this Perf-Analysis as part of that, or
>>> as its own effort.
>>>
>>> This may be more than you were looking to own, but given your stated
>>> enthusiasm :) I want to throw the idea out there.
>>>
>>> --
>>> Christopher T. Nguyen
>>> Co-founder & CEO, Adatao <http://adatao.com>
>>> linkedin.com/in/ctnguyen
>>>
>>> On Sat, Oct 12, 2013 at 1:48 PM, Роман Ткаленко <[email protected]> wrote:
>>>
>>>> Hello.
>>>> I'm trying to dive into Spark's sources on a deeper-than-mere-glance level
>>>> and I find beginning with writing unit tests a good way to do it. So,
>>>> basically, I'm wondering if there are points to which I could specifically
>>>> apply my enthusiasm, i.e. are there some un- or not enough covered parts
>>>> for which I could write some tests?
>>>> I'm wondering as well about the state of Apache-hosted JIRA for Spark - I
>>>> currently can't see any entry in there. Should I look for them in Github
>>>> mirror or still in the antecedent JIRA instance on
>>>> http://spark-project.atlassian.net/?
>>>> Regards,
>>>> Roman.
