On the 0x50E day of Apache Harmony Aleksey Shipilev wrote:
> Hi, Egor!
>
> Your thoughts are truly pessimistic, like those of everyone who has
> developed at least one compiler. Of course, there's no silver bullet;
> there's no system where you can press the big red button and it will
> tell you where the bugs are :)
>
> The whole thing about that fuzz testing is:
> a. Yes, there can be false positives.
> b. Yes, there can be plenty of false positives.
> c. Somewhere beneath that stack, real issues are hiding.
>
> The problem is, no matter how we approach automated testing of the
> compiler, any testing results will carry nearly the same amount of
> garbage on top of the real issues.
>
> If you make the search random, you have the whole search space to
> track: 200 boolean params effectively produce 2^200 possible tuples.
> What these results are good for is that they focus on near-optimal
> configurations, so we needn't scratch our heads over whether to take
> care of a configuration that lies far away from optimal. Again, there
> can be lots of garbage in those tests, but 5,400+ is a number I can
> live with, unlike 2^200. Having this few tests enables me to actually
> tackle them, without needing another young-looking Universe to run
> them in. <g>
>
> But the discussion is really inspiring, thanks! The point of
> contributing those tests was the impression that JIT developers are
> crying out for tests and bugs to fix. Ian Rogers from JikesRVM had
> asked me to contribute the failure reports for JikesRVM, solely for
> testing the deep dark corners of RVM, so I extrapolated the same
> intention to Harmony. I certainly underestimated the failure rate for
> RVM and Harmony, and now I have to think about how to get some worth
> out of that pile of crashed configurations. For now, I have just
> disclosed them to the community without clear thoughts on what to do
> next. Nevertheless, in the background we are all thinking about what
> to do.
>
> Please don't take offense :) I know perfectly well that the tests have
> to go through human-assisted post-processing, that there is a lot of
> garbage, and that there are lots of implications and complications
> around. I also suspect that this kind of work is like running ahead of
> the train. But anyway, the work is done; it was an auxiliary result,
> so we can just dump it -- but can we make any use of it?
offense? why? :) I really enjoy the conversation.

> There's an excellent idea with re-testing those issues in debug mode,
> to make a clearer taxonomy of the crashes. Though it's not related to
> my job and thesis anymore, I also have an idea of how to sweep the
> tests and make them more fine-grained: introduce a similarity metric
> and search for the nearest non-failing configuration. Any other ideas?

Aleksey, is there a combined solution, where I push the red button and
it makes the silver bullet fire? :)

Your argument about these being 'near' the most effective
configurations is interesting indeed. And the result is interesting
overall. True. Big respect, etc.

My concern is: is it effective to look through the configurations one
by one to find issues in this compiler? The sheer number of false
positives really worries me. It seems that without traversing tens or
even hundreds of emconfs by hand it is hard to find something valuable.
For Jikes the situation might be completely different, so Jikes is not
an argument here :)

However, there is one idea: why are you classifying the configurations
based only on the end-result status? Those clusters are obviously too
big. I would also take the configurations themselves as a parameter for
clustering the failures. Yes, you'll need a fair amount of
machine-learning effort to cluster them, but that may pay off really
well. Looking into the trained model (a BDD, say) could give insight
into what the optpass compatibility rules are. Or it could show that
the rules are too complicated (which will also be the case if you
overtrain the model, ha ha). The latter idea might look like a purple
button, however.. :)

--
Egor Pasko
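
For illustration, a minimal Python sketch of what the "nearest
non-failing configuration" sweep and configuration-based clustering
could look like. It assumes each fuzzing report stores the boolean flag
vector that was run plus a coarse status; the helper names and the toy
4-flag data are hypothetical, not part of any Harmony tooling.

from collections import defaultdict

def hamming(a, b):
    # Similarity metric: number of flags on which two configurations differ.
    return sum(x != y for x, y in zip(a, b))

def nearest_passing(failing_cfg, passing_cfgs):
    # Find the closest non-failing configuration and the set of differing
    # flags; that delta is the "suspect set" worth re-testing first.
    best = min(passing_cfgs, key=lambda p: hamming(failing_cfg, p))
    delta = [i for i, (x, y) in enumerate(zip(failing_cfg, best)) if x != y]
    return best, delta

def cluster_failures(failures, passing_cfgs):
    # Group failures by (status, suspect flag set): two crashes that reduce
    # to the same few differing flags are likely the same optpass bug, even
    # if their raw flag tuples look nothing alike.
    clusters = defaultdict(list)
    for status, cfg in failures:
        _, delta = nearest_passing(cfg, passing_cfgs)
        clusters[(status, tuple(delta))].append(cfg)
    return clusters

# Hypothetical toy data: 4 flags instead of 200; both crashes reduce to
# flag 1 and therefore end up in the same cluster.
passing = [(1, 0, 1, 0), (0, 0, 1, 1)]
failing = [("crash", (1, 1, 1, 0)), ("crash", (0, 1, 1, 1))]
print(cluster_failures(failing, passing))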

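One speculative reading of the trained-model/BDD remark: fit an
interpretable classifier on (flag vector -> pass/fail) and read the
learned splits off as approximate optpass compatibility rules. The
sketch below uses scikit-learn's decision tree as a stand-in for a BDD;
the function name, flag names, and toy data are made up for
illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

def learn_compatibility_rules(configs, failed, flag_names, max_depth=4):
    # configs: boolean flag vectors; failed: 1 for failing runs, 0 for
    # passing ones.  A shallow tree keeps the rules human-readable; if even
    # a deeper tree cannot separate the classes, that in itself suggests the
    # real constraints are more complicated (or the data is too noisy).
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(configs, failed)
    return export_text(tree, feature_names=list(flag_names))

# Hypothetical toy data: runs fail exactly when opt_a and opt_b are both on.
configs = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 1]]
failed  = [0, 0, 0, 1, 1]
print(learn_compatibility_rules(configs, failed, ["opt_a", "opt_b"]))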