Thanks for looking into this, Sean! Loved the tl;dr.
On Fri, Aug 31, 2018 at 12:28 PM Sean Owen <sro...@gmail.com> wrote: > TL;DR - We already had the fix from SPARK-5984. The delta from the current > JDK implementation to Spark's looks actually inconsequential. No action > required AFAICT. > > On Fri, Aug 31, 2018 at 12:30 PM Sean Owen <sro...@gmail.com> wrote: > >> I looked into this, because it sure sounds like a similar issue from a >> few years ago that was fixed in >> https://issues.apache.org/jira/browse/SPARK-5984 >> The change in that JIRA actually looks almost identical to the change >> mentioned in the JDK bug: >> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239 >> >> Reading the paper >> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf >> in >> section 5 a little more, I think they are saying that there were two ways >> to fix the original problem: a) fix the invariant, or b) increase some data >> structure size. Java did the latter, it seems, and now they've shown it's >> actually still busted. However Python and the original paper did the >> former, and we seem to have copied that fix from >> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ >> My >> understanding is that this still works, and is what Java *now* implements. >> >> The only difference I can see in implementation is an extra check for a >> negative array index before dereferencing array[n]. We can add that for >> full consistency with the JVM change, I suppose. I don't think it's >> relevant to the problem reported in the paper, but could be an issue >> otherwise. The JVM implementation suggests it thinks this needs to be >> guarded. >> >> I did hack together a crude translation of the paper's bug reproduction >> at http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some >> Spark test code. It does need a huge amount of memory to run (>32g.. ended >> up at 44g) so not so feasible to put in the test suite. Running it on Spark >> master nets a JVM crash: >> >> # Problematic frame: >> # J 10195 C2 >> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I >> (198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342] >> >> Thats... not good, but I can't tell if it's really due to this same issue >> or something else going on in the off-heap code. Making the tiny change I >> mentioned above doesn't do anything. >> >> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <r...@databricks.com> wrote: >> >>> “As a byproduct of our study, we uncover a bug in the Java >>> implementation that can cause the sorting method to fail during the >>> execution.” >>> >>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/ >>> >>> This might impact Spark since we took the Java based TimSort >>> implementation. I have seen in the wild TimSort failing in the past. Maybe >>> this is the cause. >>> >>