Thanks for looking into this, Sean! Loved the tl;dr.

On Fri, Aug 31, 2018 at 12:28 PM Sean Owen <sro...@gmail.com> wrote:

> TL;DR - We already had the fix from SPARK-5984. The delta from the current
> JDK implementation to Spark's looks actually inconsequential. No action
> required AFAICT.
>
> On Fri, Aug 31, 2018 at 12:30 PM Sean Owen <sro...@gmail.com> wrote:
>
>> I looked into this, because it sure sounds like a similar issue from a
>> few years ago that was fixed in
>> https://issues.apache.org/jira/browse/SPARK-5984
>> The change in that JIRA actually looks almost identical to the change
>> mentioned in the JDK bug:
>> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239
>>
>> Reading the paper
>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf 
>> in
>> section 5 a little more, I think they are saying that there were two ways
>> to fix the original problem: a) fix the invariant, or b) increase some data
>> structure size. Java did the latter, it seems, and now they've shown it's
>> actually still busted. However Python and the original paper did the
>> former, and we seem to have copied that fix from
>> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
>>  My
>> understanding is that this still works, and is what Java *now* implements.
>>
>> The only difference I can see in implementation is an extra check for a
>> negative array index before dereferencing array[n]. We can add that for
>> full consistency with the JVM change, I suppose. I don't think it's
>> relevant to the problem reported in the paper, but could be an issue
>> otherwise. The JVM implementation suggests it thinks this needs to be
>> guarded.
>>
>> I did hack together a crude translation of the paper's bug reproduction
>> at http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some
>> Spark test code. It does need a huge amount of memory to run (>32g.. ended
>> up at 44g) so not so feasible to put in the test suite. Running it on Spark
>> master nets a JVM crash:
>>
>> # Problematic frame:
>> # J 10195 C2
>> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
>> (198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342]
>>
>> Thats... not good, but I can't tell if it's really due to this same issue
>> or something else going on in the off-heap code. Making the tiny change I
>> mentioned above doesn't do anything.
>>
>> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> “As a byproduct of our study, we uncover a bug in the Java
>>> implementation that can cause the sorting method to fail during the
>>> execution.”
>>>
>>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>>>
>>> This might impact Spark since we took the Java based TimSort
>>> implementation. I have seen in the wild TimSort failing in the past. Maybe
>>> this is the cause.
>>>
>>

Reply via email to