On 5/4/20 6:44 PM, Alan Snyder wrote:
What problem are you trying to solve?

The Collections framework explicitly does not support custom membership semantics. If you think it should, file an RFE and create a JEP. It’s not a small change and tinkering is not the way to get there.

There are already three different kinds of sets in the JDK that support different membership semantics: sets derived from IdentityHashMap, ordinary sets like HashSet, and the SortedSet/NavigableSet family. Arguably the last already supports custom membership semantics, as it's possible for callers to provide their own comparators. I'm trying to fix semantic bugs in the way various collection operations handle situations with mixed membership semantics, and secondarily, to avoid pathological performance problems that have arisen.
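To make the three membership semantics concrete, here's a small illustrative sketch using only standard JDK classes (class name is mine):

```java
import java.util.*;

public class MembershipDemo {
    public static void main(String[] args) {
        // A SortedSet whose membership is defined by its comparator,
        // not by equals(): "HELLO" and "hello" are the same element here.
        NavigableSet<String> ci = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        ci.add("hello");
        System.out.println(ci.contains("HELLO")); // true: comparator-based membership

        // An ordinary HashSet uses equals()/hashCode().
        Set<String> hs = new HashSet<>();
        hs.add("hello");
        System.out.println(hs.contains("HELLO")); // false: equals-based membership

        // A set derived from IdentityHashMap compares by reference.
        Set<String> ids = Collections.newSetFromMap(new IdentityHashMap<>());
        ids.add(new String("hello"));
        ids.add(new String("hello"));
        System.out.println(ids.size()); // 2: two distinct instances
    }
}
```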

If your present concern is performance surprises, be aware that your new proposal has the same problem as the old one. Although it removes some data-dependent performance surprises, it adds a potential JDK-dependent performance surprise. The data-dependent performance issues can be detected by testing, but no amount of testing can alert a developer to the possibility that an unexpected implementation change in a future JDK might cause a big performance hit.

You may have misremembered the performance problem that I am concerned about. It involves using removeAll to remove a relatively small number of elements from a large, hash-based collection. The performance hit is the need to iterate over the entire collection and test every element, regardless of the number of elements being removed. Although the performance problem may be exacerbated when the argument is a List, the problem exists for any collection that is much smaller than the target collection.

You're conflating two different parts of the performance issue.

This is illustrated in an article that Jon Skeet posted back in 2010, [1] which is linked from JDK-6982173. Briefly, Skeet observed that a removeAll using a List of 300,000 elements could take nearly 3 minutes, whereas iterating a HashSet of 1,000,000 elements would take only 38ms.

(These numbers are from 2010, and hardware is certainly different today, and these aren't rigorous benchmarks. However, an informal benchmark that shows the difference between 3 minutes and 38ms is a pretty clear demonstration of a performance problem.)
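For readers who want to see the shape of the problem, here's a small-scale sketch (not a benchmark; sizes are tiny so it runs instantly):

```java
import java.util.*;

public class RemoveAllShape {
    public static void main(String[] args) {
        // HashSet inherits AbstractSet.removeAll, which iterates the
        // *receiver* when the argument is at least as large, calling
        // list.contains() once per element. Each contains() on an
        // ArrayList is O(list size), so the whole call is O(M*N).
        Set<Integer> set = new HashSet<>();
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) set.add(i);
        for (int i = 0; i < 2_000; i++) list.add(i); // larger than the set

        set.removeAll(list); // iterates set, probes the list with linear scans
        System.out.println(set.isEmpty()); // true
    }
}
```

At Skeet's sizes (a 300,000-element list against a 1,000,000-element set), those linear scans are what turn milliseconds into minutes.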

Taking 3 minutes for this kind of operation is clearly pathological behavior, which is what I'm trying to address. Although it seems impossible to prevent it from ever happening, putting some special cases for handling List into places such as HashSet.removeAll would seem to cover most of the common cases.
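A sketch of what such a special case might look like (the helper name is mine, and this is not the actual JDK patch):

```java
import java.util.*;

public class RemoveAllHeuristic {
    // Hypothetical helper sketching the "instanceof List" special case.
    static <E> boolean removeAllFrom(Set<E> set, Collection<?> c) {
        if (c instanceof List) {
            // contains() on a List is typically linear, so iterate the
            // argument and do O(1) hash removes instead of probing the list.
            boolean modified = false;
            for (Object e : c)
                modified |= set.remove(e);
            return modified;
        }
        return set.removeAll(c); // default path for non-List arguments
    }

    public static void main(String[] args) {
        Set<Integer> set = new HashSet<>(List.of(1, 2, 3, 4, 5));
        System.out.println(removeAllFrom(set, List.of(2, 4, 9))); // true
        System.out.println(new TreeSet<>(set)); // [1, 3, 5]
    }
}
```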

It's true that if you're removing a small set from a large set, iterating the "wrong" set might take 38ms instead of a much smaller time (probably microseconds). This would indeed be a performance regression. (It might also be an improvement in correctness, if the sets have different membership contracts.)
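The correctness point can be demonstrated directly: which set's membership contract "wins" depends on which operand is iterated. A small sketch (forcing each path by hand):

```java
import java.util.*;

public class SemanticsDemo {
    public static void main(String[] args) {
        Set<String> target = new HashSet<>(List.of("a", "A"));
        Set<String> arg = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        arg.add("a");

        // Path 1: iterate the target, ask arg.contains(e). The argument's
        // case-insensitive comparator matches both "a" and "A".
        Set<String> t1 = new HashSet<>(target);
        t1.removeIf(arg::contains);
        System.out.println(new TreeSet<>(t1)); // []

        // Path 2: iterate the argument, call target.remove(e). The target's
        // equals()-based membership matches only "a", so "A" survives.
        Set<String> t2 = new HashSet<>(target);
        for (String e : arg) t2.remove(e);
        System.out.println(new TreeSet<>(t2)); // [A]
    }
}
```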

The fact is that there are performance regressions from one JDK release to the next. Sometimes they're introduced by accident, and we try to address these where possible. Sometimes they're introduced intentionally, as part of various tradeoffs. That's what's going on here. I'm improving the correctness of the system and avoiding pathological performance problems, while introducing a performance regression that seems modest relative to the pathological performance issue that's being mitigated.

s'marks

[1] https://codeblog.jonskeet.uk/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza/


   Alan

On May 4, 2020, at 5:25 PM, Stuart Marks <stuart.ma...@oracle.com> wrote:



On 5/1/20 10:41 PM, Jason Mehrens wrote:
1. I assume you are using "c instanceof List" instead of "!(c instanceof Set)" to correctly handle IdentityHashMap.values()?  The instanceof List check seems like a safe choice, but it is too bad we can still fool that check by wrapping a List as an unmodifiableCollection.  If spliterator().characteristics() could tell us that the collection used identity comparisons, I think we would be able to determine whether it was safe to swap the ordering in the general case, as we could check for IDENTITY, SORTED, and comparator equality.

I'm using "instanceof List", not because of IdentityHashMap.values() specifically (though that's a good example), but mainly to try to be minimal. While I think getting the semantics right takes priority, the change does impact performance, so I want to reintroduce the performance heuristic in the safest manner possible. Checking for !Set seems dangerous, as there might be any number of Collections out there that aren't Sets and that aren't consistent with equals. Checking for instanceof List seemed like the safest and most minimal thing to do.

In fact, I'm not even sure how safe List is! It's certainly possible for someone to have a List that isn't consistent with equals. Such a thing might violate the List contract, but that hasn't stopped people from implementing such things (outside the JDK).

I also toyed around with various additional tests for when it would be profitable to switch iteration to the smaller collection. This could be applied to a variety of consistent-with-equals set implementations in the JDK. The benefit of swapping the iteration is more modest in these cases than for List, however. Avoiding repeated List.contains() calls avoids quasi-quadratic (O(M*N)) performance, whereas swapping the iteration order of two sets gets us only the smaller of O(M) vs. O(N), which is still linear.
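For two hash sets, the swap looks roughly like this (hypothetical helper; it assumes both sets are consistent with equals, which is exactly what the heuristic has to establish before it is safe):

```java
import java.util.*;

public class SwapIteration {
    // Hypothetical helper; assumes equals-based membership on both sides.
    static <E> boolean removeAllSwapped(Set<E> target, Set<?> arg) {
        if (arg.size() < target.size()) {
            // Iterate the smaller argument: min(M, N) hash removes.
            boolean modified = false;
            for (Object e : arg)
                modified |= target.remove(e);
            return modified;
        }
        // Otherwise iterate the target and probe the argument: O(M).
        return target.removeIf(arg::contains);
    }

    public static void main(String[] args) {
        Set<Integer> target = new HashSet<>(List.of(1, 2, 3, 4, 5));
        System.out.println(removeAllSwapped(target, Set.of(2, 4))); // true
        System.out.println(new TreeSet<>(target)); // [1, 3, 5]
    }
}
```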

Also, as you noted, this heuristic is easily defeated by things like the collection wrappers.
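The wrapper problem is easy to demonstrate: the wrapper isn't a List, so the heuristic no longer fires, even though contains() on it is still linear.

```java
import java.util.*;

public class WrapperDefeat {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        Collection<Integer> wrapped = Collections.unmodifiableCollection(list);
        // unmodifiableCollection returns a Collection, not a List,
        // so an "instanceof List" check is defeated by the wrapper.
        System.out.println(list instanceof List);    // true
        System.out.println(wrapped instanceof List); // false
    }
}
```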

2. Should code applied to HashSet.removeAll also be applied to HashMap.keySet().removeAll and HashMap.entrySet().removeAll?  Collections::newSetFromMap will end up having different behavior if it is not consistently applied.

I think the *results* will be consistent, but the *performance* might not be consistent.

But these cases could result in pathological performance if removeAll(list) is called, so I'm a bit concerned about them. If we agree that "instanceof List" is a reasonable heuristic, then I don't have any objection in principle to adding it to these locations as well. But I want to be careful about sprinkling ad hoc policies like this around the JDK.

I note that the erroneous size-based optimization was copied into, and therefore needs to be removed from, ConcurrentSkipListSet (JDK-8218945) and the set views of CHM (JDK-8219069). I'm more concerned about getting these cleaned up in the short term.

3. Not to derail this thread but do think we need a new JIRA ticket to address Collections.disjoint?  Seems like it has similar sins of calling size and using "!(c2 instanceof Set)" but I don't want to muddy the waters by covering any solutions to that method in this thread.

Yeah, I'm not sure what to do about Collections.disjoint().

Note that there are some statements in bug reports (possibly even made by me!) that assert that Collections.disjoint should be consistent with Set.removeAll. I don't think that's the case. As discussed elsewhere, removeAll() needs to be consistent about relying on the membership semantics of the argument collection.

As Collections.disjoint currently stands, it has the big "Care must be exercised" disclaimer in its spec, and it goes to some length to make a bunch of tests and vary the iteration accordingly. The programmer can get a speedup using disjoint() compared to copying a set and then calling retainAll(), provided sufficient care is taken. Maybe that's an acceptable tradeoff.
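A small sketch of the tradeoff described above, comparing the copy-then-retainAll approach with disjoint():

```java
import java.util.*;

public class DisjointDemo {
    public static void main(String[] args) {
        Set<String> a = Set.of("x", "y");
        Set<String> b = Set.of("y", "z");

        // Copy-then-retainAll allocates a temporary set:
        Set<String> tmp = new HashSet<>(a);
        tmp.retainAll(b);
        System.out.println(tmp.isEmpty()); // false: they share "y"

        // disjoint() answers the same question without the copy,
        // subject to the "care must be exercised" caveats in its spec:
        System.out.println(Collections.disjoint(a, b));           // false
        System.out.println(Collections.disjoint(a, Set.of("z"))); // true
    }
}
```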

If you have any ideas about how disjoint()'s spec or implementation could be improved, you could file a JIRA issue, or I could file one if you sent me the info.

s'marks

