We did consider this problem when designing the Collector API; for example, it would have been nice if we could have a `toArray()` collector that had all the optimizations of `Stream::toArray`.

When we looked into it, we found a number of concerning details that caused us to turn back (many of which you've already identified): the difficulty of managing parallelism, the intrusion into the API, etc.  In the end, all that additional complexity was basically in aid of only a few use cases -- such as collecting into a pre-sized ArrayList.  Where are the next hundred use cases for such a mechanism that would justify the incremental API complexity?  We didn't see them, but maybe there are some.

A less intrusive API direction might be a version of Collector whose supplier function took a size-estimate argument; this might even help in parallel since it allows for intermediate results to start with a better initial size guess.  (And this could be implemented as a default that delegates to the existing supplier.)  Still, not really sure this carries its weight.
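
To make that shape concrete, here is a rough sketch (the interface and
method name are hypothetical, not a proposed spelling):

import java.util.function.Supplier;
import java.util.stream.Collector;

// Hypothetical: a Collector variant with a size-hinted supplier whose
// default ignores the hint, so existing collectors keep working unchanged.
interface SizeHintedCollector<T, A, R> extends Collector<T, A, R> {
    // sizeEstimate would come from the stream's spliterator (estimateSize());
    // a negative value would mean "no useful estimate".
    default Supplier<A> supplier(long sizeEstimate) {
        return supplier();
    }
}

Only collectors that care about the hint would override the new method;
everything else is unaffected.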

  The below code returns false, for example (is this a bug?):

Stream.of(1, 2, 3).parallel().map(i -> i + 1)
    .spliterator().hasCharacteristics(Spliterator.CONCURRENT)

Not a bug.  The `Stream::spliterator` method (along with `iterator`) is provided as an "escape hatch" for operations that need to get at the elements but cannot easily be expressed with the Stream API.  It makes a good-faith attempt to propagate a reasonable set of characteristics (for a stream with no intermediate ops, it delegates to the underlying source for its spliterator), but since a `Stream` is not in fact a data structure, once there is nontrivial computation layered on the actual source, a relatively bare-bones spliterator is provided.
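
For instance (as observed on current JDKs; the exact characteristics the
wrapping spliterator reports are not something to lean on):

import java.util.Spliterator;
import java.util.stream.Stream;

public class SpliteratorCharacteristicsDemo {
    public static void main(String[] args) {
        // No intermediate ops: spliterator() delegates to the source, so the
        // array spliterator's characteristics (e.g. SIZED) show through.
        Spliterator<Integer> fromSource = Stream.of(1, 2, 3).spliterator();
        System.out.println(fromSource.hasCharacteristics(Spliterator.SIZED));   // true

        // With an intermediate op, a wrapping spliterator is returned; it does
        // not promise to carry every source characteristic through.
        Spliterator<Integer> wrapped =
                Stream.of(1, 2, 3).parallel().map(i -> i + 1).spliterator();
        System.out.println(wrapped.hasCharacteristics(Spliterator.CONCURRENT)); // false
    }
}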

While this could probably be improved in specific cases, the return on effort (and risk) is likely to be low, because `Stream::spliterator` is already an infrequently used method, and it would only matter in a small fraction of those cases.  So you're in "corner case of a corner case" territory.


On 2/23/2019 5:27 PM, August Nagro wrote:
Stream.collect(Collector) is a popular terminal stream operation. But because
the collect methods expose none of the stream's characteristics to the
collector, collectors are not as efficient as they could be.

For example, consider a non-parallel, sized stream that is to be collected
into a List. This is a very common case for streams with a Collection source.
Given those characteristics, the Collector.supplier() could create a list with
the right initial capacity (the merge function will never be called for a
sequential stream), but the current API gives the collector no way to learn
the size.
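
To illustrate the gap: pre-sizing is trivially expressible today with
Collectors.toCollection, but only when the caller already has the size in
hand; nothing carries the spliterator's size to the collector automatically.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class PreSizedCollect {
    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3, 4, 5);

        // The source's SIZED spliterator already knows this number, but the
        // collector only sees it because we pass it along by hand.
        int knownSize = source.size();

        List<Integer> result = source.stream()
                .map(i -> i + 1)
                .collect(Collectors.toCollection(
                        () -> new ArrayList<Integer>(knownSize)));
        System.out.println(result);   // [2, 3, 4, 5, 6]
    }
}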

I should note that the characteristics relevant to collectors are the ones
defined by Spliterator: Spliterator::characteristics,
Spliterator::estimateSize, and Spliterator::getExactSizeIfKnown.

One way this enhancement could be implemented is by adding a method
Stream.collect(Function<ReadOnlySpliterator, Collector> collectorBuilder).
ReadOnlySpliterator would declare the inspection methods mentioned above, and
Spliterator would be retrofitted to implement it.

For example, here is a gist with what Collectors.toList could look like:
https://gist.github.com/AugustNagro/e66a0ddf7d47b4f11fec8760281bb538
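
Sketching the general shape (this is only an illustration of the idea, not
the gist's exact code; all of the names below are hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical read-only view; Spliterator would be retrofitted to
// implement it.
interface ReadOnlySpliterator {
    int characteristics();
    long estimateSize();
    long getExactSizeIfKnown();
}

// The proposed overload on Stream<T> would then be roughly:
//   <R, A> R collect(Function<ReadOnlySpliterator,
//                             Collector<? super T, A, R>> collectorBuilder);

// ...and a characteristics-aware Collectors.toList written against it:
class SizeAwareCollectors {
    static <T> Function<ReadOnlySpliterator, Collector<T, ?, List<T>>> toList() {
        return split -> {
            long exact = split.getExactSizeIfKnown();     // -1 when unknown
            if (exact < 0 || exact > Integer.MAX_VALUE) {
                return Collectors.toList();               // today's behavior
            }
            return Collector.<T, List<T>>of(
                    () -> new ArrayList<>((int) exact),   // pre-sized list
                    List::add,
                    (left, right) -> { left.addAll(right); return left; });
        };
    }
}

The sequential, sized case then gets a pre-sized list, and everything else
falls back to the current behavior.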

ReadOnlySpliterator may need to be replaced with some stream-specific
abstraction, however, since Stream.spliterator() does not report the expected
characteristics. The code below returns false, for example (is this a bug?):

Stream.of(1, 2, 3).parallel().map(i -> i + 1)
    .spliterator().hasCharacteristics(Spliterator.CONCURRENT)

Looking forward to your thoughts,

- August Nagro
