We did consider this problem when designing the Collector API; for example, it would have been nice if we could have a `toArray()` collector that had all the optimizations of `Stream::toArray`.

When we looked into it, we found a number of concerning details that caused us to turn back (many of which you've already identified): the difficulty of managing parallelism, the intrusion into the API, etc.  In the end, all that additional complexity was basically in aid of only a few use cases -- such as collecting into a pre-sized ArrayList.  Where are the next hundred use cases for such a mechanism that would justify the incremental API complexity?  We didn't see them, but maybe there are some.

A less intrusive API direction might be a version of Collector whose supplier function took a size-estimate argument; this might even help in parallel since it allows for intermediate results to start with a better initial size guess.  (And this could be implemented as a default that delegates to the existing supplier.)  Still, not really sure this carries its weight.
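
To make that shape concrete, here is a rough sketch (the interface and
method name are hypothetical, not a proposed spelling):

import java.util.function.Supplier;
import java.util.stream.Collector;

// Hypothetical: a Collector variant with a size-hinted supplier whose
// default ignores the hint, so existing collectors keep working unchanged.
interface SizeHintedCollector<T, A, R> extends Collector<T, A, R> {
    // sizeEstimate would come from the stream's spliterator (estimateSize());
    // a negative value would mean "no useful estimate".
    default Supplier<A> supplier(long sizeEstimate) {
        return supplier();
    }
}

Only collectors that care about the hint would override the new method;
everything else is unaffected.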

  The below code returns false, for example (is this a bug?):

Stream.of(1, 2, 3).parallel().map(i -> i + 1)
    .spliterator().hasCharacteristics(Spliterator.CONCURRENT)

Not a bug.  The `Stream::spliterator` method (along with `iterator`) is provided as an "escape hatch" for operations that need to get at the elements but cannot easily be expressed with the Stream API.  It makes a good-faith attempt to propagate a reasonable set of characteristics (for a stream with no intermediate ops, it delegates to the underlying source for its spliterator), but since a `Stream` is not in fact a data structure, once there is nontrivial computation layered on the actual source, a relatively bare-bones spliterator is provided.
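
For instance (as observed on current JDKs; the exact characteristics the
wrapping spliterator reports are not something to lean on):

import java.util.Spliterator;
import java.util.stream.Stream;

public class SpliteratorCharacteristicsDemo {
    public static void main(String[] args) {
        // No intermediate ops: spliterator() delegates to the source, so the
        // array spliterator's characteristics (e.g. SIZED) show through.
        Spliterator<Integer> fromSource = Stream.of(1, 2, 3).spliterator();
        System.out.println(fromSource.hasCharacteristics(Spliterator.SIZED));   // true

        // With an intermediate op, a wrapping spliterator is returned; it does
        // not promise to carry every source characteristic through.
        Spliterator<Integer> wrapped =
                Stream.of(1, 2, 3).parallel().map(i -> i + 1).spliterator();
        System.out.println(wrapped.hasCharacteristics(Spliterator.CONCURRENT)); // false
    }
}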

While this could probably be improved in specific cases, the return on effort (and risk) is likely to be low, because `Stream::spliterator` is already an infrequently used method, and it would only matter in a small fraction of those cases.  So you're in "corner case of a corner case" territory.


On 2/23/2019 5:27 PM, August Nagro wrote:
Stream.collect(Collector) is a popular terminal stream operation. But because
the collect methods expose none of the stream's characteristics to the
collector, collectors are not as efficient as they could be.

For example, consider a non-parallel, sized stream that is to be collected
into a List. This is a very common case for streams with a Collection source.
Given those characteristics, the Collector.supplier() could create a list with
the right initial capacity (the merge function will never be called for a
sequential stream), but the current API gives the collector no way to learn
the size.
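
To illustrate the gap: pre-sizing is trivially expressible today with
Collectors.toCollection, but only when the caller already has the size in
hand; nothing carries the spliterator's size to the collector automatically.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class PreSizedCollect {
    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3, 4, 5);

        // The source's SIZED spliterator already knows this number, but the
        // collector only sees it because we pass it along by hand.
        int knownSize = source.size();

        List<Integer> result = source.stream()
                .map(i -> i + 1)
                .collect(Collectors.toCollection(
                        () -> new ArrayList<Integer>(knownSize)));
        System.out.println(result);   // [2, 3, 4, 5, 6]
    }
}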

I should note that the characteristics relevant to collectors are the ones
defined by Spliterator: Spliterator::characteristics,
Spliterator::estimateSize, and Spliterator::getExactSizeIfKnown.

One way this enhancement could be implemented is by adding a method
Stream.collect(Function<ReadOnlySpliterator, Collector> collectorBuilder).
ReadOnlySpliterator would declare the inspection methods mentioned above, and
Spliterator would be retrofitted to implement it.

For example, here is a gist with what Collectors.toList could look like:
https://gist.github.com/AugustNagro/e66a0ddf7d47b4f11fec8760281bb538
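
Sketching the general shape (this is only an illustration of the idea, not
the gist's exact code; all of the names below are hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical read-only view; Spliterator would be retrofitted to
// implement it.
interface ReadOnlySpliterator {
    int characteristics();
    long estimateSize();
    long getExactSizeIfKnown();
}

// The proposed overload on Stream<T> would then be roughly:
//   <R, A> R collect(Function<ReadOnlySpliterator,
//                             Collector<? super T, A, R>> collectorBuilder);

// ...and a characteristics-aware Collectors.toList written against it:
class SizeAwareCollectors {
    static <T> Function<ReadOnlySpliterator, Collector<T, ?, List<T>>> toList() {
        return split -> {
            long exact = split.getExactSizeIfKnown();     // -1 when unknown
            if (exact < 0 || exact > Integer.MAX_VALUE) {
                return Collectors.toList();               // today's behavior
            }
            return Collector.<T, List<T>>of(
                    () -> new ArrayList<>((int) exact),   // pre-sized list
                    List::add,
                    (left, right) -> { left.addAll(right); return left; });
        };
    }
}

The sequential, sized case then gets a pre-sized list, and everything else
falls back to the current behavior.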

ReadOnlySpliterator may need to be replaced with some stream-specific
abstraction, however, since Stream.spliterator() does not report the expected
characteristics. The code below returns false, for example (is this a bug?):

Stream.of(1, 2, 3).parallel().map(i -> i + 1)
    .spliterator().hasCharacteristics(Spliterator.CONCURRENT)

Looking forward to your thoughts,

- August Nagro
