We did consider this problem when designing the Collector API; for
example, it would have been nice if we could have a `toArray()`
collector that had all the optimizations of `Stream::toArray`.
When we looked into it, we found a number of concerning details that
caused us to turn back (many of which you've already identified), such
as the difficulty of managing parallelism, the intrusion into the API,
etc. What we found is that all this additional complexity was basically
in aid of only a few use cases -- such as collecting into a pre-sized
ArrayList. Where are the next hundred use cases for such a mechanism
that would justify the incremental API complexity? We didn't see them,
but maybe there are some.
A less intrusive API direction might be a version of Collector whose
supplier function took a size-estimate argument; this might even help in
parallel since it allows for intermediate results to start with a better
initial size guess. (And this could be implemented as a default that
delegates to the existing supplier.) Still, not really sure this
carries its weight.
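For concreteness, that less intrusive direction might look roughly like the sketch below. Everything here is hypothetical -- `SizedCollector` and the overloaded `supplier` are not part of the JDK -- the point is only that a default method lets existing collectors ignore the estimate entirely.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

// Hypothetical extension of Collector whose supplier may receive a size
// estimate (a negative value or Long.MAX_VALUE meaning "unknown", as with
// Spliterator.estimateSize()).
interface SizedCollector<T, A, R> extends Collector<T, A, R> {
    // The default delegates to the existing supplier, so current
    // collectors would not need to change.
    default Supplier<A> supplier(long sizeEstimate) {
        return supplier();
    }
}

// A toList-style collector that pre-sizes its ArrayList when an
// estimate is available.
class SizedToList<T> implements SizedCollector<T, List<T>, List<T>> {
    @Override public Supplier<List<T>> supplier() { return ArrayList::new; }
    @Override public Supplier<List<T>> supplier(long est) {
        return (est >= 0 && est < Integer.MAX_VALUE)
                ? () -> new ArrayList<>((int) est)
                : supplier();
    }
    @Override public BiConsumer<List<T>, T> accumulator() { return List::add; }
    @Override public BinaryOperator<List<T>> combiner() {
        return (a, b) -> { a.addAll(b); return a; };
    }
    @Override public Function<List<T>, List<T>> finisher() {
        return Function.identity();
    }
    @Override public Set<Characteristics> characteristics() {
        return EnumSet.of(Characteristics.IDENTITY_FINISH);
    }
}
```

A stream implementation could then call `supplier(spliterator.getExactSizeIfKnown())` instead of `supplier()`, and intermediate results in a parallel pipeline could start from per-split estimates.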
> The below code returns false, for example (is this a bug?):
>
>     Stream.of(1,2,3).parallel().map(i -> i+1)
>           .spliterator().hasCharacteristics(Spliterator.CONCURRENT)
Not a bug. The `Stream::spliterator` method (along with `iterator`) is
provided as an "escape hatch" for operations that need to get at the
elements but which cannot be easily expressed using the Stream API.
This method makes a good-faith attempt to propagate a reasonable set of
characteristics (for a stream with no intermediate ops, it does delegate
to the underlying source for its spliterator), but, given that `Stream`
is not in fact a data structure, when there is nontrivial computation on
the actual source, a relatively bare-bones spliterator is provided.
While this could probably be improved in specific cases, the return on
effort (and risk) is likely to be low, because `Stream::spliterator` is
already an infrequently used method, and it would only matter in a small
fraction of those cases. So you're in "corner case of a corner case"
territory.
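To see why the `false` result is expected rather than something `map` broke, note that CONCURRENT was never present even on the source spliterator: that characteristic describes spliterators of concurrent collections (e.g. ConcurrentLinkedQueue), not array-backed sources. A small check against current JDK behavior:

```java
import java.util.Spliterator;
import java.util.stream.Stream;

public class SpliteratorEscapeHatch {
    public static void main(String[] args) {
        // No intermediate ops: Stream::spliterator delegates to the
        // array-backed source spliterator, which is SIZED but was never
        // CONCURRENT to begin with.
        Spliterator<Integer> source = Stream.of(1, 2, 3).spliterator();
        System.out.println(source.hasCharacteristics(Spliterator.SIZED));      // true
        System.out.println(source.hasCharacteristics(Spliterator.CONCURRENT)); // false

        // After an intermediate op, a wrapping spliterator is returned; it
        // propagates a reasonable subset of characteristics, nothing more.
        Spliterator<Integer> mapped =
                Stream.of(1, 2, 3).parallel().map(i -> i + 1).spliterator();
        System.out.println(mapped.hasCharacteristics(Spliterator.CONCURRENT)); // false
    }
}
```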
On 2/23/2019 5:27 PM, August Nagro wrote:
Stream.collect(Collector) is a popular terminal stream operation. But
because the collect methods expose no details of the stream's
characteristics to the collector, collectors are not as efficient as they
could be.
For example, consider a non-parallel, sized stream that is to be collected
as a List. This is a very common case for streams with a Collection source.
Because of the stream's characteristics, Collector.supplier() could
create the list pre-sized to the stream's known size (and, since the
stream is sequential, the merging function will never be called), but the
current API makes this impossible.
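What that optimization amounts to can be written by hand today for a sequential, sized source; this sketch only illustrates the allocation being saved, it is not a proposed API:

```java
import java.util.ArrayList;
import java.util.List;

public class PreSizedCollect {
    public static void main(String[] args) {
        List<Integer> src = List.of(1, 2, 3, 4);

        // The target list is allocated at its final size up front,
        // so it never has to grow during accumulation.
        List<Integer> out = new ArrayList<>(src.size());
        src.stream().map(i -> i + 1).forEach(out::add);
        System.out.println(out); // [2, 3, 4, 5]
    }
}
```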
I should note that the characteristics important to collectors are those
defined by Spliterator, such as Spliterator::characteristics,
Spliterator::estimateSize, and Spliterator::getExactSizeIfKnown.
One way this enhancement could be implemented is by adding a method
Stream.collect(Function<ReadOnlySpliterator, Collector> collectorBuilder).
ReadOnlySpliterator would implement the spliterator methods mentioned
above, and Spliterator would be made to implement this interface.
For example, here is a gist with what Collectors.toList could look like:
https://gist.github.com/AugustNagro/e66a0ddf7d47b4f11fec8760281bb538
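In the spirit of that gist, here is a rough, self-contained sketch of what a size-aware Collectors.toList could look like under this proposal. The shape of ReadOnlySpliterator is an assumption based on the description above, and the adapter in main exists only for demonstration, since Spliterator does not implement such an interface today:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectorBuilderSketch {
    // Hypothetical read-only view of a stream's characteristics; in the
    // proposal, Spliterator itself would implement this interface.
    interface ReadOnlySpliterator {
        int characteristics();
        long estimateSize();
        long getExactSizeIfKnown();
    }

    // A collectorBuilder-style toList: pre-size the ArrayList when the
    // exact size is known, otherwise fall back to the plain collector.
    static <T> Collector<T, ?, List<T>> toList(ReadOnlySpliterator split) {
        long n = split.getExactSizeIfKnown(); // -1 when the size is unknown
        if (n >= 0 && n < Integer.MAX_VALUE) {
            return Collectors.toCollection(() -> new ArrayList<T>((int) n));
        }
        return Collectors.toList();
    }

    public static void main(String[] args) {
        // Demonstration adapter over a real, sized spliterator.
        Spliterator<Integer> s = List.of(1, 2, 3).spliterator();
        ReadOnlySpliterator view = new ReadOnlySpliterator() {
            public int characteristics()     { return s.characteristics(); }
            public long estimateSize()       { return s.estimateSize(); }
            public long getExactSizeIfKnown() { return s.getExactSizeIfKnown(); }
        };
        List<Integer> out = Stream.of(1, 2, 3).collect(toList(view));
        System.out.println(out); // [1, 2, 3]
    }
}
```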
ReadOnlySpliterator may need to be replaced with some stream-specific
abstraction, however, since Stream.spliterator() does not report the
source's characteristics. The below code returns false, for example (is
this a bug?):

    Stream.of(1,2,3).parallel().map(i -> i+1)
          .spliterator().hasCharacteristics(Spliterator.CONCURRENT)
Looking forward to your thoughts,
- August Nagro