Another perspective is to look at other projects in the Hadoop ecosystem.
Impala used to require a LIMIT any time you did an ORDER BY. They've since
removed this limitation.
Hive has two sorting options. ORDER BY does a global order. SORT BY orders
rows only within each partition, so the output as a whole is not sorted.
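To make the difference between the two scopes concrete, here is a minimal plain-Java sketch. It is not Hive or Beam code; the class name SortScopes and the methods orderBy and sortBy are hypothetical names chosen for illustration, and in-memory lists stand in for partitions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortScopes {
    // Global order: gather every partition into one list, then sort once.
    // This mirrors the effect of ORDER BY (one total order over all rows).
    static List<Integer> orderBy(List<List<Integer>> partitions) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            all.addAll(partition);
        }
        Collections.sort(all);
        return all;
    }

    // Per-partition order: sort each partition independently.
    // This mirrors the effect of SORT BY (no ordering across partitions).
    static List<List<Integer>> sortBy(List<List<Integer>> partitions) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<Integer> copy = new ArrayList<>(partition);
            Collections.sort(copy);
            sorted.add(copy);
        }
        return sorted;
    }
}
```

The global version has to bring all rows together before sorting, which is the expensive part; the per-partition version never moves data between partitions.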
On Thu, May 26, 2016
I had a similar thought, but wasn't sure if that violated a tenet of Beam.
I'm thinking an ordered sink could wrap around another sink. I could see
something like:
collection.apply(OrderedSink.Timestamp.write(TextIO.Write.to(...)));
On Thu, May 26, 2016 at 12:26 PM Robert Bradshaw
As Frances alluded to, it's also really hard to reconcile the notion
of a globally ordered PCollection in the context of a streaming
pipeline. Sorting also imposes conditions on partitioning, which we
intentionally leave unspecified for maximum flexibility in the
runtime. One also gets into the
@frances great analysis. I'm hoping this serves as the starting point for
the discussion.
It really comes down to: is this a nice-to-have or a show-stopping
requirement? As you mention, it depends on the use case. I've taught at
large financial companies where (global) sorting was a real and
https://blog.twitter.com/2016/open-sourcing-twitter-heron
The more the merrier for Beam? :-)
Venkatesh
This is something of a continuation of my thread "Writing Out List."
Right now, the only way to do sorting is with the Top class. This works
well, but has the constraint of fitting in memory.
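As a rough illustration of where the memory constraint comes from, here is a plain-Java sketch of a bounded top-N selection. It is not Beam's Top implementation; BoundedTop and largest are hypothetical names, and the assumption is simply that a top-N operation keeps its N candidates in memory at once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class BoundedTop {
    // Returns the n largest values in descending order.
    // The heap never holds more than n + 1 elements, so memory grows with n,
    // not with the size of the input.
    static List<Integer> largest(Iterable<Integer> values, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (Integer value : values) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest candidate seen so far
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

Using this kind of selection as a full sort means setting n to the size of the whole dataset, at which point every element has to be held in memory at once.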
A common batch use case is to take a large file and sort it. For example,
this would be sorting a large