Hi Jesse,

A PCollection does not have a definite order; it is just a multiset/bag of elements. So any ordering you are seeing is an artifact of a particular runner, essentially a coincidence, and not something to rely on. Can you tell me more about your use case?

Kenn

On Fri, May 20, 2016 at 1:46 PM, Jesse Anderson <[email protected]> wrote:

> Kenn,
>
> The conversion to PCollection<String> doesn't work for me because I wanted
> to maintain order. To keep the order, I need things in
> PCollection<List<String>>. To create the ordered list, I did:
>
> PCollection<List<String>> orderedList =
>     formattedCountsGlobal.apply(Top.smallest(200));
>
> Then tried to write it out with:
>
> orderedList.apply(TextIO.Write.withCoder(ListCoder.of(StringDelegateCoder.of(String.class))).to("output/result"));
>
> I'm using Top.smallest as a hack to order results, but that's a separate
> topic.
>
> To answer my own question about DataOutputStream.writeUTF not working, a
> short is written out before the string is written. This causes the same
> issue as the VarInt. I should have used writeBytes(). That doesn't write
> out a size first.
>
> Thanks,
>
> Jesse
>
> On Fri, May 20, 2016, 11:47 AM Kenneth Knowles <[email protected]> wrote:
>
>> Hi Jesse,
>>
>> I'm having trouble following exactly where the trouble is arising, but
>> let me expand my main recommendation to be an edit of your code snippet
>> (please forgive any typos or type errors).
>>
>> Original:
>> ----------
>> orderedList
>>     .apply(TextIO.Write
>>         .withCoder(ListCoder.of(StringDelegateCoder.of(String.class)))
>>         .to("output/result"));
>>
>>
>> My main recommendation
>> ---------------------
>> import static org.apache.beam.sdk.values.TypeDescriptors.strings;
>>
>> orderedList
>>     .apply(MapElements.via(x -> x.toString()).withOutputType(strings()))
>>     .apply(TextIO.Write.to("output/result"));
>>
>>
>> Another approach, which I do not recommend
>> --------------------------------------------------------------
>> orderedList
>>     .apply(TextIO.Write
>>         .withCoder(StringDelegateCoder.of(List.class))
>>         .to("output/result"));
>>
>> I don't recommend it because StringDelegateCoder is really intended
>> for things like URI which have a canonical string representation for 1-1
>> conversions, not for readable human output.
>>
>> If neither of these works for you, perhaps you could paste a larger
>> snippet of your pipeline.
>>
>> Kenn
>>
>> On Thu, May 19, 2016 at 9:32 PM, Jesse Anderson <[email protected]>
>> wrote:
>>
>>> I'm writing out a PCollection<List<String>>. My goal is to write out
>>> each element in the list as a new line.
>>>
>>> The StringUtf8Coder also writes out a VarInt for the size of the bytes.
>>> The StringDelegateCoder with the ListCoder doesn't actually write out
>>> text.
>>>
>>> I think List<String> support should be added to TextIO.Write. Or maybe a
>>> new coder needs to be added that outputs text, with support for Lists, KVs,
>>> Sets, etc.
>>>
>>> On Thu, May 19, 2016 at 9:23 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> Hi Jesse,
>>>>
>>>> StringDelegateCoder does just what you have said: it encodes using
>>>> #toString() and decodes assuming a single-arg constructor.
>>>>
>>>> But by analogy with what you have written, and if I understand your
>>>> goals correctly, what you want here is
>>>> TextIO.Write.withCoder(StringDelegateCoder.of(List.class))
>>>> since you want to base it on List#toString() not String#toString().
>>>>
>>>> That said, probably the best way to write a reliable and/or readable
>>>> format with TextIO.Write is to intentionally produce just the string you
>>>> want for your output format - including escaping newlines, etc - and then
>>>> use StringUtf8Coder.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, May 19, 2016 at 9:00 PM, Jesse Anderson <[email protected]>
>>>> wrote:
>>>>
>>>>> I'm trying to write out a List<String> with TextIO.Write. The only
>>>>> supported type is String. I ended up writing an anonymous coder.
>>>>>
>>>>> I want to check if there is a coder that I couldn't find that would
>>>>> just take an object and write out the .toString() of it.
>>>>>
>>>>> I tried this:
>>>>>
>>>>> orderedList.apply(TextIO.Write.withCoder(ListCoder.of(StringDelegateCoder.of(String.class))).to("output/result"));
>>>>>
>>>>> But a VarInt is encoded along with everything. I'm looking for a coder
>>>>> that only writes out the UTF8.
>>>>>
>>>>> This functionality would be similar to Hadoop TextOutputFormat. It
>>>>> just runs a .toString before writing it out.
>>>>>
>>>>> In the anonymous coder I wrote, I hit a weird issue. This code just
>>>>> writes out a bunch of "\n". Yes, value is populated with data.
>>>>> dataOutputStream.writeUTF(value);
>>>>> dataOutputStream.writeUTF("\n");
>>>>>
>>>>> This code works:
>>>>> byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
>>>>> dataOutputStream.write(bytes);
>>>>> dataOutputStream.writeUTF("\n");
>>>>>
>>>>> I took this from the string coder. What's odd is that DOS' writeUTF
>>>>> should work too. Is there a reason why?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> jesse
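
For anyone following the writeUTF question in this thread, here is a minimal sketch that uses only java.io (no Beam) to show the difference in bytes produced. The class name WriteUtfDemo and its two helper methods are hypothetical names made up for this illustration; they are not part of any coder in the SDK. It compares DataOutputStream.writeUTF, which emits a two-byte length before the string data, with writing the raw UTF-8 bytes directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class WriteUtfDemo {

  // Encodes the value with writeUTF, which prepends a two-byte length.
  static byte[] viaWriteUTF(String value) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // Encodes the value as plain UTF-8 bytes with no length prefix.
  static byte[] viaRawUtf8(String value) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.write(value.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    // The leading 0, 3 is the length prefix written by writeUTF.
    System.out.println(Arrays.toString(viaWriteUTF("abc"))); // [0, 3, 97, 98, 99]
    System.out.println(Arrays.toString(viaRawUtf8("abc")));  // [97, 98, 99]
  }
}
```

One caveat on the writeBytes() idea mentioned above: DataOutputStream.writeBytes discards the high eight bits of each char, so it only round-trips text whose characters fit in one byte. Writing value.getBytes(StandardCharsets.UTF_8) avoids both the length prefix and that truncation.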
