Kenn,
The conversion to PCollection<String> doesn't work for me because I want
to maintain order. To keep the order, I need the values in a
PCollection<List<String>>. To create the ordered list, I did:
PCollection<List<String>> orderedList =
formattedCountsGlobal.apply(Top.smallest(200));
Then tried to write it out with:
orderedList.apply(TextIO.Write.withCoder(ListCoder.of(StringDelegateCoder.of(String.class))).to("output/result"));
I'm using Top.smallest as a hack to order results, but that's a separate
topic.
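For reference, here is the formatting step on its own, as a minimal
plain-Java sketch with no Beam dependency. The class and method names
(ListFormatting, toLines) are made up for illustration. The idea is that a
function like this could be handed to MapElements so that each ordered
List<String> becomes one newline-separated String before it reaches
TextIO.Write. I have not shown or verified the Beam wiring here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Beam: turns one ordered list into the
// text that should appear in the output file.
class ListFormatting {

    // Joins the elements with '\n' in list order; no trailing newline is
    // added.
    static String toLines(List<String> items) {
        return String.join("\n", items);
    }

    public static void main(String[] args) {
        List<String> ordered = Arrays.asList("apple 3", "banana 7", "cherry 9");
        System.out.println(toLines(ordered));
    }
}
```

Since the order lives inside the List, joining it inside a single element
keeps that order in the output, which a flattened PCollection<String>
would not guarantee.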
To answer my own question about DataOutputStream.writeUTF not working:
writeUTF writes a two-byte length (a short) before the string itself. This
causes the same issue as the VarInt. I should have used writeBytes(), which
doesn't write out a size first.
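A small standalone check of the length prefix, as a sketch. The class and
method names (WriteUtfDemo, viaWriteUtf, viaRawBytes) are only for
illustration. It compares the bytes produced by writeUTF with the bytes
produced by writing the UTF-8 encoding directly:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class WriteUtfDemo {

    // writeUTF emits a two-byte big-endian length followed by the string
    // in modified UTF-8.
    static byte[] viaWriteUtf(String value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(value);
        }
        return buffer.toByteArray();
    }

    // write(byte[]) emits exactly the bytes it is given, with no prefix.
    static byte[] viaRawBytes(String value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(viaWriteUtf("\n").length);  // 3: 0x00 0x01 0x0A
        System.out.println(viaRawBytes("\n").length);  // 1: 0x0A
    }
}
```

One caveat: writeBytes() also skips the length prefix, but it keeps only
the low eight bits of each char, so it is only safe for ASCII. Writing
value.getBytes(StandardCharsets.UTF_8) is the safer choice for general
text.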
Thanks,
Jesse
On Fri, May 20, 2016, 11:47 AM Kenneth Knowles <[email protected]> wrote:
> Hi Jesse,
>
> I'm having trouble following exactly where the trouble is arising, but let
> me expand my main recommendation to be an edit of your code snippet (please
> forgive any typos or type errors).
>
> Original:
> ----------
> orderedList
> .apply(TextIO.Write
> .withCoder(ListCoder.of(StringDelegateCoder.of(String.class)))
> .to("output/result"));
>
>
> My main recommendation
> ---------------------
> import static org.apache.beam.values.TypeDescriptors.strings;
>
> orderedList
> .apply(MapElements.via(x -> x.toString()).withOutputType(strings()))
> .apply(TextIO.Write.to("output/result"));
>
>
> Another approach, which I do not recommend
> --------------------------------------------------------------
> orderedList
> .apply(TextIO.Write
> .withCoder(StringDelegateCoder.of(List.class))
> .to("output/result"));
>
> I don't recommend it because StringDelegateCoder is really intended
> for things like URI, which have a canonical string representation for 1-1
> conversions, not for human-readable output.
>
> If neither of these works for you, perhaps you could paste a larger
> snippet of your pipeline.
>
> Kenn
>
> On Thu, May 19, 2016 at 9:32 PM, Jesse Anderson <[email protected]>
> wrote:
>
>> I'm writing out a PCollection<List<String>>. My goal is to write out each
>> element in the list as a new line.
>>
>> The StringUtf8Coder also writes out a VarInt for the size of the bytes.
>> The StringDelegateCoder with the ListCoder doesn't actually write out
>> text.
>>
>> I think List<String> support should be added to TextIO.Write. Or maybe a
>> new coder needs to be added that outputs text, with support for Lists, KVs,
>> Sets, etc.
>>
>> On Thu, May 19, 2016 at 9:23 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> Hi Jesse,
>>>
>>> StringDelegateCoder does just what you have said: it encodes using
>>> #toString() and decodes assuming a single-arg constructor.
>>>
>>> But by analogy with what you have written, and if I understand your
>>> goals correctly, what you want here is
>>> TextIO.Write.withCoder(StringDelegateCoder.of(List.class))
>>> since you want to base it on List#toString() not String#toString().
>>>
>>> That said, probably the best way to write a reliable and/or readable
>>> format with TextIO.Write is to intentionally produce just the string you
>>> want for your output format - including escaping newlines, etc - and then
>>> use StringUtf8Coder.
>>>
>>> Kenn
>>>
>>> On Thu, May 19, 2016 at 9:00 PM, Jesse Anderson <[email protected]>
>>> wrote:
>>>
>>>> I'm trying to write out a List<String> with TextIO.Write. The only
>>>> supported type is String. I ended up writing an anonymous coder.
>>>>
>>>> I want to check if there is a coder that I couldn't find that would
>>>> just take an object and write out the .toString() of it.
>>>>
>>>> I tried this:
>>>>
>>>> orderedList.apply(TextIO.Write.withCoder(ListCoder.of(StringDelegateCoder.of(String.class))).to("output/result"));
>>>>
>>>> But a VarInt is encoded along with everything. I'm looking for a coder
>>>> that only writes out the UTF8.
>>>>
>>>> This functionality would be similar to Hadoop TextOutputFormat. It just
>>>> runs a .toString before writing it out.
>>>>
>>>> In the anonymous coder I wrote, I hit a weird issue. This code just
>>>> writes out a bunch of "\n". Yes, value is populated with data.
>>>> dataOutputStream.writeUTF(value);
>>>> dataOutputStream.writeUTF("\n");
>>>>
>>>> This code works:
>>>> byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
>>>> dataOutputStream.write(bytes);
>>>> dataOutputStream.writeUTF("\n");
>>>>
>>>> I took this from the string coder. What's odd is that DOS' writeUTF
>>>> should work too. Is there a reason why it doesn't?
>>>>
>>>> Thanks,
>>>>
>>>> jesse
>>>>
>>>
>>>
>