[ https://issues.apache.org/jira/browse/BEAM-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kenneth Knowles updated BEAM-73:
--------------------------------
    Fix Version/s: First stable release

> IO design pattern: Decouple Parsers and Coders
> ----------------------------------------------
>
>                 Key: BEAM-73
>                 URL: https://issues.apache.org/jira/browse/BEAM-73
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Daniel Halperin
>            Priority: Minor
>              Labels: backward-incompatible
>             Fix For: First stable release
>
>
> Many Sources can be thought of as providing a byte[] payload -- e.g. TextIO
> bytes between newlines, or PubSubIO messages. Therefore, we originally
> suggested a Coder as the thing to use to decode these byte[] into T (what
> I'll call Parsing).
> Consider the case of a text file of integers.
> 123\n
> 456\n
> ...
> We want a PCollection<Integer> out, so we can use TextualIntegerCoder with
> TextIO.Read. However, that Coder will get propagated as the default coder for
> that PCollection (and may be used in downstream DoFns). This seems bad
> because, once the data is parsed, we probably want to use VarIntCoder or
> another Coder that is more CPU- and space-efficient.
> Another design pattern is
> TextIO.Read() -> MapElements<String, Integer> (lambda s :
> Integer.parseInt(s))
> This has better behavior, but now we go from byte[] to String to Integer
> rather than directly from byte[] to Integer.
> The solution seems to be to explicitly add Parser and Coder abstractions.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
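
A minimal Java sketch of the Parser/Coder split proposed in the issue above. Every name here is hypothetical and invented for illustration: the `Parser` and `Coder` interfaces, `TextualIntegerParser`, and `CompactIntCoder` are not the Beam SDK API, and the varint encoding is only a stand-in for whatever compact Coder a pipeline would really use. The point it tries to show is that the parser runs once at the source, going from byte[] directly to Integer, while a separate compact coder handles the values downstream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical names throughout: none of these types are the Beam SDK API.
final class ParserCoderSketch {

    /** Hypothetical: turns a source's raw payload into a T. Used only when reading. */
    interface Parser<T> {
        T parse(byte[] payload);
    }

    /** Hypothetical: serializes a T between pipeline stages. */
    interface Coder<T> {
        byte[] encode(T value);
        T decode(byte[] bytes);
    }

    /** Parses ASCII decimal digits straight from bytes, with no intermediate String. */
    static final class TextualIntegerParser implements Parser<Integer> {
        @Override
        public Integer parse(byte[] payload) {
            int i = 0;
            boolean negative = payload.length > 0 && payload[0] == '-';
            if (negative) {
                i++;
            }
            if (i == payload.length) {
                throw new NumberFormatException("no digits in payload");
            }
            long result = 0;
            for (; i < payload.length; i++) {
                int digit = payload[i] - '0';
                if (digit < 0 || digit > 9) {
                    throw new NumberFormatException("not a digit at offset " + i);
                }
                result = result * 10 + digit;
                if (result > (long) Integer.MAX_VALUE + 1) {
                    throw new NumberFormatException("value out of int range");
                }
            }
            long signed = negative ? -result : result;
            if (signed > Integer.MAX_VALUE) {
                throw new NumberFormatException("value out of int range");
            }
            return (int) signed;
        }
    }

    /** Base-128 varint over the int's unsigned bits: small values take fewer bytes. */
    static final class CompactIntCoder implements Coder<Integer> {
        @Override
        public byte[] encode(Integer value) {
            ByteArrayOutputStream out = new ByteArrayOutputStream(5);
            int v = value;
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80); // low 7 bits, continuation bit set
                v >>>= 7;
            }
            out.write(v); // final group, continuation bit clear
            return out.toByteArray();
        }

        @Override
        public Integer decode(byte[] bytes) {
            // Sketch only: assumes the array holds exactly one well-formed varint.
            int result = 0;
            int shift = 0;
            for (byte b : bytes) {
                result |= (b & 0x7F) << shift;
                shift += 7;
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Parser<Integer> parser = new TextualIntegerParser();
        Coder<Integer> coder = new CompactIntCoder();

        // One line of the text file from the issue, without its trailing newline.
        byte[] line = "456".getBytes(StandardCharsets.US_ASCII);
        Integer parsed = parser.parse(line);   // read-time concern
        byte[] wire = coder.encode(parsed);    // downstream concern
        System.out.println(parsed + " encodes to " + wire.length + " byte(s)");
    }
}
```

With a split like this, the textual format never becomes the default coder for the resulting collection, and the byte[] to String to Integer detour of the MapElements approach is avoided.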