[ 
https://issues.apache.org/jira/browse/BEAM-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin updated BEAM-73:
--------------------------------
    Description: 
Many Sources can be thought of as providing a byte[] payload -- e.g. TextIO 
bytes between newlines, or PubSubIO messages. Therefore, we originally 
suggested a Coder as the thing to use to decode these byte[] into T (what I'll 
call Parsing).

Consider the case of a text file of integers.

123\n
456\n
...

We want a PCollection<Integer> out, so we can use TextualIntegerCoder with 
TextIO.Read. However, that Coder will get propagated as the default coder for 
that PCollection (and may be used in downstream DoFns). This seems bad because, 
once the data is parsed, we probably want to use VarIntCoder or another Coder 
that is more CPU- and space-efficient.
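
To make the space argument concrete, here is a rough sketch comparing the two 
encodings for a single value. It assumes a base-128 varint of the general kind 
VarIntCoder is understood to use; the class and method names are made up for 
illustration and are not part of the SDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {
    // Textual encoding: decimal digits followed by a newline, as in the example file.
    public static byte[] encodeAsText(int value) {
        return (Integer.toString(value) + "\n").getBytes(StandardCharsets.UTF_8);
    }

    // Base-128 varint (assumed format): 7 payload bits per byte,
    // high bit set on every byte except the last.
    public static byte[] encodeAsVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int value = 123456;
        System.out.println("text bytes:   " + encodeAsText(value).length);
        System.out.println("varint bytes: " + encodeAsVarInt(value).length);
    }
}
```

For 123456 the textual form takes 7 bytes (six digits plus the newline) and the 
varint takes 3, and decoding the varint needs no character handling at all.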

Another design pattern is
    TextIO.Read() -> MapElements<String, Integer> (lambda s : Integer.parseInt(s))

This has better behavior, but now we go from byte[] to String to Integer rather 
than directly from byte[] to Integer.
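
The difference between the two paths can be sketched in plain Java. This is 
only an illustration: the class and method names are hypothetical, the input 
is assumed to be ASCII digits with an optional leading minus sign, and the 
direct version skips overflow checking.

```java
import java.nio.charset.StandardCharsets;

public class DirectParseSketch {
    // Two-step path: decode the bytes into a String, then parse the String.
    public static int viaString(byte[] line) {
        return Integer.parseInt(new String(line, StandardCharsets.US_ASCII));
    }

    // One-step path: accumulate ASCII digits straight from the bytes.
    // Simplifications: optional leading '-', no overflow check.
    public static int direct(byte[] line) {
        int i = 0;
        boolean negative = line.length > 0 && line[0] == '-';
        if (negative) {
            i++;
        }
        if (i == line.length) {
            throw new NumberFormatException("no digits");
        }
        int result = 0;
        for (; i < line.length; i++) {
            int digit = line[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            result = result * 10 + digit;
        }
        return negative ? -result : result;
    }

    public static void main(String[] args) {
        byte[] line = "456".getBytes(StandardCharsets.US_ASCII);
        System.out.println(viaString(line) + " " + direct(line));
    }
}
```

Both methods return the same value; the direct one just avoids allocating the 
intermediate String for every record.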

The solution seems to be to explicitly add Parser and Coder abstractions.
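
One way the split might look, as a minimal sketch only. Parser and SimpleCoder 
below are hypothetical interfaces invented for this example, not existing SDK 
types, and the fixed 4-byte coder is a stand-in for whatever efficient Coder 
the PCollection would actually use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParserCoderSketch {
    /** Hypothetical: one-way conversion from a raw source payload to a value. */
    public interface Parser<T> {
        T parse(byte[] payload);
    }

    /** Hypothetical, simplified stand-in for a Coder: round-trips values between steps. */
    public interface SimpleCoder<T> {
        byte[] encode(T value);
        T decode(byte[] bytes);
    }

    /** Parses non-negative ASCII decimal digits directly, with no intermediate String. */
    public static final Parser<Integer> TEXTUAL_INT_PARSER = payload -> {
        if (payload.length == 0) {
            throw new NumberFormatException("no digits");
        }
        int result = 0;
        for (byte b : payload) {
            int digit = b - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit");
            }
            result = result * 10 + digit;
        }
        return result;
    };

    /** Fixed-width binary coder, chosen here only to keep the example short. */
    public static final SimpleCoder<Integer> FIXED_INT_CODER = new SimpleCoder<Integer>() {
        @Override
        public byte[] encode(Integer value) {
            return ByteBuffer.allocate(4).putInt(value).array();
        }

        @Override
        public Integer decode(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getInt();
        }
    };

    /** The source uses only the Parser; the Coder for later steps is picked independently. */
    public static <T> List<byte[]> parseThenEncode(
            List<byte[]> payloads, Parser<T> parser, SimpleCoder<T> coder) {
        List<byte[]> encoded = new ArrayList<>();
        for (byte[] payload : payloads) {
            T value = parser.parse(payload);
            encoded.add(coder.encode(value));
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<byte[]> lines = Arrays.asList(
                "123".getBytes(StandardCharsets.US_ASCII),
                "456".getBytes(StandardCharsets.US_ASCII));
        for (byte[] bytes : parseThenEncode(lines, TEXTUAL_INT_PARSER, FIXED_INT_CODER)) {
            System.out.println(FIXED_INT_CODER.decode(bytes));
        }
    }
}
```

The point of the split is that the textual format stays a detail of the read 
step, so it never becomes the default encoding for the parsed elements.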

  was:
Many Sources can be thought of as providing a byte[] payload -- e.g. TextIO 
bytes between newlines, or PubSubIO messages. Therefore, we originally 
suggested a Coder as the thing to use to decode these byte[] into T (what I'll 
call Parsing).

Consider the case of a text file of integers.

123\n
456\n
...

We want a PCollection<Integer> out, so we can use TextualIntegerCoder with 
TextIO.Read. However, that Coder will get propagated as the default coder for 
that PCollection (and may be used in downstream DoFns). This seems bad because, 
once the data is parsed, we probably want to use VarIntCoder or another Coder 
that is more CPU- and space-efficient.


> IO design pattern: Decouple Parsers and Coders
> ----------------------------------------------
>
>                 Key: BEAM-73
>                 URL: https://issues.apache.org/jira/browse/BEAM-73
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Daniel Halperin
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
