GitHub user prateekm opened a pull request:
https://github.com/apache/samza/pull/293
Support for declaring serdes in high level API code for
input/output/intermediate streams.
@nickpan47 @vjagadish1989 @jmakes Please take a look.
Some notes/considerations:
* Serde interface is now Serializable. I've updated the built-in serde
implementations to make sure they are Serializable, but custom serde
implementations will need to be updated by users when they upgrade.
* I've tried to preserve the number of serde objects and their usage before
and after serialization, but don't guarantee that any references shared by
these serde instances before serialization will be shared after
deserialization. Ideally, serde of serdes should be done in one pass along with
all other user provided objects in the DAG so that any shared references can be
preserved.
* I've left them as-is for now, but we probably need to move Serdes to
samza-api now that users can interact with them directly.
* Samza's JsonSerde uses Samza's custom ObjectMapper since it was probably
meant to be an Samza internal serde. One peculiarity is that it forces the
'dasherized-names' property naming convention instead of the (default in
Jackson and more common) camelCase naming convention. This makes user POJO
definitions more verbose with JsonProperty annotations, and if users don't
remember to dasherize property names the resulting errors are somewhat
confusing to debug. Should we create a new JsonSerde with a default
ObjectMapper when we move it to samza-api?
* Note that the default value for a 'defaultKey/MsgSerde` is a no-op serde
instead of null. This helps users of Systems that do the
serialization/deserialization themselves (e.g. Wikipedia example), but the
downside is that it is possible to forget setting the default serde before
creating streams and end up with a no-op serde and ClassCastExceptions at
runtime.
* I've made a choice here to disallow setting system/stream serdes in
config to make configs simpler. Imho the 'serdes only in code' and 'only 2
levels of serde' model is much simpler to understand. The risk is that there
might be streams that aren't part of the user application code, but need their
serdes to be set in configuration. An example is the logging topic. Do you
think this is a reasonable choice?
* There's a separate discussion to be had about introducing a KV type and
specifying serdes for such KV types instead of specifying 'key' and 'msg'
serdes separately.
Tested with samza-hello-world Wikipedia application and all
TestRepartitionWindowApp. Will add unit tests if the overall design is
acceptable.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/prateekm/samza serde-instance
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/samza/pull/293.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #293
----
commit 0d399607064faecc983526c255c0b4037ae84aa0
Author: Prateek Maheshwari <[email protected]>
Date: 2017-09-08T21:27:58Z
Adding support for declaring serdes in high level API code for
input/output/intermediate stream.
----
---