>From the private@ thread, I suggested: "With the JvmInitializer[1] being supported by Dataflow and the portable Java container, users would be able to write code which sets the system property jdk.serialFilter or by configuring ObjectInputFilter.Config.setSerialFilter(filter)[2]"
This could become a documentation change to SerializableCoder. 1: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java 2: https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25 On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <[email protected]> wrote: > Thanks for all the feedback. Two possible ideas that occur to me: > > - Create TypeSafeSerializableCoder or somesuch which extends > SerializableCoder and enforces the check as in the PR. Update the > documentation to suggest using the new coder if you don't need to support > collections as the raw type or subclasses. > - Add a new "of" method to SerializableCoder which takes a boolean > parameter to control whether we perform this check or not (defaults to > true?). > > Colm. > > On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <[email protected]> wrote: > >> I have seen people ingest data using SerializableCoder. >> >> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <[email protected]> wrote: >> >>> I won't bring other people's words from private@, but can share mine. I >>> don't believe it exposes anything new. >>> >>> > If it is SerializableCoder - attacker controls the other end of e.g. >>> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have >>> an allowlist or try to automatically construct an allowlist] and otherwise >>> there is no vulnerability for internal coders. >>> >>> I have never seen or heard of a user doing dynamic deserialization >>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is >>> important to someone then they would need a more secure solution than >>> SerializableCoder. >>> >>> Side note: it would be great to provide an efficient and usable solution >>> for the problem of wanting to dynamically dispatch serde in the middle of a >>> pipeline. It is actually independent from being able to provide coders for >>> a wide variety of types, which we can do a bunch of different (mostly >>> better) ways. (has a better solution been built since the last time I >>> thought about this?) >>> >>> Kenn >>> >>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <[email protected]> wrote: >>> >>>> The link to the previous covnersation (discussion happened in private@ >>>> and I suppose we can bring some relevant bits here if needed) >>>> >>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E >>>> >>>> I remember Robert had some points there, but I am not sure we >>>> found/agreed on a solution that was relevant and did not break current >>>> users and their user experience (like the case of blacklists). >>>> >>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <[email protected]> wrote: >>>> > >>>> > Being able to have something that can encode any object (or at least >>>> a large class of objects) is extremely powerful so requiring >>>> SerializableCoder<T> to only encode T.class would hurt our users. >>>> > >>>> > I believe someone looked at this kind of problem before and we came >>>> to an agreement of usng an explicit approve/deny list on the class names >>>> which would address the security concern. I don't remember the thread >>>> though and couldn't find the thread after a few minutes of searching. >>>> > >>>> > >>>> > >>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <[email protected]> >>>> wrote: >>>> >> >>>> >> So you think the spec for SerializableCoder<T> (currently doesn't >>>> really have one) should be that it dynamically dispatches what it >>>> deserializes? I had imagined we would treat it more as a statically >>>> determined coder, so because it is invariant in T we would not allow up or >>>> down casts (they are unsafe). But we probably don't actually have the >>>> static information to do that anyhow so you are probably right. >>>> >> >>>> >> I wonder about the threat model here. Is this the event that the >>>> runner (managed service or bespoke cluster) is compromised and is >>>> attempting RCE on the Java SDK harness or runner-specific Java-based >>>> worker? >>>> >> >>>> >> Kenn >>>> >> >>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <[email protected]> wrote: >>>> >>> >>>> >>> I don't think this is going to work since SerializableCoder<T> >>>> should be able to decode T and all objects that implement/extend T. I'm >>>> pretty sure SerializableCoder<Set/List/...> is common enough while the >>>> concrete type is HashSet/ArrayList/... >>>> >>> I'm pretty sure there is some way you could come up with some way >>>> for making this optin though. >>>> >>> >>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh < >>>> [email protected]> wrote: >>>> >>>> >>>> >>>> Thanks Kenn. I submitted a PR here: >>>> https://github.com/apache/beam/pull/11191 >>>> >>>> >>>> >>>> Colm. >>>> >>>> >>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <[email protected]> >>>> wrote: >>>> >>>>> >>>> >>>>> I think this is fine. The same coder is used for encode and >>>> decode, so the Class object should be the same as well. Inheritance is not >>>> part of the Beam model (thank goodness) so this is a language-specific >>>> concern. As far as the model is concerned, the full URN and the payload of >>>> the coder is its identity and coders with different identities have no >>>> inheritance or compatibility relationship. Pipeline snapshot/update is an >>>> edge case, but changing coder is not supported by any runner I know of, and >>>> probably won't be until we have some rather large new ideas. >>>> >>>>> >>>> >>>>> Kenn >>>> >>>>> >>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh < >>>> [email protected]> wrote: >>>> >>>>>> >>>> >>>>>> Hi, >>>> >>>>>> >>>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening >>>> the Java Object deserialization that is taking place. We have a "Class<T> >>>> type" that is used to decode the input stream: >>>> >>>>>> >>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream); >>>> >>>>>> return type.cast(ois.readObject()); >>>> >>>>>> >>>> >>>>>> What I would like to do would be something like: >>>> >>>>>> >>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) { >>>> >>>>>> @Override >>>> >>>>>> protected Class<?> resolveClass(ObjectStreamClass desc) >>>> throws IOException, ClassNotFoundException { >>>> >>>>>> if (!desc.getName().equals(type.getName())) { >>>> >>>>>> throw new InvalidClassException("Unauthorized >>>> deserialization attempt", desc.getName()); >>>> >>>>>> } >>>> >>>>>> return super.resolveClass(desc); >>>> >>>>>> } >>>> >>>>>> }; >>>> >>>>>> return type.cast(ois.readObject()); >>>> >>>>>> >>>> >>>>>> This would prevent a possible security hole where an attacker >>>> could try to force the recipient of the input stream to deserialize to a >>>> gadget class or the like for a RCE. >>>> >>>>>> >>>> >>>>>> The question is - does the deserialized type have to correspond >>>> exactly to the supplied Class? Or is it supported that it's a base type / >>>> abstract class? If the latter then my idea won't really work. But if the >>>> type corresponds exactly then it should work OK. >>>> >>>>>> >>>> >>>>>> Thanks, >>>> >>>>>> >>>> >>>>>> Colm. >>>> >>>
