Thanks Luke. I updated the JIRA to suggest changing the docs along the lines of your post: https://issues.apache.org/jira/browse/BEAM-9570
Colm. On Thu, Mar 26, 2020 at 3:06 PM Luke Cwik <[email protected]> wrote: > From the private@ thread, I suggested: > "With the JvmInitializer[1] being supported by Dataflow and the portable > Java container, users would be able to write code which sets the system > property jdk.serialFilter or by configuring > ObjectInputFilter.Config.setSerialFilter(filter)[2]" > > This could become a documentation change to SerializableCoder. > > 1: > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java > 2: > https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25 > > On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <[email protected]> > wrote: > >> Thanks for all the feedback. Two possible ideas that occur to me: >> >> - Create TypeSafeSerializableCoder or somesuch which extends >> SerializableCoder and enforces the check as in the PR. Update the >> documentation to suggest using the new coder if you don't need to support >> collections as the raw type or subclasses. >> - Add a new "of" method to SerializableCoder which takes a boolean >> parameter to control whether we perform this check or not (defaults to >> true?). >> >> Colm. >> >> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <[email protected]> wrote: >> >>> I have seen people ingest data using SerializableCoder. >>> >>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <[email protected]> wrote: >>> >>>> I won't bring other people's words from private@, but can share mine. >>>> I don't believe it exposes anything new. >>>> >>>> > If it is SerializableCoder - attacker controls the other end of e.g. >>>> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have >>>> an allowlist or try to automatically construct an allowlist] and otherwise >>>> there is no vulnerability for internal coders. >>>> >>>> I have never seen or heard of a user doing dynamic deserialization >>>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is >>>> important to someone then they would need a more secure solution than >>>> SerializableCoder. >>>> >>>> Side note: it would be great to provide an efficient and usable >>>> solution for the problem of wanting to dynamically dispatch serde in the >>>> middle of a pipeline. It is actually independent from being able to provide >>>> coders for a wide variety of types, which we can do a bunch of different >>>> (mostly better) ways. (has a better solution been built since the last time >>>> I thought about this?) >>>> >>>> Kenn >>>> >>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <[email protected]> wrote: >>>> >>>>> The link to the previous covnersation (discussion happened in private@ >>>>> and I suppose we can bring some relevant bits here if needed) >>>>> >>>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E >>>>> >>>>> I remember Robert had some points there, but I am not sure we >>>>> found/agreed on a solution that was relevant and did not break current >>>>> users and their user experience (like the case of blacklists). >>>>> >>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <[email protected]> wrote: >>>>> > >>>>> > Being able to have something that can encode any object (or at least >>>>> a large class of objects) is extremely powerful so requiring >>>>> SerializableCoder<T> to only encode T.class would hurt our users. >>>>> > >>>>> > I believe someone looked at this kind of problem before and we came >>>>> to an agreement of usng an explicit approve/deny list on the class names >>>>> which would address the security concern. I don't remember the thread >>>>> though and couldn't find the thread after a few minutes of searching. >>>>> > >>>>> > >>>>> > >>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> >> >>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't >>>>> really have one) should be that it dynamically dispatches what it >>>>> deserializes? I had imagined we would treat it more as a statically >>>>> determined coder, so because it is invariant in T we would not allow up or >>>>> down casts (they are unsafe). But we probably don't actually have the >>>>> static information to do that anyhow so you are probably right. >>>>> >> >>>>> >> I wonder about the threat model here. Is this the event that the >>>>> runner (managed service or bespoke cluster) is compromised and is >>>>> attempting RCE on the Java SDK harness or runner-specific Java-based >>>>> worker? >>>>> >> >>>>> >> Kenn >>>>> >> >>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <[email protected]> wrote: >>>>> >>> >>>>> >>> I don't think this is going to work since SerializableCoder<T> >>>>> should be able to decode T and all objects that implement/extend T. I'm >>>>> pretty sure SerializableCoder<Set/List/...> is common enough while the >>>>> concrete type is HashSet/ArrayList/... >>>>> >>> I'm pretty sure there is some way you could come up with some way >>>>> for making this optin though. >>>>> >>> >>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh < >>>>> [email protected]> wrote: >>>>> >>>> >>>>> >>>> Thanks Kenn. I submitted a PR here: >>>>> https://github.com/apache/beam/pull/11191 >>>>> >>>> >>>>> >>>> Colm. >>>>> >>>> >>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> >>>>> >>>>> >>>>> I think this is fine. The same coder is used for encode and >>>>> decode, so the Class object should be the same as well. Inheritance is not >>>>> part of the Beam model (thank goodness) so this is a language-specific >>>>> concern. As far as the model is concerned, the full URN and the payload of >>>>> the coder is its identity and coders with different identities have no >>>>> inheritance or compatibility relationship. Pipeline snapshot/update is an >>>>> edge case, but changing coder is not supported by any runner I know of, >>>>> and >>>>> probably won't be until we have some rather large new ideas. >>>>> >>>>> >>>>> >>>>> Kenn >>>>> >>>>> >>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh < >>>>> [email protected]> wrote: >>>>> >>>>>> >>>>> >>>>>> Hi, >>>>> >>>>>> >>>>> >>>>>> I have a question on SerializableCoder. I'm looking at >>>>> hardening the Java Object deserialization that is taking place. We have a >>>>> "Class<T> type" that is used to decode the input stream: >>>>> >>>>>> >>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream); >>>>> >>>>>> return type.cast(ois.readObject()); >>>>> >>>>>> >>>>> >>>>>> What I would like to do would be something like: >>>>> >>>>>> >>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) { >>>>> >>>>>> @Override >>>>> >>>>>> protected Class<?> resolveClass(ObjectStreamClass desc) >>>>> throws IOException, ClassNotFoundException { >>>>> >>>>>> if (!desc.getName().equals(type.getName())) { >>>>> >>>>>> throw new InvalidClassException("Unauthorized >>>>> deserialization attempt", desc.getName()); >>>>> >>>>>> } >>>>> >>>>>> return super.resolveClass(desc); >>>>> >>>>>> } >>>>> >>>>>> }; >>>>> >>>>>> return type.cast(ois.readObject()); >>>>> >>>>>> >>>>> >>>>>> This would prevent a possible security hole where an attacker >>>>> could try to force the recipient of the input stream to deserialize to a >>>>> gadget class or the like for a RCE. >>>>> >>>>>> >>>>> >>>>>> The question is - does the deserialized type have to correspond >>>>> exactly to the supplied Class? Or is it supported that it's a base type / >>>>> abstract class? If the latter then my idea won't really work. But if the >>>>> type corresponds exactly then it should work OK. >>>>> >>>>>> >>>>> >>>>>> Thanks, >>>>> >>>>>> >>>>> >>>>>> Colm. >>>>> >>>>
