Re: Question on SerializableCoder

Luke Cwik Thu, 26 Mar 2020 08:06:59 -0700

>From the private@ thread, I suggested:
"With the JvmInitializer[1] being supported by Dataflow and the portable
Java container, users would be able to write code which sets the system
property jdk.serialFilter or by configuring
ObjectInputFilter.Config.setSerialFilter(filter)[2]"


This could become a documentation change to SerializableCoder.

1:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
2:
https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25

On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <[email protected]>
wrote:

> Thanks for all the feedback. Two possible ideas that occur to me:
>
>  - Create TypeSafeSerializableCoder or somesuch which extends
> SerializableCoder and enforces the check as in the PR. Update the
> documentation to suggest using the new coder if you don't need to support
> collections as the raw type or subclasses.
>  - Add a new "of" method to SerializableCoder which takes a boolean
> parameter to control whether we perform this check or not (defaults to
> true?).
>
> Colm.
>
> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <[email protected]> wrote:
>
>> I have seen people ingest data using SerializableCoder.
>>
>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> I won't bring other people's words from private@, but can share mine. I
>>> don't believe it exposes anything new.
>>>
>>> > If it is SerializableCoder - attacker controls the other end of e.g.
>>> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have
>>> an allowlist or try to automatically construct an allowlist] and otherwise
>>> there is no vulnerability for internal coders.
>>>
>>> I have never seen or heard of a user doing dynamic deserialization
>>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is
>>> important to someone then they would need a more secure solution than
>>> SerializableCoder.
>>>
>>> Side note: it would be great to provide an efficient and usable solution
>>> for the problem of wanting to dynamically dispatch serde in the middle of a
>>> pipeline. It is actually independent from being able to provide coders for
>>> a wide variety of types, which we can do a bunch of different (mostly
>>> better) ways. (has a better solution been built since the last time I
>>> thought about this?)
>>>
>>> Kenn
>>>
>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <[email protected]> wrote:
>>>
>>>> The link to the previous covnersation (discussion happened in private@
>>>> and I suppose we can bring some relevant bits here if needed)
>>>>
>>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>>>>
>>>> I remember Robert had some points there, but I am not sure we
>>>> found/agreed on a solution that was relevant and did not break current
>>>> users and their user experience (like the case of blacklists).
>>>>
>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <[email protected]> wrote:
>>>> >
>>>> > Being able to have something that can encode any object (or at least
>>>> a large class of objects) is extremely powerful so requiring
>>>> SerializableCoder<T> to only encode T.class would hurt our users.
>>>> >
>>>> > I believe someone looked at this kind of problem before and we came
>>>> to an agreement of usng an explicit approve/deny list on the class names
>>>> which would address the security concern. I don't remember the thread
>>>> though and couldn't find the thread after a few minutes of searching.
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't
>>>> really have one) should be that it dynamically dispatches what it
>>>> deserializes? I had imagined we would treat it more as a statically
>>>> determined coder, so because it is invariant in T we would not allow up or
>>>> down casts (they are unsafe). But we probably don't actually have the
>>>> static information to do that anyhow so you are probably right.
>>>> >>
>>>> >> I wonder about the threat model here. Is this the event that the
>>>> runner (managed service or bespoke cluster) is compromised and is
>>>> attempting RCE on the Java SDK harness or runner-specific Java-based 
>>>> worker?
>>>> >>
>>>> >> Kenn
>>>> >>
>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <[email protected]> wrote:
>>>> >>>
>>>> >>> I don't think this is going to work since SerializableCoder<T>
>>>> should be able to decode T and all objects that implement/extend T. I'm
>>>> pretty sure SerializableCoder<Set/List/...> is common enough while the
>>>> concrete type is HashSet/ArrayList/...
>>>> >>> I'm pretty sure there is some way you could come up with some way
>>>> for making this optin though.
>>>> >>>
>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
>>>> [email protected]> wrote:
>>>> >>>>
>>>> >>>> Thanks Kenn. I submitted a PR here:
>>>> https://github.com/apache/beam/pull/11191
>>>> >>>>
>>>> >>>> Colm.
>>>> >>>>
>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> I think this is fine. The same coder is used for encode and
>>>> decode, so the Class object should be the same as well. Inheritance is not
>>>> part of the Beam model (thank goodness) so this is a language-specific
>>>> concern. As far as the model is concerned, the full URN and the payload of
>>>> the coder is its identity and coders with different identities have no
>>>> inheritance or compatibility relationship. Pipeline snapshot/update is an
>>>> edge case, but changing coder is not supported by any runner I know of, and
>>>> probably won't be until we have some rather large new ideas.
>>>> >>>>>
>>>> >>>>> Kenn
>>>> >>>>>
>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
>>>> [email protected]> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening
>>>> the Java Object deserialization that is taking place. We have a "Class<T>
>>>> type" that is used to decode the input stream:
>>>> >>>>>>
>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>> >>>>>> return type.cast(ois.readObject());
>>>> >>>>>>
>>>> >>>>>> What I would like to do would be something like:
>>>> >>>>>>
>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>> >>>>>>     @Override
>>>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc)
>>>> throws IOException, ClassNotFoundException {
>>>> >>>>>>         if (!desc.getName().equals(type.getName())) {
>>>> >>>>>>             throw new InvalidClassException("Unauthorized
>>>> deserialization attempt", desc.getName());
>>>> >>>>>>         }
>>>> >>>>>>         return super.resolveClass(desc);
>>>> >>>>>>     }
>>>> >>>>>> };
>>>> >>>>>> return type.cast(ois.readObject());
>>>> >>>>>>
>>>> >>>>>> This would prevent a possible security hole where an attacker
>>>> could try to force the recipient of the input stream to deserialize to a
>>>> gadget class or the like for a RCE.
>>>> >>>>>>
>>>> >>>>>> The question is - does the deserialized type have to correspond
>>>> exactly to the supplied Class? Or is it supported that it's a base type /
>>>> abstract class? If the latter then my idea won't really work. But if the
>>>> type corresponds exactly then it should work OK.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>>
>>>> >>>>>> Colm.
>>>>
>>>

Re: Question on SerializableCoder

Reply via email to