[ https://issues.apache.org/jira/browse/SPARK-27799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-27799: ------------------------------- Description: Kryo serialization can offer a substantial performance boost compared to Java serialization and I generally recommend that users configure Spark to use it. That said, in general it may not be safe to _blindly_ flip the default to Kryo: certain jobs might depend on Java serialization, so switching them to Kryo might cause crashes or incorrect behavior. However, we may know that certain data types are safe to serialize with Kryo, in which case we can whitelist _just those types_ for use with Kryo serialization but keep everything else using the default Java serializer. Back in SPARK-13926 (Spark 2.0) I added a {{SerializerManager}} to implement this idea for strings, primitives, primitive arrays, and a few other data types: those types will automatically use Kryo serialization when used as top-level types in RDDs. However, there's no ability for users to customize / extend this whitelist. I propose to add a new user-facing configuration, name TBD, which accepts a comma-separated list of class / interface names and uses them to expand the {{SerializerMananger.canUseKryo}} whitelist. This will allow advanced users to incrementally default to Kryo for certain types (e.g. Scrooge ThriftStructs). This feature is useful for "data platform" teams who provide Spark-as-a-service to internal customers: with this proposed configuration, platform teams can configure global defaults for serialization in a way which is more incremental / narrow-in-scope than simply defaulting to Kryo everywhere. was: Kryo serialization can offer a substantial performance boost compared to Java serialization and I generally recommend that users configure Spark to use it. That said, in general it may not be safe to _blindly_ flip the default to Kryo: certain jobs might depend on Java serialization, so switching them to Kryo might cause crashes or incorrect behavior. However, we may know that certain data types are safe to serialize with Kryo, in which case we can whitelist _just those types_ for use with Kryo serialization but keep everything else using the default Java serializer. Back in SPARK-13926 (Spark 2.0) I added a {{SerializerManager}} to implement this idea for strings, primitives, primitive arrays, and a few other data types: those types will automatically use Kryo serialization when used as top-level types in RDDs. However, there's no ability for users to customize / extend this whitelist. I propose to add a new user-facing configuration, name TBD, which accepts a comma-separated list of class / interface names and uses the to expand the {{SerializerMananger.canUseKryo}} whitelist. This will allow advanced users to incrementally default to Kryo for certain types (e.g. Scrooge ThriftStructs). This feature is useful for "data platform" teams who provide Spark-as-a-service to internal customers: with this proposed configuration, platform teams can configure global defaults for serialization in a way which is more incremental / narrow-in-scope than simply defaulting to Kryo everywhere. > Allow SerializerManager.canUseKryo whitelist to be extended via a > configuration > ------------------------------------------------------------------------------- > > Key: SPARK-27799 > URL: https://issues.apache.org/jira/browse/SPARK-27799 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.0 > Reporter: Josh Rosen > Priority: Major > > Kryo serialization can offer a substantial performance boost compared to Java > serialization and I generally recommend that users configure Spark to use it. > That said, in general it may not be safe to _blindly_ flip the default to > Kryo: certain jobs might depend on Java serialization, so switching them to > Kryo might cause crashes or incorrect behavior. > However, we may know that certain data types are safe to serialize with Kryo, > in which case we can whitelist _just those types_ for use with Kryo > serialization but keep everything else using the default Java serializer. > Back in SPARK-13926 (Spark 2.0) I added a {{SerializerManager}} to implement > this idea for strings, primitives, primitive arrays, and a few other data > types: those types will automatically use Kryo serialization when used as > top-level types in RDDs. However, there's no ability for users to customize / > extend this whitelist. > I propose to add a new user-facing configuration, name TBD, which accepts a > comma-separated list of class / interface names and uses them to expand the > {{SerializerMananger.canUseKryo}} whitelist. > This will allow advanced users to incrementally default to Kryo for certain > types (e.g. Scrooge ThriftStructs). > This feature is useful for "data platform" teams who provide > Spark-as-a-service to internal customers: with this proposed configuration, > platform teams can configure global defaults for serialization in a way which > is more incremental / narrow-in-scope than simply defaulting to Kryo > everywhere. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org