[ https://issues.apache.org/jira/browse/SPARK-12197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086441#comment-15086441 ]
Apache Spark commented on SPARK-12197:
--------------------------------------

User 'RotemShaul' has created a pull request for this issue:
https://github.com/apache/spark/pull/10625

> Kryo's Avro Serializer: add support for dynamic schemas using a SchemaRepository
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-12197
>                 URL: https://issues.apache.org/jira/browse/SPARK-12197
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.0
>            Reporter: Rotem Shaul
>              Labels: avro, kryo, schema, serialization
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The original problem: serializing GenericRecords in Spark Core incurs very
> high overhead, because the schema is serialized with every record (whereas
> in the actual HDFS input data it is stored only once per file).
>
> The extended problem: Spark 1.5 introduced the ability to register Avro
> schemas ahead of time via SparkConf. This solution is only partial, as some
> applications cannot know ahead of time exactly which schemas they are going
> to read.
>
> The extended solution: add a schema repository to the serializer. Assuming
> each generic record carries a schemaId, the id can be extracted dynamically
> from the records being read, and only the schemaId is serialized.
> Upon deserialization the schema repository is queried again. The local
> caching mechanism remains intact, so in practice each task queries the
> schema repository only once per schemaId.
> The existing static registration of schemas stays in place, as it is more
> efficient when the schemas are known ahead of time.
>
> New flow for serializing a generic record:
> 1) Check the pre-registered schema list; if the schema is found, serialize
> only its fingerprint.
> 2) If not found, and a schema repository has been set, attempt to extract
> the schemaId from the record and check whether the repository contains that
> id. If so, serialize only the schema id.
> 3) If no schema repository is set, or the schemaId was not found in the
> repository, compress and send the entire schema.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
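The three-step decision flow proposed above can be sketched roughly as follows. This is a minimal illustration in Java, not Spark's actual GenericAvroSerializer: the class SchemaFlowSketch, the SchemaRepo interface, and the "FP:"/"ID:"/"FULL:" output tags are hypothetical names invented for this example, and schema compression in step 3 is elided.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed serialization decision flow.
public class SchemaFlowSketch {

    /** Hypothetical repository mapping schema ids to schema JSON. */
    public interface SchemaRepo {
        String get(long schemaId); // returns null when the id is unknown
    }

    // Statically pre-registered schemas, keyed by schema JSON (step 1).
    private final Map<String, Long> preRegistered = new HashMap<>();
    // Dynamic schema repository (step 2); null when not configured.
    private final SchemaRepo repo;
    // Local cache so each task resolves a given schemaId at most once
    // (only successful lookups are cached in this sketch).
    private final Map<Long, String> cache = new HashMap<>();

    public SchemaFlowSketch(SchemaRepo repo) {
        this.repo = repo;
    }

    public void register(String schemaJson, long fingerprint) {
        preRegistered.put(schemaJson, fingerprint);
    }

    /** Decide what to emit for a record carrying (schemaId, schemaJson). */
    public String encode(long schemaId, String schemaJson) {
        // 1) Pre-registered schema: emit only its fingerprint.
        Long fp = preRegistered.get(schemaJson);
        if (fp != null) {
            return "FP:" + fp;
        }
        // 2) Repo configured and knows this id: emit only the schema id.
        if (repo != null) {
            String fromRepo = cache.computeIfAbsent(schemaId, repo::get);
            if (fromRepo != null) {
                return "ID:" + schemaId;
            }
        }
        // 3) Fallback: emit the entire schema (compression elided here).
        return "FULL:" + schemaJson;
    }
}
```

The key design point is the ordering: the cheap static fingerprint path is tried first, the repository lookup (with its per-task cache) second, and shipping the full schema remains the safe fallback so that records with unknown schemas still serialize correctly.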