[ https://issues.apache.org/jira/browse/SPARK-12197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086441#comment-15086441 ]

Apache Spark commented on SPARK-12197:
--------------------------------------

User 'RotemShaul' has created a pull request for this issue:
https://github.com/apache/spark/pull/10625

> Kryo's Avro Serializer add support for dynamic schemas using SchemaRepository
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-12197
>                 URL: https://issues.apache.org/jira/browse/SPARK-12197
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.0
>            Reporter: Rotem Shaul
>              Labels: avro, kryo, schema, serialization
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The original problem: serializing GenericRecords in Spark Core incurs very 
> high overhead, because the schema is serialized with every record (whereas in 
> the actual HDFS input data it is stored only once per file).
> The extended problem: Spark 1.5 introduced the ability to register Avro 
> schemas ahead of time via SparkConf. This solution is only partial, since 
> some applications may not know exactly which schemas they are going to read 
> ahead of time.
> Extended solution:
> Add a schema repository to the serializer. Assuming the generic records carry 
> a schemaId, it can be extracted dynamically from the records being read, and 
> only the schemaId is serialized.
> Upon deserialization, the schema repository is queried again.
> The local caching mechanism remains intact, so in practice each Task queries 
> the schema repo only once per schemaId.
> The existing static registration of schemas remains in place, as it is more 
> efficient when the schemas are known ahead of time.
> New flow for serializing a generic record:
> 1) Check the pre-registered schema list; if the schema is found, serialize 
> only its fingerprint.
> 2) If not found, and a schema repo has been set, attempt to extract the 
> schemaId from the record and check whether the repo contains that id. If so, 
> serialize only the schema id.
> 3) If no schema repo is set, or the schemaId was not found in the repo, 
> compress and send the entire schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
