Github user RotemShaul commented on the issue: https://github.com/apache/spark/pull/13761

Not as far as I understand. I'll explain the use case:

1. We do not want to serialize the Avro schema of our processed events, only its id.
2. Our job can process multiple events with different schemas in a single RDD. (For instance, we don't use Avro specific records, so a specific record can't be the element type of our Dataset; it would still have to be Dataset[GenericRecord], not Dataset[Event]. Also, Avro's GenericRecord and SpecificRecord are not case classes.)
3. Most important: we do not know the schemas ahead of time, and we must be able to process an RDD containing multiple different schemas, which are not known to us in advance because they are constantly evolving.

The whole point of using GenericRecords in Avro rather than specific records is that we don't know the schemas ahead of time; we process events that each carry a different schema version (following Avro's schema resolution and evolution rules).

Our specific use case is a kind of sessionization of events by key. We don't do analytics or aggregations: we take an input RDD[GenericRecord] and return RDD[Frame[K, Iterator[GenericRecord]]], and we store that output somewhere. Since we only do a groupBy, we do not care about the event body content (we're infrastructure; the key is a single field that sits in a header which doesn't change), so we never access any body fields and do not care about the evolving body schema. If we were to put each GenericRecord into an Avro-generated SpecificRecord (or some case class), we'd have to know all the fields ahead of time, and encoding into that class would fail, or succeed only partially, whenever events gained new fields or lost old ones.
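The grouping described above can be sketched roughly as follows. This is a minimal, self-contained illustration with hypothetical names (`Event`, `sessionKey`, `sessionId`): plain Scala collections stand in for the RDD, and a `Map` stands in for an Avro `GenericRecord`, so it runs without Spark or Avro on the classpath. In the real job the input would be an `RDD[GenericRecord]` and the `groupBy` would run on Spark.

```scala
// Stand-in for a decoded event: only the header layout is assumed known
// and stable; the body schema may evolve freely between events.
type Event = Map[String, Any]

// The grouping key lives in a fixed header field, so we never need to
// touch (or even know) the body fields.
def sessionKey(e: Event): String =
  e("header").asInstanceOf[Map[String, Any]]("sessionId").asInstanceOf[String]

// On Spark this would be rdd.groupBy(sessionKey); here Seq#groupBy
// demonstrates the same keying logic locally.
def sessionize(events: Seq[Event]): Map[String, Seq[Event]] =
  events.groupBy(sessionKey)

// Three events, two of them sharing a session key; note the bodies carry
// different (evolving) fields, which the grouping never inspects.
val events: Seq[Event] = Seq(
  Map("header" -> Map("sessionId" -> "a"), "body" -> Map("v1Field" -> 1)),
  Map("header" -> Map("sessionId" -> "b"), "body" -> Map("v2Field" -> 2)),
  Map("header" -> Map("sessionId" -> "a"), "body" -> Map("v2Field" -> 3))
)

val sessions = sessionize(events)
println(sessions("a").size)  // 2
```

The point of the sketch: because only the stable header field is read, schema evolution in the body is irrelevant to the job's correctness.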
Basically, if a Dataset is backed by a tabular format behind the scenes, we can't express our data in such a table format: our "table" is dynamic, since each event has a different schema. Taking the superset schema would introduce new empty columns and effectively change the events.

On Mon, Jun 20, 2016 at 8:35 AM, Herman van Hovell <notificati...@github.com> wrote:
> Don't Datasets and Encoders make this less relevant? What would be the
> use case here?
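The "superset schema" objection above can be made concrete with a small sketch. The field names here are hypothetical, and `Map`s again stand in for Avro records: widening two schema versions into one table-like superset fabricates columns, with nulls, that were never present in the original events.

```scala
// Two versions of the "same" event type: version 2 dropped oldField
// and added newField (hypothetical field names).
val v1 = Map("id" -> "e1", "oldField" -> "x")
val v2 = Map("id" -> "e2", "newField" -> "y")

// A tabular Dataset would need one fixed set of columns: the superset.
val supersetColumns = (v1.keySet ++ v2.keySet).toList.sorted

// Widening each event to the superset fills the missing columns with
// null -- i.e. the events themselves are changed, which is exactly the
// problem described above.
def widen(e: Map[String, String]): Map[String, Any] =
  supersetColumns.map(c => c -> e.getOrElse(c, null)).toMap

println(widen(v1))
println(widen(v2))
```

With constantly evolving schemas the superset also grows without bound, so the fixed-columns assumption never stabilizes.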