Hello, I'm trying to make a call on whether my team should invest time adding a step to "flatten" our schema as part of our ETL pipeline, to improve the performance of interactive queries.
Our data starts life as Avro before being converted to Parquet, so we follow the Avro idiom of defining our own types to reduce boilerplate. For example, every record we define has a "metadata" struct field containing the fields common to all records in the system design. Those fields are accessed by virtually every query, which means nearly all of our queries fall short of the performance we could be getting in Spark SQL and similar engines.

So my question: is this just inherently how nested fields behave, or should we expect future releases to put them on a par with flat fields? I ask because I've seen a similar performance gap in Presto too. In benchmarks of both Spark and Presto, I generally see queries on flat fields run 5-6x faster than the same queries on nested fields.

If fields nested in structs will always be much slower than flat fields, I'd be keen to address that in our ETL pipeline with a flattening step. If it's a known issue that we expect will be fixed in upcoming releases, I'll hold off.

Any advice greatly appreciated!

Thanks, James.