Hello,

I'm trying to make a call on whether my team should invest time adding a
step to "flatten" our schema as part of our ETL pipeline, to improve the
performance of interactive queries.

Our data starts life as Avro before being converted to Parquet, and so we
follow the Avro idiom of defining our own types to reduce boilerplate in
many areas. For example, every record we define has a "metadata" struct
field containing the fields that are common to all records in the system
design. Because those fields are so widely used, virtually every query
needs to access them, which means nearly all of our queries see worse
performance than they could in Spark SQL, etc.
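To make this concrete, here's roughly the shape of query I mean, as a
minimal Spark (Scala) sketch. The field names and path are made up for
illustration, not our real schema:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("nested-vs-flat").getOrCreate()

    // Read the Parquet output of our ETL pipeline (path is hypothetical).
    val df = spark.read.parquet("s3://bucket/events")
    df.createOrReplaceTempView("events")

    // A typical query today: filter and group on fields nested inside the
    // shared "metadata" struct.
    spark.sql("""
      SELECT metadata.source, count(*)
      FROM events
      WHERE metadata.event_time > '2017-01-01'
      GROUP BY metadata.source
    """).show()

The equivalent query over top-level source and event_time columns is the
"flat" case I compare against below.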

So my question is: is this just inherently the way it is, or do we expect
future releases to put nested fields on a par with flat fields? The reason
I ask is that I've seen similar performance differences with Presto too. In
benchmarks of both Spark and Presto, I generally see queries over flat
fields run 5-6x faster than the same queries over a nested field.

If we expect fields nested in structs to always be much slower than flat
fields, then I would be keen to address that in our ETL pipeline with a
flattening step. If it's a known issue that we expect will be fixed in
upcoming releases, I'll hold off.
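
For reference, the flattening step itself looks cheap to write; something
like the following sketch (continuing the made-up schema above, where the
shared struct is called "metadata"):

    import org.apache.spark.sql.functions.col

    val df = spark.read.parquet("s3://bucket/events")

    // Promote every field of the "metadata" struct to a top-level column,
    // keep the remaining columns as they are, then rewrite as Parquet.
    val flat = df.select(
      df.columns.filterNot(_ == "metadata").map(col) :+ col("metadata.*"): _*)
    flat.write.parquet("s3://bucket/events_flat")

(We'd also have to watch for name collisions between metadata fields and
existing top-level columns.)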

Any advice greatly appreciated!

Thanks,

James.
