This is tangential but might be relevant as well. Semi-structured data tends to produce wide schemas too. For example, using Parquet Variant to shred OpenTelemetry data (HTTP as an example: https://opentelemetry.io/docs/specs/semconv/http/http-spans/), we end up with hundreds of leaf columns. They are only semi-named: from a user / query engine perspective they are accessed as `variant_get(attributes, 'http.response.status_code')`, but from Parquet's perspective those fields are leaf columns.
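To make the "hundreds of leaf columns" point concrete, here is a minimal sketch of how shredding nested attributes produces one leaf column per dotted path. This is hand-rolled illustration code, not Parquet's actual Variant shredding implementation; the attribute names follow the OpenTelemetry HTTP semantic conventions linked above, and the `shred` helper is hypothetical:

```python
def shred(attributes, prefix=""):
    """Flatten a nested attributes dict into (leaf_column_name, value) pairs.

    Illustration only: real Variant shredding also records types and
    handles values that cannot be shredded, but the naming idea is the same.
    """
    leaves = {}
    for key, value in attributes.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            leaves.update(shred(value, path))
        else:
            leaves[path] = value
    return leaves

# A tiny slice of the attributes an HTTP span might carry; a real span
# has far more, which is where the hundreds of leaf columns come from.
span_attributes = {
    "http": {
        "request": {"method": "GET"},
        "response": {"status_code": 200},
        "route": "/users/{id}",
    },
    "server": {"address": "example.com", "port": 443},
}

leaves = shred(span_attributes)
for name in sorted(leaves):
    print(name)
# Each printed path (e.g. http.response.status_code) is conceptually its
# own leaf column in the shredded Parquet file.
```

The user-facing name is the dotted path passed to `variant_get`, while the storage layer sees each path as an independent leaf column.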
> On May 7, 2026, at 8:23 AM, Andrew Lamb <[email protected]> wrote:
>
> Each new column typically stores "features" or "embeddings" that are
> extracted from the original data, and then those features or embeddings are
> used in subsequent training and inference workflows
>
> I spoke about this topic (and others) in a talk I did a while ago [1][2]
> (slides 15 and 16). Julien has done less academic versions of a similar
> talk as well
>
> Andrew
>
> [1]:
> https://docs.google.com/presentation/d/19F-XvNJ8sgIpIeIduA3PhbsWp4pC-P632J2eJV1cLG8
> [2]: https://www.youtube.com/watch?v=k9uhw7yqPsQ
>
> On Thu, May 7, 2026 at 9:11 AM Andrew Bell <[email protected]> wrote:
>
>> Hi,
>>
>> Can someone please explain why AI processing generates data with wide
>> schemas? It's not an area I work in so I'm behind in trying to understand.
>> If you have thousands of columns in a row, are they named? Are they
>> expected to be queried by a human?
>>
>> Thanks,
>>
>> --
>> Andrew Bell
>> [email protected]
>>
