Ooh, that's a good real-world use case and dataset. Are you using shredding already?
On Thu, 7 May 2026 at 14:37, Adrian Garcia Badaracco via dev
<[email protected]> wrote:

> This is tangential but might be relevant as well. Semi-structured data
> tends to produce wide schemas too. For example, using Parquet Variant to
> shred OpenTelemetry data (HTTP as an example:
> https://opentelemetry.io/docs/specs/semconv/http/http-spans/) we end up
> with hundreds of leaf columns. They are semi-named: from a user / query
> engine perspective they are accessed as `variant_get(attributes,
> 'http.response.status_code')`, but from Parquet's perspective those fields
> are leaf columns.
>
> > On May 7, 2026, at 8:23 AM, Andrew Lamb <[email protected]> wrote:
> >
> > > Each new column typically stores "features" or "embeddings" that are
> > > extracted from the original data, and then those features or
> > > embeddings are used in subsequent training and inference workflows
> >
> > I spoke about this topic (and others) in a talk I did a while ago [1][2]
> > (slides 15 and 16). Julien has done less academic versions of a similar
> > talk as well.
> >
> > Andrew
> >
> > [1]: https://docs.google.com/presentation/d/19F-XvNJ8sgIpIeIduA3PhbsWp4pC-P632J2eJV1cLG8
> > [2]: https://www.youtube.com/watch?v=k9uhw7yqPsQ
> >
> > On Thu, May 7, 2026 at 9:11 AM Andrew Bell <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Can someone please explain why AI processing generates data with wide
> >> schemas? It's not an area I work in so I'm behind in trying to
> >> understand. If you have thousands of columns in a row, are they named?
> >> Are they expected to be queried by a human?
> >>
> >> Thanks,
> >>
> >> --
> >> Andrew Bell
> >> [email protected]
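For anyone following along, Adrian's shredding point can be sketched in a few lines: each dotted key path in the semi-structured attributes becomes its own leaf column, while callers still read values by path name, as with `variant_get(attributes, 'http.response.status_code')`. This is only a toy illustration of the idea, not the actual Parquet Variant shredding layout; the `Shredded` class and the `variant_get` helper below are hypothetical names made up for this sketch.

```python
def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path, value

class Shredded:
    """Toy column-per-leaf-path storage for a batch of semi-structured rows."""
    def __init__(self, rows):
        # dotted path -> list of values, one slot per row (None when absent)
        self.columns = {}
        for i, row in enumerate(rows):
            for path, value in flatten(row):
                col = self.columns.setdefault(path, [None] * len(rows))
                col[i] = value

def variant_get(shredded, path, row=0):
    """Read one leaf value by dotted path, analogous to a query engine's
    variant_get over shredded columns."""
    col = shredded.columns.get(path)
    return col[row] if col is not None else None

# Two OTel-ish HTTP spans; every distinct attribute path becomes a column.
spans = [
    {"http": {"request": {"method": "GET"}, "response": {"status_code": 200}}},
    {"http": {"request": {"method": "POST"}, "response": {"status_code": 503}}},
]
shredded = Shredded(spans)
print(sorted(shredded.columns))  # each leaf path is its own "leaf column"
print(variant_get(shredded, "http.response.status_code", row=1))
```

With real telemetry carrying hundreds of distinct attribute paths, the same mechanism yields hundreds of leaf columns, which is exactly the wide-schema effect the thread is discussing.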
