Ooh, that's a good real-world use case and dataset. Are you using
shredding already?
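
The shredding Adrian describes below can be sketched in plain Python. This is an illustration only, not the actual Parquet Variant encoding: the `shred` helper and the sample span attributes are hypothetical, but they show why a single nested attributes blob fans out into many dotted leaf columns that a query engine can then target individually.

```python
# Illustration only: a sketch of why shredding semi-structured
# attributes produces many leaf columns (NOT the Variant spec itself).

def shred(attrs, prefix=""):
    """Flatten nested attributes into dotted leaf paths, one per column."""
    leaves = {}
    for key, value in attrs.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            leaves.update(shred(value, path))
        else:
            leaves[path] = value
    return leaves

# Hypothetical OTel-style HTTP span attributes:
span_attrs = {
    "http": {
        "request": {"method": "GET", "body": {"size": 512}},
        "response": {"status_code": 200, "body": {"size": 1024}},
    },
    "server": {"address": "example.com", "port": 443},
}

columns = shred(span_attrs)
# Each dotted path becomes one leaf column; conceptually,
# variant_get(attributes, 'http.response.status_code') reads just that leaf.
print(columns["http.response.status_code"])  # -> 200
print(len(columns))                          # -> 6 leaf columns
```

With the full HTTP semantic conventions rather than this toy dict, the same fan-out lands in the hundreds of leaf columns Adrian mentions.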

On Thu, 7 May 2026 at 14:37, Adrian Garcia Badaracco via dev <
[email protected]> wrote:

> This is tangential but might be relevant as well. Semi-structured data
> tends to produce wide schemas too. For example, using Parquet Variant to
> shred OpenTelemetry data (HTTP as an example
> https://opentelemetry.io/docs/specs/semconv/http/http-spans/) we end up
> with hundreds of leaf columns. They are semi-named: from a user / query
> engine perspective they are accessed as `variant_get(attributes,
> 'http.response.status_code')`, but from Parquet's perspective those fields
> are leaf columns.
>
> > On May 7, 2026, at 8:23 AM, Andrew Lamb <[email protected]> wrote:
> >
> > Each new column typically stores "features" or "embeddings" that are
> > extracted from the original data, and then those features or embeddings
> are
> > used in subsequent training and inference workflows
> >
> > I spoke about this topic (and others) in a talk I did a while ago [1][2]
> > (slides 15 and 16). Julien has done less academic versions of a similar
> > talk as well.
> >
> > Andrew
> >
> > [1]:
> >
> https://docs.google.com/presentation/d/19F-XvNJ8sgIpIeIduA3PhbsWp4pC-P632J2eJV1cLG8
> > [2]: https://www.youtube.com/watch?v=k9uhw7yqPsQ
> >
> > On Thu, May 7, 2026 at 9:11 AM Andrew Bell <[email protected]>
> wrote:
> >
> >> Hi,
> >>
> >> Can someone please explain why AI processing generates data with wide
> >> schemas? It's not an area I work in so I'm behind in trying to
> understand.
> >> If you have thousands of columns in a row, are they named? Are they
> >> expected to be queried by a human?
> >>
> >> Thanks,
> >>
> >> --
> >> Andrew Bell
> >> [email protected]
> >>
>
>