This is tangential but might be relevant as well. Semi-structured data tends to 
produce wide schemas too. For example, using Parquet Variant to shred 
OpenTelemetry data (HTTP spans, for example: 
https://opentelemetry.io/docs/specs/semconv/http/http-spans/), we end up with 
hundreds of leaf columns. They are semi-named: from a user / query engine 
perspective they are accessed as `variant_get(attributes, 
'http.response.status_code')`, but from Parquet's perspective those fields are 
leaf columns.
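
To make the column blow-up concrete, here is a rough Python sketch. It is not 
how any Parquet writer is implemented; it only models my reading of the Variant 
shredding layout, where each shredded object field gets its own `value` and 
`typed_value` leaf. The sample spans and the `shredded_leaf_columns` helper are 
made up for illustration; only the attribute names come from the OpenTelemetry 
semantic conventions.

```python
# Hypothetical span attributes; the keys follow OpenTelemetry semantic conventions.
spans = [
    {
        "http.request.method": "GET",
        "http.response.status_code": 200,
        "url.full": "https://example.com/a",
    },
    {
        "http.request.method": "POST",
        "http.response.status_code": 500,
        "error.type": "500",
        "server.address": "example.com",
    },
]


def shredded_leaf_columns(rows, variant_column="attributes"):
    """List the leaf column paths produced if every observed key is shredded.

    Paths are tuples rather than dotted strings because the attribute
    names themselves contain dots.
    """
    # The variant column always carries its metadata and a residual value.
    leaves = [(variant_column, "metadata"), (variant_column, "value")]
    for key in sorted({key for row in rows for key in row}):
        # Assumed layout: one typed leaf plus one fallback leaf per shredded field.
        leaves.append((variant_column, "typed_value", key, "typed_value"))
        leaves.append((variant_column, "typed_value", key, "value"))
    return leaves


leaves = shredded_leaf_columns(spans)
for path in leaves:
    print(path)
print(len(leaves), "leaf columns from", len(spans), "spans")
```

Five distinct attribute names already give 12 leaves here, so a few hundred 
attributes across a real trace dataset gets to a wide schema quickly, even 
though the user only ever sees one `attributes` column.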

> On May 7, 2026, at 8:23 AM, Andrew Lamb <[email protected]> wrote:
> 
> Each new column typically stores "features" or "embeddings" that are
> extracted from the original data, and then those features or embeddings are
> used in subsequent training and inference workflows
> 
> I spoke about this topic (and others) in a talk I did a while ago [1][2]
> (slides 15 and 16). Julien has done less academic versions of a similar
> talk as well
> 
> Andrew
> 
> [1]:
> https://docs.google.com/presentation/d/19F-XvNJ8sgIpIeIduA3PhbsWp4pC-P632J2eJV1cLG8
> [2]: https://www.youtube.com/watch?v=k9uhw7yqPsQ
> 
> On Thu, May 7, 2026 at 9:11 AM Andrew Bell <[email protected]> wrote:
> 
>> Hi,
>> 
>> Can someone please explain why AI processing generates data with wide
>> schemas? It's not an area I work in so I'm behind in trying to understand.
>> If you have thousands of columns in a row, are they named? Are they
>> expected to be queried by a human?
>> 
>> Thanks,
>> 
>> --
>> Andrew Bell
>> [email protected]
>> 
