cashmand commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2207429887
Hi @shaeqahmed, I updated the scheme based on the discussion above, while still trying to keep the scheme relatively simple. At a high level, I added the option to define one or more of `object`, `array`, `typed_value` or `untyped_value` at each path segment (including at the top level, rather than having the one-off value/metadata). This provides the flexibility to union multiple schemas, and avoids the problem of having to fetch the top-level value to determine if an intermediate path was only partially shredded. We decided to allow only one `typed_value` at each level, rather than providing one per type. The storage overhead of storing alternative scalar values in `untyped_value` should be fairly low after encoding/compression, and it should still be possible to define custom stats/metadata schemes later if that turns out to be useful for filtering applications. Please take a look, and let me know if you have more feedback.
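To make the per-segment layout concrete, here is a rough in-memory sketch in Python. It is not the PR's Parquet layout or any Spark API: the `ShreddedNode` class, the `shred` and `reconstruct` helpers, and the dict-based schema description (`"object"`, `"array"`, `"typed"` keys) are all hypothetical names chosen for illustration. It assumes that fields or values not covered by the shredding schema fall back to `untyped_value` at the same level, and it does not distinguish a missing value from a null one.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ShreddedNode:
    """One path segment. Any subset of the four fields may be set (hypothetical model)."""
    object: Optional[Dict[str, "ShreddedNode"]] = None  # shredded sub-fields
    array: Optional[List["ShreddedNode"]] = None        # shredded elements
    typed_value: Optional[Any] = None                   # the single typed scalar at this level
    untyped_value: Optional[Any] = None                 # fallback for everything else


def shred(value: Any, schema: Dict[str, Any]) -> ShreddedNode:
    """Split `value` according to `schema`; anything that does not fit stays untyped."""
    if isinstance(value, dict) and "object" in schema:
        fields = schema["object"]
        shredded = {k: shred(v, fields[k]) for k, v in value.items() if k in fields}
        leftover = {k: v for k, v in value.items() if k not in fields}
        # Leftover fields live next to the shredded ones, so a reader can tell
        # at this level whether the object was only partially shredded.
        return ShreddedNode(object=shredded, untyped_value=leftover or None)
    if isinstance(value, list) and "array" in schema:
        return ShreddedNode(array=[shred(v, schema["array"]) for v in value])
    if "typed" in schema and type(value) is schema["typed"]:
        return ShreddedNode(typed_value=value)
    return ShreddedNode(untyped_value=value)


def reconstruct(node: ShreddedNode) -> Any:
    """Rebuild the original value from a shredded node."""
    if node.object is not None:
        result = {k: reconstruct(v) for k, v in node.object.items()}
        if isinstance(node.untyped_value, dict):
            result.update(node.untyped_value)
        return result
    if node.array is not None:
        return [reconstruct(v) for v in node.array]
    if node.typed_value is not None:
        return node.typed_value
    return node.untyped_value


# Hypothetical shredding schema: `a` is an int, `b` is an array of strings.
schema = {"object": {"a": {"typed": int}, "b": {"array": {"typed": str}}}}
row = {"a": 1, "b": ["x", 2], "c": True}
node = shred(row, schema)
```

In this sketch, `node.object["a"]` carries `typed_value == 1`, the mismatched array element `2` lands in that element's `untyped_value`, and the unshredded field `c` sits in the top-level `untyped_value`, so each level can be inspected without consulting a separate top-level value.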