cashmand commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2207429887

   Hi @shaeqahmed, I updated the scheme based on the discussion above, while 
still trying to keep it relatively simple. At a high level, I added the 
option to define one or more of `object`, `array`, `typed_value` or 
`untyped_value` at each path segment (including at the top level, rather than 
having the one-off value/metadata). This provides the flexibility to union 
multiple schemas, and avoids the problem of having to fetch the top-level value 
to determine if an intermediate path was only partially shredded.
   
   We decided to allow only one `typed_value` at each level, rather than 
providing one per type. The overhead of storing alternative scalar 
values in `untyped_value` should be fairly low after encoding/compression, and 
it should still be possible to define custom stats/metadata schemes later if 
that turns out to be useful for filtering applications.
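
   To make the layout concrete, here is a rough Python sketch of a single path segment and how a value could be read back from it. It is an illustration only, not the Spark implementation or the actual Parquet schema: the `Node` class, `shred_scalar`, and `reconstruct` are hypothetical names, a plain Python value stands in for the Variant-encoded `untyped_value`, and the sketch assumes that at most one of the four fields is populated per row.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """One path segment. Every field is optional (hypothetical layout)."""
    object: Optional[Dict[str, "Node"]] = None   # shredded object fields
    array: Optional[List["Node"]] = None         # shredded array elements
    typed_value: Optional[Any] = None            # the single typed column at this level
    untyped_value: Optional[Any] = None          # stand-in for a Variant-encoded value


def shred_scalar(value: Any, expected_type: type) -> Node:
    """Put a scalar in typed_value if it has the expected type, else in untyped_value."""
    # Exact type match, so that bool is not silently treated as int.
    if type(value) is expected_type:
        return Node(typed_value=value)
    return Node(untyped_value=value)


def reconstruct(node: Node) -> Any:
    """Rebuild the original value, assuming at most one field is populated."""
    if node.object is not None:
        return {key: reconstruct(child) for key, child in node.object.items()}
    if node.array is not None:
        return [reconstruct(child) for child in node.array]
    if node.typed_value is not None:
        return node.typed_value
    # Simplification: a missing value and a null value both come back as None.
    return node.untyped_value


# Field "a" matches the expected int type; "b" does not and falls back to
# untyped_value; "c" is an array whose elements are shredded individually.
row = Node(object={
    "a": shred_scalar(1, int),
    "b": shred_scalar("not an int", int),
    "c": Node(array=[shred_scalar(2, int), shred_scalar(2.5, int)]),
})

print(reconstruct(row))
```

   Because each level carries its own fields, a reader in this sketch can resolve a path such as `c[1]` by looking only at the nodes along that path.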
   
   Please take a look, and let me know if you have more feedback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


