cashmand commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2168665661

   Hi @shaeqahmed, sorry for the delay, and for not replying earlier about how 
nested structs are handled. I’ll try to update the doc with an example, but in 
the meantime, the plan is to support two of the cases you described:
   
   - Adding the struct directly as a nested key path within the existing 
`paths` structure is meant to be the primary approach. The example at the end 
of the doc shows an array-of-struct with this form, but a struct-of-struct 
would look the same. At any nesting level, if a given key doesn’t exist in the 
Parquet schema, it would be stored in the top-level `value` binary. A request 
for any non-leaf field would require checking the top-level `value`, and 
merging the result with the shredded values (as described in the pseudo-code in 
the PR).
   - Adding a nested key path as a nested Variant is supported. This is 
indicated by just including `untyped_value`, with no corresponding 
`typed_value`. But in this case, it wouldn’t be possible to recursively shred 
the nested value.
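   
   As a rough illustration of the two cases above, here is a minimal Python 
sketch of the read path. It is not the pseudo-code from the PR: plain dicts 
stand in for the Parquet columns and the Variant binaries, and the helper names 
(`read_field`, `read_shredded`, `lookup`, `merge`, `MISSING`) are hypothetical. 
Only `paths`, `value`, `typed_value` and `untyped_value` come from the proposal. 
It also assumes that shredded values win if a key somehow appears in both places.

```python
from typing import Any, Dict, List

MISSING = object()  # sentinel for "field absent" (hypothetical helper)


def lookup(value: Any, path: List[str]) -> Any:
    """Walk a decoded Variant (modelled as nested dicts) along `path`."""
    cur = value
    for key in path:
        if not isinstance(cur, dict) or key not in cur:
            return MISSING
        cur = cur[key]
    return cur


def read_shredded(node: Dict[str, Any]) -> Any:
    """Rebuild a value from shredded columns only."""
    if "paths" in node:  # nested struct: recurse into each shredded key
        out = {}
        for key, child in node["paths"].items():
            val = read_shredded(child)
            if val is not MISSING:
                out[key] = val
        return out if out else MISSING
    for column in ("typed_value", "untyped_value"):
        if node.get(column) is not None:
            return node[column]
    return MISSING


def merge(shredded: Any, residual: Any) -> Any:
    """Combine shredded fields with fields left in the top-level `value`."""
    if shredded is MISSING:
        return residual
    if residual is MISSING:
        return shredded
    if isinstance(shredded, dict) and isinstance(residual, dict):
        merged = dict(residual)
        for key, val in shredded.items():
            merged[key] = merge(val, merged.get(key, MISSING))
        return merged
    return shredded  # assumption: shredded value wins on conflict


def read_field(row: Dict[str, Any], path: List[str]) -> Any:
    node = {"paths": row["paths"]}
    for depth, key in enumerate(path):
        if "paths" not in node:
            # Remaining keys point inside a nested Variant leaf.
            return lookup(node.get("untyped_value"), path[depth:])
        node = node["paths"].get(key)
        if node is None:
            # Key is not in the schema: only the top-level `value` has it.
            return lookup(row["value"], path)
    # Non-leaf (or leaf) in the schema: merge with the top-level `value`.
    return merge(read_shredded(node), lookup(row["value"], path))


# Example row: `a.x` is shredded as a typed column, `a.y` as a nested Variant,
# and `a.z` and `other` were not in the schema, so they stay in `value`.
row = {
    "value": {"a": {"z": "extra"}, "other": 1},
    "paths": {
        "a": {"paths": {
            "x": {"typed_value": 1},
            "y": {"untyped_value": {"k": [1, 2]}},
        }},
        "b": {"typed_value": "hello"},
    },
}
```

   With this row, `read_field(row, ["a", "x"])` only touches the shredded 
column, while `read_field(row, ["a"])` has to consult the top-level `value` as 
well and merge in `z`.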
   
   Please let me know if the above is clear, or if I’m misunderstanding the 
question.
   
   Thanks for describing your use case and the papers you’ve referenced. The 
CloudTrail use case makes a lot of sense, and is definitely one that we should 
consider carefully. For the current approach, I think it would make sense to 
shred a field like `requestParameters` as a Variant binary. This would provide 
a lot of the benefit, since queries on `requestParameters` would not need to 
fetch the top-level binary or any other columns.
   
   I can see that the more flexible schema you’ve proposed could provide better 
performance for some query patterns, though. At the same time, we’d like to 
minimize complexity in the spec, the Parquet footer, and the implementation.
   
   I’d like to spend a bit more time looking at the papers you’ve linked to, 
and considering the trade-offs between the proposals. Can you give us a better 
idea of what type of queries you expect to see on the read path, and how your 
scheme would benefit? E.g. would you expect to typically see a mix of queries 
that need all of `requestParameters`, and others that only need a field or two? 
What type of query is likely to benefit significantly from shredding different 
types (e.g. integer and string) vs. just shredding the most common type, and 
fetching the rest from the binary? We would like to better understand how the 
shredding scheme will improve read performance for your workload. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

