shaeqahmed commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2172052552

   Thanks @cashmand, and yes, that is clear. The queries one would expect to 
see on the AWS CloudTrail requestParameters and responseElements fields fall 
into two types: (1) needle-in-a-haystack searches that filter on just a few 
nested field paths corresponding to an AWS service and return all columns, and 
(2) analytical dashboards that aggregate or summarize a few fields, such as 
service-level dashboards (e.g. for AWS S3) that aggregate over some of the 
subfields pertaining to each service within requestParameters and 
responseElements. 
   
   An example of the first type is the following repository, which contains a 
variety of search-oriented security detection rules for AWS: 
https://github.com/sbasu7241/AWS-Threat-Simulation-and-Detection. 
   
   Both types of queries (highly selective search and analytical) would 
benefit massively from shredding/subcolumnarization over being stored as a 
binary variant, because of the better compression, the column statistics, and 
the reduction in data scanned. Note that these fields can contain massive 
string blobs coming from user input (e.g. a stringified SQL query as a request 
parameter: 
https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html#athena-StartQueryExecution-request-QueryString)
 alongside compact low-cardinality or numerical fields that are useful in a 
query or a dashboard (e.g. viewing distinct 
requestParameters.policyDocument.statements[*].action, or a search like 
requestParameters.ipPermissions.items[0].ipRanges.items[0].cidrIp == 
"127.0.0.1"), which is why shredding is important for performance on this type 
of semi-structured field.
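
   To make the two query shapes concrete, here is a rough, self-contained 
sketch in plain Python. It is not Spark code and does not use the variant 
API. The sample events and their values are made up, `get_path` is a 
hypothetical helper, and pulling one path out into its own list only stands in 
for what a shredded column would look like. Only the two field paths come from 
the examples above.

```python
import json

# Hypothetical CloudTrail-like events; all values are invented for illustration.
events = [
    {"eventSource": "athena.amazonaws.com",
     "requestParameters": {
         # Large user-supplied string blob, like a stringified SQL query.
         "queryString": "SELECT * FROM t WHERE " + "x = 1 OR " * 200 + "x = 2"}},
    {"eventSource": "ec2.amazonaws.com",
     "requestParameters": {"ipPermissions": {"items": [
         {"ipRanges": {"items": [{"cidrIp": "127.0.0.1"}]}}]}}},
    {"eventSource": "iam.amazonaws.com",
     "requestParameters": {"policyDocument": {"statements": [
         {"action": "s3:GetObject"}, {"action": "s3:PutObject"}]}}},
    {"eventSource": "iam.amazonaws.com",
     "requestParameters": {"policyDocument": {"statements": [
         {"action": "s3:GetObject"}]}}},
]


def get_path(value, path):
    """Follow dict keys / list indexes in order; return None if a step is missing."""
    for step in path:
        try:
            value = value[step]
        except (KeyError, IndexError, TypeError):
            return None
    return value


CIDR_PATH = ["requestParameters", "ipPermissions", "items", 0,
             "ipRanges", "items", 0, "cidrIp"]
STATEMENTS_PATH = ["requestParameters", "policyDocument", "statements"]

# Unshredded stand-in: every row is one opaque blob that must be read in full.
blobs = [json.dumps(e) for e in events]

# Shredded stand-in: the path of interest is stored as its own column.
cidr_column = [get_path(e, CIDR_PATH) for e in events]

# Query type 1 (needle in a haystack): filter on one nested path.
matches_blob = [i for i, b in enumerate(blobs)
                if get_path(json.loads(b), CIDR_PATH) == "127.0.0.1"]
matches_col = [i for i, v in enumerate(cidr_column) if v == "127.0.0.1"]

# Characters touched by each approach; the large queryString never has to be
# read when only the cidrIp column is scanned.
chars_blob = sum(len(b) for b in blobs)
chars_col = sum(len(v) for v in cidr_column if v is not None)

# Query type 2 (analytical): distinct statements[*].action values.
actions = sorted({s["action"]
                  for e in events
                  for s in (get_path(e, STATEMENTS_PATH) or [])})

print(matches_blob, matches_col, chars_col, "<", chars_blob, actions)
```

   Both approaches find the same row, but the column scan never touches the 
large query string, which is the effect shredding is meant to provide at the 
file-format level.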


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

