shaeqahmed commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2172052552
Thanks @cashmand, and yes, that is clear. The queries one would expect to see on AWS CloudTrail requestParameters and responseElements fall into two types. The first is needle-in-a-haystack search: queries that filter on just a few nested field paths specific to an AWS service and return all columns. The second is analytical dashboards, such as service-level dashboards (e.g. AWS S3, etc.), that aggregate or summarize a few of the sub-fields pertaining to each service within requestParameters and responseElements. An example of the first type is the following repository, which contains a variety of search-oriented AWS security detection rules: https://github.com/sbasu7241/AWS-Threat-Simulation-and-Detection.

Both types of queries (highly selective search and analytical) would benefit massively from shredding/subcolumnarization over being stored as a binary variant, due to the compression, statistics, and reduction in data scanned. Note that these fields can contain massive string blobs coming from user input (e.g. a stringified SQL query passed as a request parameter: https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html#athena-StartQueryExecution-request-QueryString) alongside compact low-cardinality or numerical fields that are useful in a query or a dashboard (e.g. viewing distinct requestParameters.policyDocument.statements[*].action, or a search like requestParameters.ipPermissions.items[0].ipRanges.items[0].cidrIp == "127.0.0.1"), which is why shredding is important for performance on this type of semi-structured field.
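To make the two query shapes concrete, here is a toy Python sketch of the idea, not the Spark or Parquet implementation. The helper names `get_path` and `shred`, the sample records, and the decision to keep a full JSON blob next to the shredded column are all assumptions made for illustration; a real shredded layout would avoid storing the shredded value twice and would keep per-column statistics.

```python
import json
from collections import Counter


def get_path(value, path):
    """Follow a tuple of dict keys and list indexes; return None if a step is missing."""
    for step in path:
        if isinstance(step, int) and isinstance(value, list) and 0 <= step < len(value):
            value = value[step]
        elif isinstance(step, str) and isinstance(value, dict) and step in value:
            value = value[step]
        else:
            return None
    return value


def shred(records, paths):
    """Copy the chosen paths into per-path columns; keep each record as an opaque blob."""
    columns = {path: [get_path(r, path) for r in records] for path in paths}
    blobs = [json.dumps(r).encode("utf-8") for r in records]
    return columns, blobs


# Hypothetical requestParameters values: one huge user-supplied string, two small fields.
records = [
    {"queryString": "SELECT * FROM t WHERE " + "x = 1 AND " * 1000 + "TRUE"},
    {"ipPermissions": {"items": [{"ipRanges": {"items": [{"cidrIp": "127.0.0.1"}]}}]}},
    {"ipPermissions": {"items": [{"ipRanges": {"items": [{"cidrIp": "10.0.0.0/8"}]}}]}},
]

CIDR = ("ipPermissions", "items", 0, "ipRanges", "items", 0, "cidrIp")
columns, blobs = shred(records, [CIDR])

# Needle-in-a-haystack search: scan only the small shredded column,
# then decode the full blob for matching rows only.
matches = [i for i, v in enumerate(columns[CIDR]) if v == "127.0.0.1"]
rows = [json.loads(blobs[i]) for i in matches]

# Dashboard-style aggregation: count values without touching any blob.
counts = Counter(v for v in columns[CIDR] if v is not None)
```

In this sketch neither query has to parse the large `queryString` blob: the filter decodes only the single matching row, and the aggregation reads the shredded column alone.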
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org