Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/22641

@mgaido91 Thanks for your input. I took another look at the test case. Let me outline some of my understanding first.

- The test validates the precedence rules for determining the resulting compression codec when both session-level and table-level codecs are present.
- It verifies that the correct compression is picked by reading the metadata from the Parquet/ORC file metadata.
- The accepted configurations for Parquet are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
- The accepted configurations for ORC are: none, uncompressed, snappy, zlib, lzo.
- The test case in question uses only a SUBSET of the allowable codecs for Parquet: uncompressed, snappy, gzip.
- The test case in question uses only a SUBSET of the allowable codecs for ORC: none, snappy, zlib.

One thing to note is that the codecs being tested are not exhaustive; we pick a subset (perhaps the most popular ones). The other thing is that we have a 3-way loop over 1) isPartitioned, 2) convertMetastore, and 3) useCTAS on top of the codec loop. So we will be calling the codec loop 6 times in a test, for each unique combination of (isPartitioned, convertMetastore, useCTAS). And we have changed the codec loop to randomly pick one combination of table-level and session-level codecs. Given this, I feel we are getting decent coverage, and I also feel we should be able to catch a regression, since it would show up in one Jenkins run or another. If you still feel uncomfortable, should we take 2 codecs as opposed to 1? It would generate a 24 (4 * 6) iteration loop as opposed to 54 (9 * 6).
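For reference, the loop counts above can be checked with a quick back-of-the-envelope calculation. This is just a sketch of the arithmetic implied by the comment (3 codecs per level, 6 outer iterations), not the actual test code:

```python
# Back-of-the-envelope check of the loop counts discussed above.
# The test exercises (table-level codec, session-level codec) pairs,
# nested inside 6 outer iterations over
# (isPartitioned, convertMetastore, useCTAS).
outer_iterations = 6

# Exhaustive: 3 codecs at each level -> 9 pairs -> 54 total iterations.
codecs_per_level = 3
exhaustive = codecs_per_level * codecs_per_level * outer_iterations
print(exhaustive)  # 54, i.e. 9 * 6

# Sampling 2 codecs at each level -> 4 pairs -> 24 total iterations.
sampled_per_level = 2
reduced = sampled_per_level * sampled_per_level * outer_iterations
print(reduced)  # 24, i.e. 4 * 6
```

Sampling one random pair per run keeps each run cheap while still covering all pairs across repeated CI runs, which is the trade-off being weighed here.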