Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/22641
  
    @mgaido91 
    Thanks for your input.
    
    I took another look at the test case. Let me outline my understanding first.
     
    - The test validates the precedence rules for determining the resultant compression codec when both session-level and table-level codecs are present.
    - It verifies that the correct compression codec is picked by reading it back from the Parquet/ORC file metadata (see the sketch after this list).
    - The accepted configurations for Parquet are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
    - The accepted configurations for ORC are: none, uncompressed, snappy, zlib, lzo.
    - The test case in question uses only a SUBSET of the allowable codecs for Parquet: uncompressed, snappy, gzip.
    - The test case in question uses only a SUBSET of the allowable codecs for ORC: none, snappy, zlib.
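    For reference, a minimal sketch of how such a metadata check can be done for Parquet (the helper name `parquetCodecs` is illustrative, not the exact one in the suite):

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.collection.JavaConverters._

    // Read the compression codec(s) recorded in a Parquet file's footer; the
    // test compares these against the codec expected to win the precedence.
    def parquetCodecs(path: Path, conf: Configuration): Seq[String] = {
      val footer = ParquetFileReader.readFooter(conf, path)
      for {
        block  <- footer.getBlocks.asScala.toSeq
        column <- block.getColumns.asScala.toSeq
      } yield column.getCodec.name()
    }
    ```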
    
    One thing to note is that the codecs being tested are not exhaustive; we pick a subset (perhaps the most popular ones). Another thing is that we have a three-way loop over 1) isPartitioned, 2) convertMetastore and 3) useCTAS on top of the codec loop, so we call the codec loop 6 times in a test, once for each unique combination of (isPartitioned, convertMetastore, useCTAS). And we have changed the codec loop to randomly pick one combination of table-level and session-level codecs.
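    To make the structure concrete, here is a rough sketch of the loop shape described above (names and the codec list are illustrative, assuming the Parquet subset; the real suite covers ORC analogously):

    ```scala
    import scala.util.Random

    val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP")
    for {
      isPartitioned    <- Seq(true, false)
      convertMetastore <- Seq(true, false)
      useCTAS          <- Seq(true, false)
    } {
      // Instead of crossing all 3 x 3 = 9 (table-level, session-level) codec
      // pairs, pick one random pair per outer combination.
      val tableCodec   = codecs(Random.nextInt(codecs.size))
      val sessionCodec = codecs(Random.nextInt(codecs.size))
      // ... run the precedence check with (tableCodec, sessionCodec) ...
    }
    ```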
    
    Given this, I feel we are getting decent coverage, and I feel we should be able to catch regressions, since one Jenkins run or another will hit them. If you still feel uncomfortable, should we take 2 codecs as opposed to 1? That would generate a loop of 24 iterations (4 * 6) as opposed to 54 (9 * 6).
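    For that 2-codec variant, the inner pick could look like this sketch (again illustrative, not the suite's exact code):

    ```scala
    import scala.util.Random

    val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP")
    // Shuffle and take two codecs per level: 2 x 2 = 4 pairs per outer
    // combination, i.e. 4 * 6 = 24 runs instead of 9 * 6 = 54.
    val tableCodecs   = Random.shuffle(codecs).take(2)
    val sessionCodecs = Random.shuffle(codecs).take(2)
    for (tableCodec <- tableCodecs; sessionCodec <- sessionCodecs) {
      // ... run the precedence check with (tableCodec, sessionCodec) ...
    }
    ```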

