bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress 
bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593560005
 
 
   @lamber-ken @leesf @nsivabalan : Yes, the additional string conversion is 
not needed. I refactored the code a little to use the correct bloom-filter 
serialization method, depending on whether compression is enabled. 
   
   @lamber-ken : I am observing the same behavior when comparing the compressed 
and uncompressed cases. The compression ratio degrades as bloom filter 
utilization (the number of keys stored in the bloom filter) increases. Snappy 
behaves the same way, although it compresses worse than gzip. I will need to 
investigate this further.
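
A rough sanity check on why this trend might be expected: a bloom filter's bit array looks more and more like random data as it fills up, and random data does not compress. The sketch below estimates the expected fraction of set bits and the binary entropy bound for each utilization level. It assumes the textbook sizing formulas (bits = -n ln(p) / (ln 2)^2, hashes = ceil(ln 2 * bits / n)); whether `SimpleBloomFilter` sizes itself exactly this way is an assumption, and `BloomEntropy` is a hypothetical helper, not part of the PR.

```java
// Hypothetical helper, JDK only: estimates how compressible a bloom filter's
// bit array can be at a given utilization.
class BloomEntropy {
    // Expected fraction of bits set after inserting `keys` keys into a filter
    // with `bits` bits and `hashes` hash functions (standard approximation).
    public static double fillFraction(long keys, long bits, int hashes) {
        return 1.0 - Math.exp(-(double) hashes * keys / bits);
    }

    // Binary entropy in bits per stored bit: a lower bound on the achievable
    // compressed-to-original ratio for independent bits set with probability p.
    public static double binaryEntropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0;
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        long capacity = 1_000_000L;   // same capacity as the test in this comment
        double errorRate = 0.000001;  // same error rate as the test in this comment
        // Assumed textbook sizing, not read from the Hudi source.
        long bits = (long) Math.ceil(capacity * (-Math.log(errorRate)) / (Math.log(2.0) * Math.log(2.0)));
        int hashes = (int) Math.ceil(Math.log(2.0) * bits / capacity);
        for (int u = 10; u <= 90; u += 10) {
            long keys = capacity * u / 100;
            double p = fillFraction(keys, bits, hashes);
            System.out.printf("utilization=%d%% fill=%.3f entropyBound=%.0f%%%n",
                u, p, 100 * binaryEntropy(p));
        }
    }
}
```

Under these assumptions the fill fraction approaches roughly one half as utilization approaches 100%, which is where the entropy bound reaches 1 bit per bit and no lossless codec can shrink the array. Gzip and snappy are not optimal entropy coders, so their measured ratios would be expected to sit above this bound.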
   
   Results:
   
   ```
   test random keys
   original size: 4792548
   compress size (utilization=10%) : 2150956, CompressToOriginal=44
   compress size (utilization=20%) : 3078736, CompressToOriginal=64
   compress size (utilization=30%) : 3638548, CompressToOriginal=75
   compress size (utilization=40%) : 3977508, CompressToOriginal=82
   compress size (utilization=50%) : 4258972, CompressToOriginal=88
   compress size (utilization=60%) : 4490484, CompressToOriginal=93
   compress size (utilization=70%) : 4647776, CompressToOriginal=96
   compress size (utilization=80%) : 4750028, CompressToOriginal=99
   compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   test sequential keys
   original size: 4792548
   Using Byte[] - compress size (utilization=10%) : 2150852, CompressToOriginal=44
   Using Byte[] - compress size (utilization=20%) : 3078332, CompressToOriginal=64
   Using Byte[] - compress size (utilization=30%) : 3639000, CompressToOriginal=75
   Using Byte[] - compress size (utilization=40%) : 3977764, CompressToOriginal=82
   Using Byte[] - compress size (utilization=50%) : 4258544, CompressToOriginal=88
   Using Byte[] - compress size (utilization=60%) : 4490372, CompressToOriginal=93
   Using Byte[] - compress size (utilization=70%) : 4647832, CompressToOriginal=96
   Using Byte[] - compress size (utilization=80%) : 4749928, CompressToOriginal=99
   Using Byte[] - compress size (utilization=90%) : 4794040, CompressToOriginal=100
   
   Process finished with exit code 0
   
   ```
   
   Test code:
   ```
   @Test
   public void testit() {
     // Utilization = percentage of the filter's configured capacity (1,000,000 keys) that is inserted.
     int[] utilization = new int[] {10, 20, 30, 40, 50, 60, 70, 80, 90};

     System.out.println("test random keys");
     int originalSize = 0;
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
       int numKeys = 10000 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         String key = UUID.randomUUID().toString();
         filter.add(key);
       }

       // The uncompressed serialized size does not depend on the key count, so measure it once.
       if (i == 0) {
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + originalSize);
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }

     System.out.println("\ntest sequential keys");

     // Same measurement with predictable keys, to check whether the key distribution matters.
     for (int i = 0; i < utilization.length; i++) {
       SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
       int numKeys = 10000 * utilization[i];
       for (int j = 0; j < numKeys; j++) {
         String key = "key-" + j;
         filter.add(key);
       }
       if (i == 0) {
         originalSize = filter.serializeToString().length();
         System.out.println("original size: " + originalSize);
       }
       int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
       System.out.println("Using Byte[] - compress size (utilization=" + utilization[i] + "%) : "
           + compressedSize + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
     }
   }
   ```
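
For anyone who wants to reproduce the trend without the Hudi classes, here is a minimal JDK-only sketch. It is a stand-in, not the test above: it sets pseudo-random positions in a `java.util.BitSet` instead of using `SimpleBloomFilter` with Murmur hashing, and the bit-array size, hash count, and class name `BitArrayGzipDemo` are illustrative assumptions chosen to roughly match a 0.000001 error rate at a smaller capacity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the Hudi test: gzip a randomly filled bit array.
class BitArrayGzipDemo {
    // Gzip `data` in memory and return the compressed length in bytes.
    public static int gzipSize(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // Simulate inserting `keys` keys by setting `hashes` pseudo-random bit
    // positions per key in an array of `bits` bits.
    public static byte[] randomBitArray(int bits, int keys, int hashes, long seed) {
        BitSet set = new BitSet(bits);
        Random rnd = new Random(seed);
        for (long i = 0; i < (long) keys * hashes; i++) {
            set.set(rnd.nextInt(bits));
        }
        // BitSet.toByteArray() omits trailing zero bytes, so copy into a
        // fixed-size buffer to keep the uncompressed size constant.
        byte[] raw = new byte[(bits + 7) / 8];
        byte[] used = set.toByteArray();
        System.arraycopy(used, 0, raw, 0, used.length);
        return raw;
    }

    public static void main(String[] args) {
        int capacity = 100_000;  // illustrative: ten times smaller than the test above
        int bits = 2_900_000;    // about 29 bits per key (assumed sizing)
        int hashes = 20;         // assumed hash count
        for (int u = 10; u <= 90; u += 10) {
            byte[] raw = randomBitArray(bits, capacity * u / 100, hashes, 42L);
            int compressed = gzipSize(raw);
            System.out.println("utilization=" + u + "% : raw=" + raw.length
                + ", gzip=" + compressed + ", CompressToOriginal=" + (compressed * 100L / raw.length));
        }
    }
}
```

Because the positions come from `java.util.Random` rather than real key hashes, this only illustrates how bit-array density affects gzip; it says nothing about the behavior of the Hudi serialization format itself.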
   
   

