bvaradar commented on issue #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#issuecomment-593560005

@lamber-ken @leesf @nsivabalan : Yes, the additional string conversion is not needed. I refactored a little to use the correct bloom-filter serialization method, depending on whether compression is enabled.

@lamber-ken : I am observing the same behavior when comparing the compressed and uncompressed cases. The compression ratio gets worse as bloom filter utilization (the number of keys stored in the bloom filter) increases. Snappy behaves the same way, although it compresses worse than gzip. I need to investigate this further.

Result

```
test random keys
original size: 4792548
compress size (utilization=10%) : 2150956, CompressToOriginal=44
compress size (utilization=20%) : 3078736, CompressToOriginal=64
compress size (utilization=30%) : 3638548, CompressToOriginal=75
compress size (utilization=40%) : 3977508, CompressToOriginal=82
compress size (utilization=50%) : 4258972, CompressToOriginal=88
compress size (utilization=60%) : 4490484, CompressToOriginal=93
compress size (utilization=70%) : 4647776, CompressToOriginal=96
compress size (utilization=80%) : 4750028, CompressToOriginal=99
compress size (utilization=90%) : 4794040, CompressToOriginal=100

test sequential keys
original size: 4792548
Using Byte[] - compress size (utilization=10%) : 2150852, CompressToOriginal=44
Using Byte[] - compress size (utilization=20%) : 3078332, CompressToOriginal=64
Using Byte[] - compress size (utilization=30%) : 3639000, CompressToOriginal=75
Using Byte[] - compress size (utilization=40%) : 3977764, CompressToOriginal=82
Using Byte[] - compress size (utilization=50%) : 4258544, CompressToOriginal=88
Using Byte[] - compress size (utilization=60%) : 4490372, CompressToOriginal=93
Using Byte[] - compress size (utilization=70%) : 4647832, CompressToOriginal=96
Using Byte[] - compress size (utilization=80%) : 4749928, CompressToOriginal=99
Using Byte[] - compress size (utilization=90%) : 4794040, CompressToOriginal=100

Process finished with exit code 0
```

Test - Code :

```
@Test
public void testit() {
  // Utilization = percentage of the filter's configured capacity (1,000,000 keys) actually inserted.
  int[] utilization = new int[] { 10, 20, 30, 40, 50, 60, 70, 80, 90};

  System.out.println("test random keys");
  int originalSize = 0;
  for (int i = 0; i < utilization.length; i++) {
    SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
    int numKeys = 10000 * utilization[i];
    for (int j = 0; j < numKeys; j++) {
      String key = UUID.randomUUID().toString();
      filter.add(key);
    }
    // The uncompressed serialized size is recorded once, from the first filter.
    if (i == 0) {
      originalSize = filter.serializeToString().length();
      System.out.println("original size: " + originalSize);
    }
    int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
    System.out.println("compress size (utilization=" + utilization[i] + "%) : " + compressedSize
        + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
  }

  System.out.println("\ntest sequential keys");
  for (int i = 0; i < utilization.length; i++) {
    SimpleBloomFilter filter = new SimpleBloomFilter(1000000, 0.000001, Hash.MURMUR_HASH);
    int numKeys = 10000 * utilization[i];
    for (int j = 0; j < numKeys; j++) {
      String key = "key-" + j;
      filter.add(key);
    }
    if (i == 0) {
      originalSize = filter.serializeToString().length();
      System.out.println("original size: " + originalSize);
    }
    int compressedSize = GzipCompressionUtils.compress(filter.serializeToBytes()).length();
    System.out.println("Using Byte[] - compress size (utilization=" + utilization[i] + "%) : " + compressedSize
        + ", CompressToOriginal=" + (compressedSize * 100 / originalSize));
  }
}
```
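
One possible explanation for the trend above, not confirmed in this thread: as more keys are added, the fraction of set bits in the filter moves toward 50%, and a bit array whose bits are close to uniformly random carries almost one bit of entropy per bit, so no general-purpose compressor can shrink it much. The sketch below is a standalone illustration using only the JDK. The class name, the helper methods, and the sizing formulas in `main` are assumptions (the textbook bloom filter formulas), not taken from `SimpleBloomFilter`, which may size itself differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Hudi.
public class BitDensityCompressionSketch {

  // Textbook approximation of the fraction of set bits after inserting
  // numKeys keys with numHashes hash functions into a filter of numBits bits.
  static double fillFraction(long numKeys, long numBits, int numHashes) {
    return 1.0 - Math.exp(-1.0 * numHashes * numKeys / numBits);
  }

  // Shannon entropy (in bits) of a single bit that is set with probability p.
  static double binaryEntropy(double p) {
    if (p <= 0.0 || p >= 1.0) {
      return 0.0;
    }
    double log2 = Math.log(2);
    return -(p * Math.log(p) / log2 + (1 - p) * Math.log(1 - p) / log2);
  }

  // Bit array in which every bit is set independently with probability fractionSet.
  static byte[] randomBits(int numBytes, double fractionSet, long seed) {
    Random random = new Random(seed);
    byte[] bits = new byte[numBytes];
    for (int i = 0; i < numBytes * 8; i++) {
      if (random.nextDouble() < fractionSet) {
        bits[i / 8] |= (byte) (1 << (i % 8));
      }
    }
    return bits;
  }

  // Size in bytes of the gzip-compressed input.
  static int gzipSize(byte[] input) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
      gzip.write(input);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.size();
  }

  public static void main(String[] args) {
    // Assumption: standard sizing, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
    long capacity = 1_000_000L;
    double errorRate = 0.000001;
    double ln2 = Math.log(2);
    long numBits = (long) Math.ceil(-capacity * Math.log(errorRate) / (ln2 * ln2));
    int numHashes = (int) Math.round((double) numBits / capacity * ln2);

    int sampleBytes = 1 << 20; // 1 MiB synthetic bit array per data point
    for (int utilization = 10; utilization <= 90; utilization += 10) {
      double p = fillFraction(capacity * utilization / 100, numBits, numHashes);
      int compressed = gzipSize(randomBits(sampleBytes, p, 42L));
      System.out.printf("utilization=%d%% setBits=%.3f entropy=%.3f gzipRatio=%.3f%n",
          utilization, p, binaryEntropy(p), (double) compressed / sampleBytes);
    }
  }
}
```

If this explanation holds, the entropy column gives a lower bound on the achievable ratio at each utilization, and gzip and snappy would both be expected to approach 100% as the filter nears its configured capacity, regardless of whether the keys are random or sequential.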