clintropolis commented on PR #12408: URL: https://github.com/apache/druid/pull/12408#issuecomment-1112621153
Hi @churromorales, very sorry for the delay!

> Hi @clintropolis I ran the tests with the parameters, but had to run it with java -jar as the parameterization did not work with the command you provided above.

That is unfortunate; I haven't actually run these in a while, so maybe I made a mistake in my instructions. I'll try to have a look at it at some point. I was hoping it would work so we could easily get the size difference, compare with LZ4, and use the differences for guidance documentation for cluster operators.

> What I did find was that zstandard was slower than lz4. I believe all these tests are reading out of memory. To verify, I added the none compression option to the ColumnarLongsSelectRowsFromGeneratorBenchmark. If we are reading out of memory, then any codec which decompresses slower than your bus (adjusting for the compression ratio, of course) would perform slower. I did see that while zstd was slower than lz4 in these tests, lz4 was also slower than having no compression at all.

Yes, these benchmarks are testing the "hot" historical use case, where the historical is provisioned with a lot of extra "free" memory that ends up occupied by the page cache of the memory-mapped segment files. They are meant to find the best-case performance scenario. It would be interesting to also have measurements where every read comes from disk, so we could compare the opposite case of very dense historicals, but such benchmarks do not exist yet afaik.

> I think for this patch, it would be interesting to see it from a tiering standpoint in Druid. For those segment files that can be memory mapped, having no compression or an ultra-fast compression library is best (while sacrificing compression ratio). But for a cold tier, where you are not always allocating enough memory to guarantee these segment files are memory mapped and perhaps care more about the space requirement, it might be best to use a library like zstandard.
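The "decompresses slower than your bus" point above can be illustrated with a small sketch. Neither lz4 nor zstd ships in the Python standard library, so this uses zlib at two compression levels as a rough stand-in for a "fast" versus a "dense" codec; the data shape (repetitive packed longs) is a loose analogue of a Druid long column, not real segment data.

```python
# Illustrates the tradeoff: once data is in page cache, read throughput is
# bounded by decompression speed, so a denser codec trades CPU for footprint.
# zlib levels 1 and 9 stand in for lz4-like vs zstd-like behavior (assumption:
# neither real codec has stdlib bindings, so this is only directional).
import time
import zlib

# Repetitive packed 8-byte integers, loosely mimicking a compressible long column.
raw = b"".join(i.to_bytes(8, "little") for i in range(100_000)) * 4

for label, level in [("fast (level 1)", 1), ("dense (level 9)", 9)]:
    compressed = zlib.compress(raw, level)
    start = time.perf_counter()
    for _ in range(20):
        zlib.decompress(compressed)
    elapsed = time.perf_counter() - start
    ratio = len(raw) / len(compressed)
    print(f"{label}: ratio {ratio:.1f}x, decompress {elapsed:.3f}s")

# "No compression" baseline: just copying the raw bytes out of memory.
# bytearray() forces a real copy (slicing a bytes object may not).
start = time.perf_counter()
for _ in range(20):
    bytearray(raw)
print(f"uncompressed copy: {time.perf_counter() - start:.3f}s")
```

On most machines the raw copy wins on speed while the denser level wins on ratio, which is the same shape of result the benchmarks above showed for none vs lz4 vs zstd.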
Totally, I don't think this needs to be faster than LZ4 to be added; I'm mostly just curious where it falls so we know how to advise cluster operators on when it might be a good idea to consider using it.

> Anyways I'll post the results for you to take a look at. I turned on zstandard in one of our clusters. I had 2 datasources which were both reading from the same kafka topic, one using zstd, one lz4. After ingesting a few TB of data we did see that zstd had a footprint about 8-10% smaller than lz4.

Do you still have these results? I'd be interested in seeing the difference, as well as the size difference if that isn't too hard to find out. In the meantime, I'll try to get this PR reviewed so we can add the option.
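For reference, if the option lands as proposed, selecting the codec per datasource would presumably go through the existing `indexSpec` in the ingestion `tuningConfig` (the field names below are Druid's existing indexSpec options; `"zstd"` as a value is what this PR would add, so treat this as a sketch rather than final syntax):

```json
{
  "tuningConfig": {
    "type": "kafka",
    "indexSpec": {
      "bitmap": { "type": "roaring" },
      "dimensionCompression": "zstd",
      "metricsCompression": "zstd",
      "longEncoding": "longs"
    }
  }
}
```

A cold-tier datasource could use `"zstd"` for the smaller footprint, while hot-tier datasources keep the default `"lz4"` for decode speed, matching the tiering idea above.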
