ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763225907
##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec,
+                       util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict = rg.String(dict_size, /*min_length=*/5,
+                                          /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices = rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1,
+                                            /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+      RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for (size_t i = 0; i < batches.size(); ++i) {
Review comment:
Done as suggested.
But there is a slight difference. When a record batch is serialized, each buffer is padded to a multiple of 8 bytes. So when there is no compression, the serialized record batch can be slightly larger than the raw data; in that case, the raw size is less than or equal to the serialized size.
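To make the padding effect concrete, here is a tiny illustration (the 13-byte buffer is a made-up value, not something from the test):

```cpp
// Each IPC body buffer is padded to an 8-byte boundary when written,
// so the serialized body can be slightly larger than the raw data.
int64_t raw_size = 13;                            // hypothetical buffer size
int64_t padded_size = ((raw_size + 7) / 8) * 8;   // 16 bytes on the wire
// Hence, with no compression: raw body size <= serialized body size.
```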
In addition, when compression is used and there is very little data (a few hundred bytes, maybe), the compressed size can actually be larger than the raw size. But I have not put that case into the test, so this is not a problem.
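For reference, a minimal sketch of how one batch's ratio could be checked from the writer stats. It assumes the total_raw_body_size and total_serialized_body_size fields this PR adds to WriteStats, reuses batches, comp_ratios, and write_options2 from the diff above, and the tolerance is illustrative:

```cpp
// Sketch only, not the exact FileWriterHelper flow: write batches[1]
// with LZ4_FRAME and derive the ratio from the writer's WriteStats.
ASSERT_OK_AND_ASSIGN(auto sink, io::BufferOutputStream::Create());
ASSERT_OK_AND_ASSIGN(auto writer,
                     MakeFileWriter(sink, batches[1]->schema(), write_options2));
ASSERT_OK(writer->WriteRecordBatch(*batches[1]));
ASSERT_OK(writer->Close());
WriteStats stats = writer->stats();
// Ratio taken as serialized/raw to match the pre-computed comp_ratios;
// a tiny batch could push this above 1.0, which the test avoids.
float ratio = static_cast<float>(stats.total_serialized_body_size) /
              static_cast<float>(stats.total_raw_body_size);
ASSERT_NEAR(ratio, comp_ratios[1], 0.01);
```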