klion26 commented on PR #8031: URL: https://github.com/apache/arrow-rs/pull/8031#issuecomment-3695333270
As this is included in #8480, which wants to finalize Variant type support in Parquet, I want to push this forward to a final state.

# Conclusion

After playing with some benchmarks, I have these conclusions:

- We can add a `HeaderWriter` for the object/list builders (the header writer is generic over const parameters). `ListBuilder` and `ObjectBuilder` use the same header writer, and it improves the new benchmark (described below) by 3%~4%. A minimal dispatch sketch follows the benchmark results below.
- We need benchmarks that better isolate the object/list builders' own work (currently, more than 50% of the time goes to JSON-related work).

CC @alamb @scovich

# Details

- I re-ran the existing benchmarks for multiple solutions (including the const-generic one) and found that there wasn't much difference in the results.
- I used `sample` and the [`flamegraph` tools](https://github.com/brendangregg/FlameGraph) to generate a flamegraph, which shows that `ObjectBuilder::finish` only consumes ~2% of the time, JSON-related work consumes ~54%, `ObjectBuilder::insert` ~14%, and `ObjectBuilder::append_value` ~15% (flamegraph attached below).
- I added a new benchmark (described below) that excludes the JSON deserialization work for the `ObjectBuilder`; it shows that `ObjectBuilder::finish` consumes ~27% and `ObjectBuilder::append_value` ~62%.
- I ran a benchmark for a single-depth object builder with 100/200 fields, which shows that the new solution wins with a 3%~4% improvement.
- I did not add a benchmark for `ObjectBuilder::finish` alone because `append_value` and `finish` are a single unit for the caller.

## Benchmark

The benchmark results:

- `optimize-with-header-writer-const-generic` is the optimized header writer with const generics (the branch is [here](https://github.com/klion26/arrow-rs/blob/b7c3078937f6e511a9e76c4b45e423da60585322/parquet-variant/src/builder/object.rs#L27-L68); sample code is attached below).
- `main-with-object-bench` is the main branch plus the new object builder benchmark (the generator code is [here](https://github.com/klion26/arrow-rs/blob/7d176cc8dbb89f558df760d64bfe20f45aff16b4/parquet-variant-compute/benches/variant_kernels.rs#L449-L503), the benchmark code is [here](https://github.com/klion26/arrow-rs/blob/7d176cc8dbb89f558df760d64bfe20f45aff16b4/parquet-variant-compute/benches/variant_kernels.rs#L75-L110); sample code below).
- `optimize-with-packedu32iterator` is based on `main-with-object-bench` and uses the `PackedU32Iterator` ([PackedU32Iterator](https://github.com/klion26/arrow-rs/blob/5ddfba050d475026c5545ba66d675c0335986887/parquet-variant/src/builder.rs#L62-L100) and [the caller logic](https://github.com/klion26/arrow-rs/blob/5ddfba050d475026c5545ba66d675c0335986887/parquet-variant/src/builder/object.rs#L253-L398); sample code below).
- The results show that both `optimize-with-header-writer-const-generic` and `optimize-with-packedu32iterator` are better than `main` (`optimize-with-packedu32iterator` is slightly better), but since `optimize-with-header-writer-const-generic` is easier to understand (and maintain), I lean toward `optimize-with-header-writer-const-generic`.

```
group                                                                                          optimize-with-header-writer-const-generic    main-with-object-bench        optimize-with-packedu32iterator
-----                                                                                          ------------------------------------------   -----------------------       --------------------------------
batch_json_string_to_variant json_list 8k string                                               1.00    27.5±0.89ms ? ?/sec                  1.09    29.9±0.81ms ? ?/sec    1.03    28.3±0.76ms ? ?/sec
batch_json_string_to_variant object - 1 depth(8000-100) random_json(2436 bytes per document)   1.01   204.1±4.21ms ? ?/sec                  1.03   208.6±4.17ms ? ?/sec    1.00   201.8±4.59ms ? ?/sec
batch_json_string_to_variant object - 1 depth(8000-200) random_json(4871 bytes per document)   1.02   389.8±7.24ms ? ?/sec                  1.04   399.6±7.10ms ? ?/sec    1.00   383.8±7.35ms ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)                              1.01   328.1±5.06ms ? ?/sec                  1.04   337.8±6.06ms ? ?/sec    1.00   325.6±5.64ms ? ?/sec
batch_json_string_to_variant repeated_struct 8k string                                         1.06    11.0±0.35ms ? ?/sec                  1.05    11.0±0.33ms ? ?/sec    1.00    10.4±0.28ms ? ?/sec
variant_get_primitive                                                                          1.01  1730.8±49.77ns ? ?/sec                 1.03  1759.5±53.97ns ? ?/sec   1.00  1712.1±57.32ns ? ?/sec
variant_get_shredded_utf8                                                                      1.02  1248.2±43.11ns ? ?/sec                 1.04  1276.2±26.61ns ? ?/sec   1.00  1225.9±38.09ns ? ?/sec
```
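To illustrate the `HeaderWriter` idea from the conclusion, here is a minimal, self-contained sketch (not the branch code; `append_packed_u32`, `write_offsets`, and `write_offset_section` are hypothetical helpers written for this comment). It shows how a runtime byte width can be mapped once onto a monomorphized const-generic function, so the per-element packing loop carries no width branching:

```
/// Append only the low `SIZE` little-endian bytes of `value`.
fn append_packed_u32<const SIZE: usize>(dest: &mut Vec<u8>, value: u32) {
    dest.extend_from_slice(&value.to_le_bytes()[..SIZE]);
}

/// Monomorphized body: the byte width is a compile-time constant here.
fn write_offsets<const SIZE: usize>(dst: &mut Vec<u8>, offsets: &[u32]) {
    for &off in offsets {
        append_packed_u32::<SIZE>(dst, off);
    }
}

/// Runtime dispatch: pick the monomorphized variant once per header.
fn write_offset_section(dst: &mut Vec<u8>, offset_size: usize, offsets: &[u32]) {
    match offset_size {
        1 => write_offsets::<1>(dst, offsets),
        2 => write_offsets::<2>(dst, offsets),
        3 => write_offsets::<3>(dst, offsets),
        4 => write_offsets::<4>(dst, offsets),
        _ => unreachable!("variant offsets are 1..=4 bytes wide"),
    }
}

fn main() {
    let mut buf = Vec::new();
    // A real builder would derive `offset_size` from the largest offset; 2 bytes here.
    write_offset_section(&mut buf, 2, &[1, 300, 65_535]);
    assert_eq!(buf, vec![1, 0, 44, 1, 255, 255]);
}
```

The `ObjectHeaderWriter` snippet in the Ref section below applies the same const-generic packing to the field ids and offsets of an object header.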
# Ref

<details>
<summary>new benchmark logic</summary>

```
write!(output_buffer, "{{").unwrap();
for i in 0..*max_fields {
    let key_length = rng.random_range(1..=20);
    let key: String = (0..key_length)
        .map(|_| rng.sample(Alphanumeric) as char)
        .collect();
    write!(output_buffer, "\"{key}\":").unwrap();

    let total_weight = *null_weight + *string_weight + *number_weight + *boolean_weight;
    // Generate a random number to determine the type
    let mut random_value: usize = rng.random_range(0..total_weight);
    if random_value <= *null_weight {
        write!(output_buffer, "null").unwrap();
    } else {
        random_value -= *null_weight;
        if random_value <= *string_weight {
            // Generate a random string between 1 and 20 characters
            let length = rng.random_range(1..=20);
            let random_string: String = (0..length)
                .map(|_| rng.sample(Alphanumeric) as char)
                .collect();
            write!(output_buffer, "\"{random_string}\"").unwrap();
        } else {
            random_value -= *string_weight;
            if random_value <= *number_weight {
                // 50% chance of generating an integer or a float
                if rng.random_bool(0.5) {
                    // Generate a random integer
                    let random_integer: i64 = rng.random_range(-1000..1000);
                    write!(output_buffer, "{random_integer}").unwrap();
                } else {
                    // Generate a random float
                    let random_float: f64 = rng.random_range(-1000.0..1000.0);
                    write!(output_buffer, "{random_float}").unwrap();
                }
            } else {
                random_value -= *number_weight;
                if random_value <= *boolean_weight {
                    // Generate a random boolean
                    let random_boolean: bool = rng.random();
                    write!(output_buffer, "{random_boolean}").unwrap();
                }
            }
        }
    }
    if i < *max_fields - 1 {
        write!(output_buffer, ",").unwrap();
    }
}
write!(output_buffer, "}}").unwrap();
```
</details>

<details>
<summary>PackedU32Iterator</summary>

```
struct PackedU32Iterator<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> {
    iterator: T,
    current_item: [u8; 4],
    current_byte: usize, // 0..PACKED_BYTES
}

impl<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> PackedU32Iterator<PACKED_BYTES, T> {
    fn new(packed_bytes: usize, iterator: T) -> Self {
        // eliminate corner cases in `next` by initializing
        // with a fake already-consumed "first" item
        Self {
            iterator,
            current_item: [0; 4],
            current_byte: packed_bytes,
        }
    }
}

impl<const PACKED_BYTES: usize, T: Iterator<Item = [u8; 4]>> Iterator
    for PackedU32Iterator<PACKED_BYTES, T>
{
    type Item = u8;

    fn next(&mut self) -> Option<Self::Item> {
        if self.current_byte >= PACKED_BYTES {
            self.current_item = self.iterator.next()?;
            self.current_byte = 0;
        }
        let rval = self.current_item[self.current_byte];
        self.current_byte += 1;
        Some(rval)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let lower =
            (PACKED_BYTES - self.current_byte) + PACKED_BYTES * self.iterator.size_hint().0;
        (lower, Some(lower))
    }
}
```
</details>
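As a usage illustration only (not taken from the branch), the sketch below assumes the `PackedU32Iterator` definition above is in scope and shows how 3-byte-packed offsets can be streamed into an output buffer with a single `Vec::extend`:

```
fn main() {
    let offsets: Vec<u32> = vec![0, 300, 70_000];

    // Expose each u32 as its full little-endian bytes; the iterator keeps
    // only the low PACKED_BYTES (here 3) bytes of each item.
    let packed = PackedU32Iterator::<3, _>::new(3, offsets.iter().map(|o| o.to_le_bytes()));

    let mut dst: Vec<u8> = Vec::new();
    dst.extend(packed);

    // Each u32 contributes exactly its 3 low-order little-endian bytes.
    assert_eq!(dst, vec![0, 0, 0, 44, 1, 0, 112, 17, 1]);
}
```

The intent is that the builder can extend the buffer from one iterator of already-packed bytes instead of pushing each truncated `u32` separately.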
<details>
<summary>header writer with const generic</summary>

```
fn object_header<const LARGE_BIT: u8, const ID_SIZE: u8, const OFFSET_SIZE: u8>() -> u8 {
    (LARGE_BIT << (BASIC_TYPE_BITS + 4))
        | ((ID_SIZE - 1) << (BASIC_TYPE_BITS + 2))
        | ((OFFSET_SIZE - 1) << BASIC_TYPE_BITS)
        | VariantBasicType::Object as u8
}

struct ObjectHeaderWriter<const OFFSET_SIZE: u8, const ID_SIZE: u8>();

impl<const OFFSET_SIZE: u8, const ID_SIZE: u8> ObjectHeaderWriter<OFFSET_SIZE, ID_SIZE> {
    fn write(
        dst: &mut Vec<u8>,
        num_fields: usize,
        field_ids: impl Iterator<Item = u32>,
        offsets: impl Iterator<Item = usize>,
        data_size: usize,
    ) {
        let header = match num_fields > u8::MAX as usize {
            true => object_header::<1, { ID_SIZE }, { OFFSET_SIZE }>(),
            false => object_header::<0, { ID_SIZE }, { OFFSET_SIZE }>(),
        };
        dst.push(header);
        append_packed_u32::<ID_SIZE>(dst, num_fields as u32);
        for id in field_ids {
            append_packed_u32::<ID_SIZE>(dst, id);
        }
        for off in offsets {
            append_packed_u32::<OFFSET_SIZE>(dst, off as u32);
        }
        append_packed_u32::<OFFSET_SIZE>(dst, data_size as u32);
    }
}

// Append only the low SIZE bytes of the little-endian encoding of `value`.
fn append_packed_u32<const SIZE: u8>(dest: &mut Vec<u8>, value: u32) {
    let len = dest.len() + SIZE as usize;
    dest.extend(value.to_le_bytes());
    dest.truncate(len);
}
```
</details>

> The flamegraph was obtained with these steps:
> 1. run `cargo bench --no-run` to get the benchmark executable
> 2. run the executable with the appropriate benchmark arguments
> 3. run `sudo sample $pid 60 -file /tmp/variant_kernels.sample` to generate the sample file
> 4. use the flamegraph tools to turn the sample file into a flamegraph
>
> I used these steps because I can't install Xcode for now :(

The flamegraph with JSON deserialization (image attached in the GitHub comment)

The flamegraph without JSON deserialization (image attached in the GitHub comment)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
