Hello,

Thank you again for the earlier help with improving overall ADLFS read latency 
using multiple threads, which has worked out really well. 

I’ve incorporated buffering in the adls/writer implementation (up to 64 MB). 
What I’m noticing is that the parquet_writer->WriteTable(table) latency 
dominates everything else in the output phase of the job (~65 sec out of 
~1.2 min). I could use multiple threads (as in io/s3fs), but I’m not sure 
whether that would have any effect on the parquet WriteTable operation. 

Question: Is there anything else I can leverage inside the parquet/writer 
subsystem to improve the core parquet WriteTable latency?


schema:
  map<key,array<struct<…>>>
  struct<…>
  map<key,map<key,map<key, struct<…>>>>
  struct<…>
  binary
num_row_groups: 6
num_rows_per_row_group: ~8 million
write buffer size: 64 * 1024 * 1024 (64 MB)
write compression: snappy
total write latency per row group: ~1.2 min
  adls append/flush latency (minor factor)
Azure: ESv3, RAM: 256 GB, cores: 8
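
For context, the write-side buffering mentioned above can be sketched roughly 
as follows. This is a simplified, self-contained illustration and not the 
actual adls/writer code: BufferedSinkWriter, Sink, and the flush policy are 
hypothetical names and assumptions made for the example, and the sink stands 
in for whatever call performs the remote append.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the remote append call (for example an ADLS
// append). It receives one contiguous chunk of bytes per invocation.
using Sink = std::function<void(const std::uint8_t*, std::size_t)>;

// Accumulates small writes in memory and forwards them to the sink in
// large chunks, so the number of remote round trips stays low.
class BufferedSinkWriter {
 public:
  // `capacity` is the buffer limit in bytes (64 * 1024 * 1024 in the setup
  // described in this mail; any positive value works for the sketch).
  BufferedSinkWriter(Sink sink, std::size_t capacity)
      : sink_(std::move(sink)), capacity_(capacity) {
    buffer_.reserve(capacity_);
  }

  // Buffers `size` bytes. If they would overflow the buffer, the pending
  // bytes are flushed first. A single write at least as large as the
  // buffer goes straight to the sink.
  void Write(const std::uint8_t* data, std::size_t size) {
    if (buffer_.size() + size > capacity_) {
      Flush();
    }
    if (size >= capacity_) {
      sink_(data, size);
      ++sink_calls_;
      return;
    }
    buffer_.insert(buffer_.end(), data, data + size);
  }

  // Sends any pending bytes to the sink and empties the buffer.
  void Flush() {
    if (buffer_.empty()) {
      return;
    }
    sink_(buffer_.data(), buffer_.size());
    ++sink_calls_;
    buffer_.clear();
  }

  // Number of times the sink has been invoked so far.
  std::size_t sink_calls() const { return sink_calls_; }

 private:
  Sink sink_;
  std::size_t capacity_;
  std::vector<std::uint8_t> buffer_;
  std::size_t sink_calls_ = 0;
};
```

With a wrapper of this shape, only the flushes reach the remote store, which 
is consistent with the append/flush cost being a minor factor next to 
WriteTable itself.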

Yesh
