I'm fairly certain there is room for improvement in the C++ implementation for writing single files to ADLFS. Others can correct me if I'm wrong, but we don't do any kind of pipelined writes. I'd guess this is partly because there isn't much benefit when writing to local disk (writes are typically synchronous), but also because it's much easier to just write multiple files.
Is writing multiple files an option for you? I would guess a dataset write producing multiple files would be significantly more efficient than one large single-file write on ADLFS. A rough sketch of a multi-file dataset write is included after the quoted message below.

-Weston

On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
>
> Hello,
>
> Thank you again for the earlier help on improving overall ADLFS read latency
> using multiple threads, which has worked out really well.
>
> I've incorporated buffering in the ADLS writer implementation (up to 64 MB).
> What I'm noticing is that the parquet_writer->WriteTable(table) latency
> dominates everything else in the output phase of the job (~65 sec vs. ~1.2 min).
> I could use multiple threads (like io/s3fs), but I'm not sure that would have
> any effect on the Parquet WriteTable operation.
>
> Question: Is there anything else I can leverage inside the Parquet writer
> subsystem to improve the core WriteTable latency?
>
>
> schema:
>   map<key, array<struct<…>>>
>   struct<…>
>   map<key, map<key, map<key, struct<…>>>>
>   struct<…>
>   binary
> num_row_groups: 6
> num_rows_per_row_group: ~8 mil
> write buffer size: 64 * 1024 * 1024 (~64 MB)
> write compression: snappy
> total write latency per row group: ~1.2 min
> adls append/flush latency (minor factor)
> Azure: ESv3 / RAM: 256 GB / Cores: 8
>
> Yesh
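For illustration, a minimal, untested sketch of what a multi-file dataset write could look like with the Arrow C++ dataset API. The filesystem handle, the output directory, and the partition column "year" are placeholders and not part of the thread above; the idea is only that partitioning (rather than a single OutputStream) is what splits the output into several Parquet files.

// Sketch only: write one in-memory table out as a partitioned,
// multi-file Parquet dataset instead of a single large file.
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WriteAsMultipleFiles(const std::shared_ptr<arrow::Table>& table,
                                   const std::shared_ptr<fs::FileSystem>& filesystem,
                                   const std::string& base_dir) {
  // Wrap the in-memory table as a dataset so the writer can scan it in batches.
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.filesystem = filesystem;   // e.g. the ADLFS filesystem handle
  write_options.base_dir = base_dir;       // an output directory, not a single file
  write_options.basename_template = "part-{i}.parquet";
  write_options.file_write_options =
      std::make_shared<ds::ParquetFileFormat>()->DefaultWriteOptions();
  // Partitioning on a column fans the rows out into multiple files; "year" is a
  // hypothetical column here, substitute something that exists in the real schema.
  write_options.partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("year", arrow::int32())}));

  return ds::FileSystemDataset::Write(write_options, scanner);
}

If I remember right, newer Arrow releases also let you cap file size via options on FileSystemDatasetWriteOptions (e.g. a max-rows-per-file setting) as an alternative to partitioning, so it may be worth checking what your version supports.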
