I think there is probably room for improvement in the entire pipeline. Some more in-depth profiling might inform which areas to target for optimization and/or parallelization, but I don't have any particular user-configurable options to suggest. For the schema in question, some of the comments about future improvements for def/rep level generation [1] might apply.
-Micah

[1] https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/path_internal.cc#L20

On Fri, Mar 26, 2021 at 9:47 PM Weston Pace <[email protected]> wrote:
> I'm fairly certain there is room for improvement in the C++
> implementation for writing single files to ADLFS. Others can correct
> me if I'm wrong, but we don't do any kind of pipelined writes. I'd
> guess this is partly because there isn't much benefit when writing to
> local disk (writes are typically synchronous), but also because it's
> much easier to write multiple files.
>
> Is writing multiple files an option for you? I would guess that a
> dataset write producing multiple files would be significantly more
> efficient than one large single-file write on ADLFS.
>
> -Weston
>
> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
> >
> > Hello,
> >
> > Thank you again for the earlier help on improving overall ADLFS read
> > latency using multiple threads, which has worked out really well.
> >
> > I've incorporated buffering in the ADLS writer implementation (up to
> > 64 MB). What I'm noticing is that the parquet_writer->WriteTable(table)
> > latency dominates everything else in the output phase of the job
> > (~65 sec vs. ~1.2 min). I could use multiple threads (as io/s3fs does),
> > but I'm not sure that would have any effect on the Parquet WriteTable
> > operation.
> >
> > Question: Is there anything else I can leverage inside the Parquet
> > writer subsystem to improve the core WriteTable latency?
> >
> > schema:
> >   map<key,array<struct<…>>>
> >   struct<...>
> >   map<key,map<key,map<key, struct<…>>>>
> >   struct<…>
> >   binary
> > num_row_groups: 6
> > num_rows_per_row_group: ~8 million
> > write buffer size: 64 * 1024 * 1024 (~64 MB)
> > write compression: snappy
> > total write latency per row group: ~1.2 min
> > adls append/flush latency (minor factor)
> > Azure: ESv3 / RAM: 256 GB / Cores: 8
> >
> > Yesh
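
[Editor's note] For readers asking the same question about writer knobs, here is a minimal sketch of what the C++ Parquet writer itself exposes: compression, dictionary encoding, data page size, and the rows-per-row-group value passed as WriteTable's chunk_size argument. It is not taken from the thread; the helper name WriteWithTunedProperties, the 1 MiB page size, and the 1M-row chunk size are illustrative, and the OutputStream stands in for the custom ADLS-backed sink described above.

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Write one table to one sink, spelling out the writer knobs explicitly.
arrow::Status WriteWithTunedProperties(
    const std::shared_ptr<arrow::Table>& table,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  // Core Parquet writer properties: Snappy (as in the original job),
  // dictionary encoding, and a 1 MiB data page size (illustrative value).
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .compression(parquet::Compression::SNAPPY)
          ->enable_dictionary()
          ->data_pagesize(1 * 1024 * 1024)
          ->build();

  // Arrow-level properties; storing the Arrow schema helps round-tripping.
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();

  // chunk_size is the number of rows per row group. Row groups smaller than
  // the ~8M rows mentioned above mean more frequent flushes to the sink at
  // the cost of a little extra metadata.
  constexpr int64_t kRowsPerRowGroup = 1 << 20;  // illustrative value
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    kRowsPerRowGroup, props, arrow_props);
}
```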

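[Editor's note] Weston's multiple-files suggestion presumably refers to the datasets writer; the rough sketch below just slices the table manually to illustrate the idea under simpler assumptions. LocalFileSystem stands in for an ADLS-backed filesystem, WriteInParts and num_parts are hypothetical names, and each slice is serialized on its own thread so the CPU-bound encoding work can overlap.

```cpp
#include <arrow/api.h>
#include <arrow/filesystem/api.h>
#include <parquet/arrow/writer.h>

#include <future>
#include <memory>
#include <string>
#include <vector>

// Split the table into num_parts slices and write each as its own Parquet
// file on a separate thread.
arrow::Status WriteInParts(const std::shared_ptr<arrow::Table>& table,
                           const std::string& base_dir, int num_parts) {
  arrow::fs::LocalFileSystem fs;  // stand-in for an ADLS-backed filesystem
  const int64_t rows_per_part =
      (table->num_rows() + num_parts - 1) / num_parts;

  std::vector<std::future<arrow::Status>> parts;
  for (int i = 0; i < num_parts; ++i) {
    const int64_t offset = i * rows_per_part;
    if (offset >= table->num_rows()) break;
    // Slice is zero-copy; each task serializes an independent range of rows.
    std::shared_ptr<arrow::Table> piece = table->Slice(offset, rows_per_part);
    std::string path = base_dir + "/part-" + std::to_string(i) + ".parquet";

    parts.push_back(std::async(
        std::launch::async, [piece, path, &fs]() -> arrow::Status {
          ARROW_ASSIGN_OR_RAISE(auto sink, fs.OpenOutputStream(path));
          // One row group per part here; chunk_size could be tuned separately.
          ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
              *piece, arrow::default_memory_pool(), sink, piece->num_rows()));
          return sink->Close();
        }));
  }
  for (auto& part : parts) {
    ARROW_RETURN_NOT_OK(part.get());
  }
  return arrow::Status::OK();
}
```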