> Was thinking if Arrow/parquet/encode/decode subsystem had an option to
> pick (any two) from the following three options.
>  - ReadOptimized
>  - WriteOptimized
>  - ComputeOptimized
The only thing I'm aware of that could potentially impact this is the compression used (or not used, I think). I think there is also a configuration knob to turn dictionary encoding on/off (turning it off would reduce computation requirements). The number of rows per row group might also have an impact, but probably to a lesser extent. As you experiment, providing a flame graph or similar profile could highlight hot spots that can be optimized.

On Sun, Mar 28, 2021 at 10:58 AM Yeshwanth Sriram <[email protected]> wrote:
> - Writing multiple files is an option. I've already tested processing (read, filter, write) each row group in a separate thread, and it definitely gets me under two minutes of latency for the whole job. But within each processing unit the parquet write (I suppose parquet encode/serialize) latency dominates all other latencies (including the ADLFS writes), hence my question whether there are any additional options in parquet/writer that I could leverage to bring down this latency.
>
> - The ADLFS SDK supports append(pos, bytes) and a final flush (total bytes) operation, which makes it possible to append from different threads and perform the final flush after all futures are complete. But this latency is a small factor for this particular PoC.
>
> I'll proceed to compare the latency of the existing Spark-based solution with what I have so far and try to publish the numbers here. Thank you again for all the help.
>
> Was thinking if Arrow/parquet/encode/decode subsystem had an option to pick (any two) from the following three options.
>  - ReadOptimized
>  - WriteOptimized
>  - ComputeOptimized
>
> Where
>
> RC -> Possibly an ML training scenario
> WC -> My current use case: raw project/filter and write (no aggregations)
> RW -> Reporting
>
> Yesh
>
>
> > On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]> wrote:
> >
> > On Fri, 26 Mar 2021 18:47:26 -1000
> > Weston Pace <[email protected]> wrote:
> >> I'm fairly certain there is room for improvement in the C++ implementation for writing single files to ADLFS. Others can correct me if I'm wrong, but we don't do any kind of pipelined writes. I'd guess this is partly because there isn't much benefit when writing to local disk (writes are typically synchronous) but also because it's much easier to write multiple files.
> >
> > Writes should be asynchronous most of the time. I don't know anything about ADLFS, though.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >> Is writing multiple files a choice for you? I would guess using a dataset write with multiple files would be significantly more efficient than one large single-file write on ADLFS.
> >>
> >> -Weston
> >>
> >> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Thank you again for the earlier help on improving overall ADLFS read latency using multiple threads, which has worked out really well.
> >>>
> >>> I've incorporated buffering in the adls/writer implementation (up to 64 MB). What I'm noticing is that the parquet_writer->WriteTable(table) latency dominates everything else in the output phase of the job (~65 sec vs ~1.2 min). I could use multiple threads (like io/s3fs), but I'm not sure that would have any effect on the parquet write-table operation.
> >>>
> >>> Question: Is there anything else I can leverage inside the parquet/writer subsystem to improve the core parquet/write/table latency?
> >>>
> >>> schema:
> >>>   map<key, array<struct<…>>>
> >>>   struct<...>
> >>>   map<key, map<key, map<key, struct<…>>>>
> >>>   struct<…>
> >>>   binary
> >>> num_row_groups: 6
> >>> num_rows_per_row_group: ~8mil
> >>> write buffer size: 64 * 1024 * 1024 (~64 MB)
> >>> write compression: snappy
> >>> total write latency per row group: ~1.2 min
> >>> adls append/flush latency: minor factor
> >>> Azure: ESv3 / RAM: 256 GB / Cores: 8
> >>>
> >>> Yesh
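[Editor's note] For reference, the knobs mentioned at the top of this reply (compression codec, dictionary encoding, rows per row group) are exposed through parquet::WriterProperties in the Arrow C++ API. Below is a minimal sketch, assuming table is an already-populated arrow::Table and sink is an open arrow::io::OutputStream; the WriteTuned name is made up, the 8-million-row figure just echoes the row-group size quoted above, and error handling is trimmed:

  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <parquet/arrow/writer.h>
  #include <parquet/properties.h>

  arrow::Status WriteTuned(const std::shared_ptr<arrow::Table>& table,
                           const std::shared_ptr<arrow::io::OutputStream>& sink) {
    // Knobs discussed above: compression codec, dictionary encoding,
    // and number of rows per row group.
    std::shared_ptr<parquet::WriterProperties> props =
        parquet::WriterProperties::Builder()
            .compression(parquet::Compression::UNCOMPRESSED)  // or SNAPPY, ZSTD, ...
            ->disable_dictionary()                            // less CPU, larger files
            ->max_row_group_length(8 * 1000 * 1000)           // rows per row group
            ->build();

    // Same entry point as the parquet_writer->WriteTable(table) call in the thread.
    return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                      /*chunk_size=*/8 * 1000 * 1000, props);
  }

Whether disabling dictionary encoding or changing the codec actually helps is workload-dependent, which is where the flame graph suggested above would come in.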

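[Editor's note] On the "writing multiple files" option discussed in the thread, one way to sketch it without the dataset API is to slice the table and hand each slice to its own thread, so each thread pays its own Parquet encode cost and produces its own file. This is only an illustration: the WriteSlicesInParallel name, the out/part-N.parquet paths, and the rows_per_file parameter are made up, and a real version would target the ADLFS append/flush sink described earlier rather than local files. The dataset writer Weston mentions can handle the file splitting and naming instead.

  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <parquet/arrow/writer.h>

  #include <iostream>
  #include <string>
  #include <thread>
  #include <vector>

  // One Parquet file per table slice, each encoded and written on its own thread.
  void WriteSlicesInParallel(const std::shared_ptr<arrow::Table>& table,
                             int64_t rows_per_file) {
    std::vector<std::thread> workers;
    int part = 0;
    for (int64_t offset = 0; offset < table->num_rows(); offset += rows_per_file) {
      // Slice() is zero-copy; the length is clamped at the end of the table.
      std::shared_ptr<arrow::Table> slice = table->Slice(offset, rows_per_file);
      std::string path = "out/part-" + std::to_string(part++) + ".parquet";
      workers.emplace_back([slice, path]() {
        // Each worker opens its own sink and runs its own Parquet encode.
        auto sink = arrow::io::FileOutputStream::Open(path).ValueOrDie();
        arrow::Status st = parquet::arrow::WriteTable(
            *slice, arrow::default_memory_pool(), sink,
            /*chunk_size=*/slice->num_rows());
        if (st.ok()) st = sink->Close();
        if (!st.ok()) std::cerr << "write failed: " << st.ToString() << std::endl;
      });
    }
    for (auto& w : workers) w.join();
  }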