Thank you both. I ran the profiler (macOS Instruments, Time Profiler) and, as Micah and Weston pointed out, the calls to gzip compression and dictionary encoding were dominant (not counting network/IO). I disabled both but kept snappy compression and did see an improvement: overall job latency fell from 7+ minutes to 6 minutes.
Something else that stood out was time spent in calls to "array->GetFieldByName" and "array->Slice". This could be related to application logic, but it may also be related to the large number of rows in each row group.

@Weston - Regarding the three options: I was thinking out loud; sorry if it interfered with the main discussion.

Yesh

> On Mar 28, 2021, at 11:36 AM, Weston Pace <[email protected]> wrote:
>
> Sorry, I didn’t realize you are writing multiple files already. The flame
> graphs Micah suggested would be extremely helpful. Can you also measure the
> CPU utilization? If the CPU is not close to maxing out then another
> possibility is that pipelined writes can help given ADLFS supports a high
> number of concurrent writes.
>
> Also, regarding ReadOptimized/WriteOptimized/ComputeOptimized. What are you
> thinking is the difference between the three? Other than potentially
> enabling/disabling compression I’m not sure I follow that point.
>
> On Sun, Mar 28, 2021 at 8:12 AM Micah Kornfield <[email protected]> wrote:
> Was thinking if Arrow/parquet/encode/decode subsystem had an option to pick
> (any two) from the following three options.
> - ReadOptimized
> - WriteOptimized
> - ComputeOptimized
>
> The only thing that I'm aware of that could potentially impact this is
> compression used (or not used I think). I think there might also be a
> configuration knob to turn dictionary encoding on/off (turning it off would
> reduce computation requirements). Number of rows per row-group might also
> impact this but probably to a lesser extent.
>
> As you experiment providing a flame-graph or similar profile could
> potentially highlight hot-spots that can be optimized.
>
> On Sun, Mar 28, 2021 at 10:58 AM Yeshwanth Sriram <[email protected]> wrote:
> - Writing multiple files is an option.
> I’ve already tested processing (read, filter, write) each row group in a
> separate thread, and it definitely gets me under 2 minutes of latency for
> the whole job. But within each processing unit the parquet write (I suppose
> parquet encode/serialize) latency dominates all other latencies (including
> ADLFS writes), hence my question whether there are any additional options
> in parquet/writer that I could leverage to bring down this latency.
>
> - ADLFS/sdk supports append(pos, bytes) and a final flush (total bytes)
> operation, which makes it possible to append from different threads and
> perform the final flush operation after all futures are complete. But this
> latency is a small factor for this particular PoC.
>
> I’ll proceed to compare latency between the existing Spark-based solution
> and what I have so far, and try to publish the numbers here. Thank you
> again for all the help.
>
> Was thinking if Arrow/parquet/encode/decode subsystem had an option to pick
> (any two) from the following three options.
> - ReadOptimized
> - WriteOptimized
> - ComputeOptimized
>
> Where
>
> RC -> Possibly ML training scenario
> WC -> My current use case: raw project/filter and write (no aggregations)
> RW -> Reporting
>
> Yesh
>
> > On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]> wrote:
> >
> > On Fri, 26 Mar 2021 18:47:26 -1000
> > Weston Pace <[email protected]> wrote:
> >> I'm fairly certain there is room for improvement in the C++
> >> implementation for writing single files to ADLFS. Others can correct
> >> me if I'm wrong but we don't do any kind of pipelined writes. I'd
> >> guess this is partly because there isn't much benefit when writing to
> >> local disk (writes are typically synchronous) but also because it's
> >> much easier to write multiple files.
> >
> > Writes should be asynchronous most of the time. I don't know anything
> > about ADLFS, though.
> >
> > Regards
> >
> > Antoine.
> >> Is writing multiple files a choice for you? I would guess using a
> >> dataset write with multiple files would be significantly more
> >> efficient than one large single file write on ADLFS.
> >>
> >> -Weston
> >>
> >> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Thank you again for earlier help on improving overall ADLFS read latency
> >>> using multiple threads, which has worked out really well.
> >>>
> >>> I’ve incorporated buffering in the adls/writer implementation (up to 64
> >>> MB). What I’m noticing is that the parquet_writer->WriteTable(table)
> >>> latency dominates everything else in the output phase of the job (~65 sec
> >>> vs ~1.2 min). I could use multiple threads (like io/s3fs) but not sure
> >>> if it will have any effect on the parquet write table operation.
> >>>
> >>> Question: Is there anything else I can leverage inside the parquet/writer
> >>> subsystem to improve the core parquet/write/table latency?
> >>>
> >>> schema:
> >>>   map<key,array<struct<…>>>
> >>>   struct<...>
> >>>   map<key,map<key,map<key, struct<…>>>>
> >>>   struct<…>
> >>>   binary
> >>> num_row_groups: 6
> >>> num_rows_per_row_group: ~8mil
> >>> write buffer size: 64 * 1024 * 1024 (~64 MB)
> >>> write compression: snappy
> >>> total write latency per row group: ~1.2 min
> >>> adls append/flush latency (minor factor)
> >>> Azure: ESv3 / RAM: 256 GB / Cores: 8
> >>>
> >>> Yesh
