Thank you both. I ran the profiler using macOS Instruments (Time Profiler)
and, as Micah and Weston pointed out, the calls to gzip and dictionary
encoding were dominant (not counting the network/IO). I disabled both but
kept the snappy compression and did see an improvement: the overall job
latency fell from 7+ minutes to 6 minutes.
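
Roughly what I mean by disabling dictionary encoding while keeping snappy
(a minimal sketch against the Arrow C++ Parquet writer API; the table and
output-stream names are just placeholders, not the actual job code):

  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <parquet/arrow/writer.h>
  #include <parquet/properties.h>

  arrow::Status WriteSnappyNoDict(
      const std::shared_ptr<arrow::Table>& table,
      const std::shared_ptr<arrow::io::OutputStream>& sink) {
    // Keep snappy compression but skip dictionary encoding entirely.
    std::shared_ptr<parquet::WriterProperties> props =
        parquet::WriterProperties::Builder()
            .compression(parquet::Compression::SNAPPY)
            ->disable_dictionary()
            ->build();
    // chunk_size also caps the number of rows written per row group.
    return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                      sink, /*chunk_size=*/1 << 20, props);
  }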

Something else that stood out was the time spent in calls to
"array->GetFieldByName" and "array->Slice". This could be related to
application logic, but it may also be related to the large number of rows in
each row group.
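
The pattern I'm checking in the application code looks roughly like the
sketch below: hoist the GetFieldByName lookup out of the per-row loop, since
it does a field-name search on every call (the "name" child and the loop
body are hypothetical):

  #include <memory>
  #include <arrow/api.h>

  void ProcessBatch(const std::shared_ptr<arrow::StructArray>& rows) {
    // One name lookup per batch instead of one per row.
    auto name_col = std::static_pointer_cast<arrow::StringArray>(
        rows->GetFieldByName("name"));  // "name" is a hypothetical child field
    if (name_col == nullptr) return;
    for (int64_t i = 0; i < rows->length(); ++i) {
      if (name_col->IsValid(i)) {
        // ... work with name_col->GetView(i) instead of Slice()-ing per row ...
      }
    }
  }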

@Weston - Regarding the three options: I was just thinking out loud; sorry if
it interfered with the main discussion.

Yesh




> On Mar 28, 2021, at 11:36 AM, Weston Pace <[email protected]> wrote:
> 
> Sorry, I didn't realize you were writing multiple files already.  The flame 
> graphs Micah suggested would be extremely helpful.  Can you also measure the 
> CPU utilization?  If the CPU is not close to maxing out, then another 
> possibility is that pipelined writes could help, given that ADLFS supports a 
> high number of concurrent writes.
> 
> Also, regarding ReadOptimized/WriteOptimized/ComputeOptimized.  What are you 
> thinking is the difference between the three?  Other than potentially 
> enabling/disabling compression I’m not sure I follow that point.
> 
> On Sun, Mar 28, 2021 at 8:12 AM Micah Kornfield <[email protected]> wrote:
> Was thinking if the Arrow/Parquet encode/decode subsystem had an option to 
> pick (any two) from the following three options:
> - ReadOptimized
> - WriteOptimized
> - ComputeOptimized
> 
> The only thing that I'm aware of that could potentially impact this is the 
> compression used (or not used, I think). I think there might also be a 
> configuration knob to turn dictionary encoding on/off (turning it off would 
> reduce computation requirements). The number of rows per row group might 
> also impact this, but probably to a lesser extent.
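> 
> (For concreteness, a minimal sketch of those two knobs via the C++ 
> WriterProperties builder; the column name "payload" is illustrative only:)
> 
>   #include <parquet/properties.h>
> 
>   std::shared_ptr<parquet::WriterProperties> MakeProps() {
>     return parquet::WriterProperties::Builder()
>         .disable_dictionary("payload")    // per-column dictionary toggle
>         ->max_row_group_length(1 << 20)   // cap rows per row group
>         ->build();
>   }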
> 
> As you experiment, providing a flame graph or a similar profile could help 
> highlight hot spots that can be optimized.
> 
> On Sun, Mar 28, 2021 at 10:58 AM Yeshwanth Sriram <[email protected]> wrote:
> - Writing multiple files is an option. I've already tested processing (read, 
> filter, write) each row group in a separate thread, and it definitely brings 
> the whole job under 2 minutes of latency. But within each processing unit the 
> Parquet write (I suppose Parquet encode/serialize) latency dominates all 
> other latencies (including the ADLFS writes), hence my question whether there 
> are any additional options in the Parquet writer that I could leverage to 
> bring down this latency.
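> 
> (A minimal sketch of that per-row-group threading, assuming each row group 
> is filtered into its own table and written by its own writer; the local 
> file names stand in for the ADLFS output streams:)
> 
>   #include <future>
>   #include <arrow/api.h>
>   #include <arrow/io/api.h>
>   #include <parquet/arrow/writer.h>
> 
>   arrow::Status WriteOne(const std::shared_ptr<arrow::Table>& filtered, int i) {
>     ARROW_ASSIGN_OR_RAISE(auto out, arrow::io::FileOutputStream::Open(
>         "part-" + std::to_string(i) + ".parquet"));
>     return parquet::arrow::WriteTable(*filtered, arrow::default_memory_pool(),
>                                       out, /*chunk_size=*/1 << 20);
>   }
>   // One std::async(std::launch::async, WriteOne, tables[i], i) per row group,
>   // then wait on all the futures before the final flush.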
> 
> - The ADLFS SDK supports append(pos, bytes) and a final flush(total bytes) 
> operation, which makes it possible to append from different threads and 
> perform the final flush after all futures are complete. But this latency is 
> a small factor for this particular POC.
> 
> I'll proceed to compare the latency of the existing Spark-based solution 
> with what I have so far and try to publish those numbers here. Thank you 
> again for all the help.
> 
> Was thinking if the Arrow/Parquet encode/decode subsystem had an option to 
> pick (any two) from the following three options:
> - ReadOptimized
> - WriteOptimized
> - ComputeOptimized
> 
> Where 
> 
> RC (Read + Compute optimized) -> possibly an ML training scenario
> WC (Write + Compute optimized) -> my current use case: raw project/filter and write (no aggregations)
> RW (Read + Write optimized) -> reporting
> 
> Yesh
> 
> 
> > On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]> wrote:
> > 
> > On Fri, 26 Mar 2021 18:47:26 -1000
> > Weston Pace <[email protected]> wrote:
> >> I'm fairly certain there is room for improvement in the C++
> >> implementation for writing single files to ADLFS.  Others can correct
> >> me if I'm wrong but we don't do any kind of pipelined writes.  I'd
> >> guess this is partly because there isn't much benefit when writing to
> >> local disk (writes are typically synchronous) but also because it's
> >> much easier to write multiple files.
> > 
> > Writes should be asynchronous most of the time.  I don't know anything
> > about ADLFS, though.
> > 
> > Regards
> > 
> > Antoine.
> > 
> > 
> >> 
> >> Is writing multiple files a choice for you?  I would guess using a
> >> dataset write with multiple files would be significantly more
> >> efficient than one large single file write on ADLFS.
> >> 
> >> -Weston
> >> 
> >> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> Thank you again for the earlier help on improving overall ADLFS read 
> >>> latency using multiple threads, which has worked out really well.
> >>> 
> >>> I've incorporated buffering in the ADLS writer implementation (up to 64 
> >>> MB). What I'm noticing is that the parquet_writer->WriteTable(table) 
> >>> latency dominates everything else in the output phase of the job (~65 sec 
> >>> vs ~1.2 min). I could use multiple threads (like io/s3fs), but I'm not 
> >>> sure whether it would have any effect on the Parquet WriteTable operation.
> >>> 
> >>> Question: Is there anything else I can leverage inside the parquet/writer 
> >>> subsystem to improve the core Parquet WriteTable latency?
> >>> 
> >>> 
> >>> schema:
> >>>  map<key,array<struct<…>>>
> >>>  struct<...>
> >>>  map<key,map<key,map<key, struct<…>>>>
> >>>  struct<…>
> >>>  binary
> >>> num_row_groups: 6
> >>> num_rows_per_row_group: ~8 million
> >>> write buffer size: 64 * 1024 * 1024 (~64 MB)
> >>> write compression: snappy
> >>> total write latency per row group: ~1.2 min
> >>> adls append/flush latency: minor factor
> >>> Azure: ESv3 / RAM: 256 GB / Cores: 8
> >>> 
> >>> Yesh  
> >> 
> > 
> > 
> > 
> 
