On Fri, 26 Mar 2021 18:47:26 -1000
Weston Pace <[email protected]> wrote:
> I'm fairly certain there is room for improvement in the C++
> implementation for writing single files to ADLFS.  Others can correct
> me if I'm wrong, but we don't do any kind of pipelined writes.  I'd
> guess this is partly because there isn't much benefit when writing to
> local disk (writes are typically synchronous), but also because it's
> much easier to write multiple files.

Writes should be asynchronous most of the time.  I don't know anything
about ADLFS, though.

Regards

Antoine.
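
For illustration, here is a rough sketch of the pipelined (asynchronous,
double-buffered) write pattern discussed above.  It is plain C++, not
Arrow's actual implementation, and Upload() is a hypothetical stand-in
for an ADLS append/flush call: while the caller keeps encoding into one
buffer, a background task uploads the previous one, so network time
overlaps with encoding time.

#include <cstddef>
#include <future>
#include <iostream>
#include <utility>
#include <vector>

// Double-buffered writer: Flush() hands the filled buffer to a background
// upload while the caller continues appending into the other buffer.
class PipelinedWriter {
 public:
  void Append(const char* data, std::size_t n) {
    front_.insert(front_.end(), data, data + n);
    if (front_.size() >= kBufferSize) Flush();
  }

  void Flush() {
    if (pending_.valid()) pending_.get();  // wait for the in-flight upload
    std::swap(front_, back_);
    front_.clear();
    // Upload back_ in the background; encoding can proceed into front_.
    pending_ = std::async(std::launch::async, [this] { Upload(back_); });
  }

  ~PipelinedWriter() {
    if (!front_.empty()) Flush();
    if (pending_.valid()) pending_.get();
  }

 private:
  static constexpr std::size_t kBufferSize = 64 * 1024 * 1024;  // 64 MiB

  void Upload(const std::vector<char>& buf) {
    // Hypothetical stand-in for the ADLS append/flush call.
    std::cout << "uploading " << buf.size() << " bytes\n";
  }

  std::vector<char> front_, back_;
  std::future<void> pending_;
};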


> 
> Is writing multiple files an option for you?  I would guess that a
> dataset write producing multiple files would be significantly more
> efficient than one large single-file write on ADLFS.
> 
> -Weston
> 
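
To make the multiple-files suggestion concrete, here is a hedged sketch
using the Arrow C++ datasets API (assuming a build with datasets enabled;
exact names can differ between Arrow versions, and adlfs_wrapper is a
placeholder for whatever arrow::fs::FileSystem implementation wraps ADLS
on your side):

#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/result.h>
#include <arrow/table.h>
#include <memory>

namespace ds = arrow::dataset;

arrow::Status WriteAsDataset(
    std::shared_ptr<arrow::Table> table,
    std::shared_ptr<arrow::fs::FileSystem> adlfs_wrapper) {
  // Wrap the in-memory table as a dataset and scan it back out.
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = adlfs_wrapper;     // placeholder filesystem
  write_options.base_dir = "container/output";  // placeholder path
  write_options.basename_template = "part-{i}.parquet";
  write_options.partitioning = ds::Partitioning::Default();

  // Emits several part-N.parquet files instead of one large file.
  return ds::FileSystemDataset::Write(write_options, scanner);
}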
> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> 
> wrote:
> >
> > Hello,
> >
> > Thank you again for the earlier help on improving overall ADLFS read
> > latency using multiple threads, which has worked out really well.
> >
> > I’ve incorporated buffering in the adls/writer implementation (up to
> > 64 MB).  What I’m noticing is that the parquet_writer->WriteTable(table)
> > latency dominates everything else in the output phase of the job
> > (~65 sec vs ~1.2 min).  I could use multiple threads (like io/s3fs),
> > but I’m not sure it would have any effect on the Parquet WriteTable
> > operation.
> >
> > Question: Is there anything else I can leverage inside the parquet/writer
> > subsystem to improve the core parquet WriteTable latency?  (See the
> > sketch after this message.)
> >
> >
> > schema:
> >   map<key,array<struct<…>>>
> >   struct<...>
> >   map<key,map<key,map<key, struct<…>>>>
> >   struct<…>
> >   binary
> > num_row_groups: 6
> > num_rows_per_row_group: ~8 million
> > write buffer size: 64 * 1024 * 1024 (64 MiB)
> > write compression: snappy
> > total write latency per row group: ~1.2 min
> >   adls append/flush latency: minor factor
> > Azure: ESv3, RAM: 256 GB, cores: 8
> >
> > Yesh  
> 
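
On the question about the parquet/writer subsystem itself, here is a
hedged sketch of the knobs the stock writer exposes: a smaller chunk_size
passed to WriteTable produces more, smaller row groups, and the writer
properties control compression and dictionary encoding.  This assumes
standard Arrow C++ APIs; sink is a placeholder for the ADLS output
stream, and whether disabling dictionary encoding helps this schema is
only a guess to measure.

#include <arrow/io/buffered.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

arrow::Status WriteTuned(std::shared_ptr<arrow::Table> table,
                         std::shared_ptr<arrow::io::OutputStream> sink) {
  // 64 MiB of buffering in front of the remote stream, as described above.
  ARROW_ASSIGN_OR_RAISE(
      auto buffered,
      arrow::io::BufferedOutputStream::Create(
          64 * 1024 * 1024, arrow::default_memory_pool(), sink));

  auto props = parquet::WriterProperties::Builder()
                   .compression(parquet::Compression::SNAPPY)
                   ->disable_dictionary()  // a guess: measure on this schema
                   ->build();

  // Smaller row groups flush sooner and spread the encoding cost; the
  // current ~8M-row groups are on the large side.
  constexpr int64_t kRowsPerRowGroup = 1 << 20;  // ~1M rows, an assumption
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), buffered,
      kRowsPerRowGroup, props));
  return buffered->Close();
}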


