>
> Thank you both. I ran the profiler using macOS Instruments (Time
> Profiler) and, as Micah and Weston pointed out, the calls to gzip +
> dictionary encoding were dominant (not counting the network/IO). I
> disabled both but kept the snappy compression and did see an
> improvement: the overall job latency fell from 7+ minutes to 6 minutes.

Could you share the output?


> Something else that stood out was time spent in calls to
> “array->GetFieldByName” and “array->Slice”. This could be related to
> application logic, but it may also be related to the large number of
> rows in each row group.


For slicing, you can try adjusting the batch sizes on WriterProperties (
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/properties.h).
GetFieldByName I would guess is due to application code (I don't think it
is used in the Parquet code, but I could be wrong); in general, looking up
the indices you need once is going to be the most efficient.

On Mon, Mar 29, 2021 at 8:10 PM Yeshwanth Sriram <[email protected]>
wrote:

> Thank you both. I ran the profiler using macOS Instruments (Time
> Profiler) and, as Micah and Weston pointed out, the calls to gzip +
> dictionary encoding were dominant (not counting the network/IO). I
> disabled both but kept the snappy compression and did see an
> improvement: the overall job latency fell from 7+ minutes to 6 minutes.
>
> Something else that stood out was time spent in calls to
> “array->GetFieldByName” and “array->Slice”. This could be related to
> application logic, but it may also be related to the large number of
> rows in each row group.
>
> @Weston - Regarding the three options, I was thinking out loud; sorry if
> it interfered with the main discussion.
>
> Yesh
>
>
>
>
> On Mar 28, 2021, at 11:36 AM, Weston Pace <[email protected]> wrote:
>
> Sorry, I didn’t realize you were writing multiple files already.  The
> flame graphs Micah suggested would be extremely helpful.  Can you also
> measure the CPU utilization?  If the CPU is not close to maxing out, then
> another possibility is that pipelined writes could help, given that ADLFS
> supports a high number of concurrent writes.
>
> Also, regarding ReadOptimized/WriteOptimized/ComputeOptimized: what are
> you thinking the difference between the three is?  Other than potentially
> enabling/disabling compression, I’m not sure I follow that point.
>
> On Sun, Mar 28, 2021 at 8:12 AM Micah Kornfield <[email protected]>
> wrote:
>
>> Was thinking if Arrow/parquet/encode/decode subsystem had an option to
>>> pick (any two) from the following three options.
>>> - ReadOptimized
>>> - WriteOptimized
>>> - ComputeOptimized
>>
>>
>> The only thing that I'm aware of that could potentially impact this is
>> the compression used (or not used). I think there might also be a
>> configuration knob to turn dictionary encoding on/off (turning it off
>> would reduce computation requirements). The number of rows per row group
>> might also impact this, but probably to a lesser extent.
>>
>> As you experiment, providing a flame graph or similar profile could
>> help highlight hot spots that can be optimized.
>>
>> On Sun, Mar 28, 2021 at 10:58 AM Yeshwanth Sriram <[email protected]>
>> wrote:
>>
>>> - Writing multiple files is an option. I’ve already tested processing
>>> (read, filter, write) each row group in a separate thread, and it
>>> definitely gives me under two minutes of latency for the whole job. But
>>> within each processing unit the parquet write (I suppose parquet
>>> encode/serialize) latency dominates all other latencies (including the
>>> ADLFS writes), hence my question whether there are any additional
>>> options in parquet/writer that I could leverage to bring down this
>>> latency.
>>>
>>> - The ADLFS SDK supports append(pos, bytes) and a final flush (total
>>> bytes) operation, which makes it possible to append from different
>>> threads and perform the final flush after all futures are complete. But
>>> this latency is a small factor for this particular PoC.
>>>
>>> I’ll proceed to compare latency between the existing Spark-based
>>> solution and what I have so far, and try to publish the numbers here.
>>> Thank you again for all the help.
>>>
>>> I was wondering if the Arrow/parquet/encode/decode subsystem had an
>>> option to pick (any two) from the following three options:
>>> - ReadOptimized
>>> - WriteOptimized
>>> - ComputeOptimized
>>>
>>> Where
>>>
>>> RC (Read + Compute) -> possibly an ML training scenario
>>> WC (Write + Compute) -> my current use case: raw project/filter and
>>> write (no aggregations)
>>> RW (Read + Write) -> reporting
>>>
>>> Yesh
>>>
>>>
>>> > On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]>
>>> wrote:
>>> >
>>> > On Fri, 26 Mar 2021 18:47:26 -1000
>>> > Weston Pace <[email protected]> wrote:
>>> >> I'm fairly certain there is room for improvement in the C++
>>> >> implementation for writing single files to ADLFS.  Others can correct
>>> >> me if I'm wrong, but we don't do any kind of pipelined writes.  I'd
>>> >> guess this is partly because there isn't much benefit when writing to
>>> >> local disk (writes are typically synchronous), but also because it's
>>> >> much easier to write multiple files.
>>> >
>>> > Writes should be asynchronous most of the time.  I don't know anything
>>> > about ADLFS, though.
>>> >
>>> > Regards
>>> >
>>> > Antoine.
>>> >
>>> >
>>> >>
>>> >> Is writing multiple files a choice for you?  I would guess using a
>>> >> dataset write with multiple files would be significantly more
>>> >> efficient than one large single file write on ADLFS.
>>> >>
>>> >> -Weston
>>> >>
>>> >> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <
>>> [email protected]> wrote:
>>> >>>
>>> >>> Hello,
>>> >>>
>>> >>> Thank you again for the earlier help on improving overall ADLFS read
>>> latency using multiple threads, which has worked out really well.
>>> >>>
>>> >>> I’ve incorporated buffering in the adls/writer implementation (up to
>>> 64 MB). What I’m noticing is that the parquet_writer->WriteTable(table)
>>> latency dominates everything else in the output phase of the job (~65 sec
>>> vs ~1.2 min). I could use multiple threads (like io/s3fs), but I'm not
>>> sure it will have any effect on the parquet write table operation.
>>> >>>
>>> >>> Question: Is there anything else I can leverage inside the
>>> parquet/writer subsystem to improve the core parquet/write/table latency?
>>> >>>
>>> >>>
>>> >>> schema:
>>> >>>  map<key,array<struct<…>>>
>>> >>>  struct<...>
>>> >>>  map<key,map<key,map<key, struct<…>>>>
>>> >>>  struct<…>
>>> >>>  binary
>>> >>> num_row_groups: 6
>>> >>> num_rows_per_row_group: ~8 million
>>> >>> write buffer size: 64 * 1024 * 1024 (~64 MB)
>>> >>> write compression: snappy
>>> >>> total write latency per row group: ~1.2 min
>>> >>> adls append/flush latency: (minor factor)
>>> >>> Azure: ESv3 / RAM: 256 GB / Cores: 8
>>> >>>
>>> >>> Yesh
>>> >>
>>> >
>>> >
>>> >
>>>
>>>
>
