[
https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503359#comment-17503359
]
Yibo Cai edited comment on ARROW-15878 at 3/11/22, 2:07 AM:
------------------------------------------------------------
Compared two possible optimization approaches. The results are not very
satisfying.
Benchmarked two scenarios:
- WriteCsvStringWithQuote is the best case: there is only one quote, at the end
of the string
- WriteCsvStringAllQuotes is the worst case: the whole string is filled with
quotes
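To make the two scenarios concrete, here is a minimal sketch of how such inputs could be built. The helper names and the `len` parameter are hypothetical; the actual benchmark's string lengths and generator are not shown in this comment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical generator for the best case: a string of `len` characters
// whose only quote is the very last character.
std::string MakeBestCase(std::size_t len) {
  if (len == 0) return {};
  std::string s(len, 'a');
  s.back() = '"';
  return s;
}

// Hypothetical generator for the worst case: every character is a quote.
std::string MakeWorstCase(std::size_t len) { return std::string(len, '"'); }
```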
*approach0 (baseline): naive char by char copying*
{code:bash}
WriteCsvStringWithQuote/0    938246 ns   938230 ns  745  bytes_per_second=490.029M/s null_percent=0
WriteCsvStringWithQuote/1   1014895 ns  1014890 ns  688  bytes_per_second=448.751M/s null_percent=1
WriteCsvStringWithQuote/10  1060796 ns  1060780 ns  659  bytes_per_second=393.06M/s null_percent=10
WriteCsvStringWithQuote/50   891765 ns   891760 ns  786  bytes_per_second=269.686M/s null_percent=50
WriteCsvStringAllQuotes/0   1001146 ns  1001109 ns  699  bytes_per_second=785.086M/s null_percent=0
WriteCsvStringAllQuotes/1   1053971 ns  1053956 ns  664  bytes_per_second=738.526M/s null_percent=1
WriteCsvStringAllQuotes/10  1102326 ns  1102258 ns  655  bytes_per_second=645.272M/s null_percent=10
WriteCsvStringAllQuotes/50   894888 ns   894843 ns  781  bytes_per_second=451.882M/s null_percent=50
{code}
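For reference, the baseline amounts to something like the sketch below. This is not Arrow's actual writer code; `EscapeNaive` is a hypothetical name, and the function handles only the quote doubling, not the surrounding field quotes.

```cpp
#include <string>
#include <string_view>

// Hypothetical baseline: copy one character at a time and emit an extra
// quote in front of every quote found in the input.
std::string EscapeNaive(std::string_view in) {
  std::string out;
  out.reserve(in.size() * 2);  // worst case: every character is a quote
  for (char c : in) {
    if (c == '"') out.push_back('"');
    out.push_back(c);
  }
  return out;
}
```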
*approach1: repeatedly find the next quote, memcpy the quote-free block before it*
- *best case: throughput improves about 20%*
- *worst case: throughput drops about 70%*
{code:bash}
WriteCsvStringWithQuote/0    785568 ns   785549 ns  889  bytes_per_second=585.272M/s null_percent=0
WriteCsvStringWithQuote/1    849845 ns   849834 ns  821  bytes_per_second=535.908M/s null_percent=1
WriteCsvStringWithQuote/10   885708 ns   885696 ns  790  bytes_per_second=470.76M/s null_percent=10
WriteCsvStringWithQuote/50   840687 ns   840662 ns  832  bytes_per_second=286.079M/s null_percent=50
WriteCsvStringAllQuotes/0   3606928 ns  3606876 ns  192  bytes_per_second=217.905M/s null_percent=0
WriteCsvStringAllQuotes/1   3765233 ns  3765083 ns  186  bytes_per_second=206.735M/s null_percent=1
WriteCsvStringAllQuotes/10  3686031 ns  3685964 ns  190  bytes_per_second=192.963M/s null_percent=10
WriteCsvStringAllQuotes/50  2362894 ns  2362807 ns  295  bytes_per_second=171.137M/s null_percent=50
{code}
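A rough sketch of what approach1 could look like, assuming `std::memchr` is used for the search as the issue description suggests. `EscapeMemchr` is a hypothetical name, not the code in the attached patch. When every character is a quote, each iteration pays for one `memchr` call and one `memcpy` call to move a single byte, which may explain the drop in the worst case.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical sketch: search for the next quote, bulk-copy everything up to
// and including it, then emit the extra quote and continue after it.
std::string EscapeMemchr(std::string_view in) {
  std::string out(in.size() * 2, '\0');  // worst case: every char is a quote
  const char* src = in.data();
  const char* end = src + in.size();
  char* dst = out.data();
  while (src < end) {
    const std::size_t remaining = static_cast<std::size_t>(end - src);
    const void* hit = std::memchr(src, '"', remaining);
    // Copy through the quote if one was found, otherwise copy the tail.
    const std::size_t n =
        hit ? static_cast<std::size_t>(static_cast<const char*>(hit) - src) + 1
            : remaining;
    std::memcpy(dst, src, n);
    src += n;
    dst += n;
    if (hit) *dst++ = '"';  // the doubled quote
  }
  out.resize(static_cast<std::size_t>(dst - out.data()));
  return out;
}
```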
*approach2: check 8 chars, memcpy if no quote, otherwise copy char by char*
- *best case: throughput improves about 10%*
- *worst case: no difference*
{code:bash}
WriteCsvStringWithQuote/0    862995 ns   862991 ns  809  bytes_per_second=532.751M/s null_percent=0
WriteCsvStringWithQuote/1    900671 ns   900650 ns  774  bytes_per_second=505.671M/s null_percent=1
WriteCsvStringWithQuote/10   896087 ns   896066 ns  779  bytes_per_second=465.312M/s null_percent=10
WriteCsvStringWithQuote/50   805413 ns   805363 ns  870  bytes_per_second=298.618M/s null_percent=50
WriteCsvStringAllQuotes/0    993539 ns   993503 ns  702  bytes_per_second=791.097M/s null_percent=0
WriteCsvStringAllQuotes/1   1043675 ns  1043650 ns  671  bytes_per_second=745.819M/s null_percent=1
WriteCsvStringAllQuotes/10  1041745 ns  1041702 ns  646  bytes_per_second=682.782M/s null_percent=10
WriteCsvStringAllQuotes/50   889888 ns   889870 ns  786  bytes_per_second=454.407M/s null_percent=50
{code}
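A possible shape for approach2 is sketched below. How the 8-char check is done is not stated in this comment; the sketch assumes a portable SWAR (word-at-a-time) test rather than SIMD intrinsics. `HasQuote8` and `EscapeBlock8` are hypothetical names, not the code in the attached patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Returns true if any of the 8 bytes starting at `p` is '"' (0x22).
// XOR turns quote bytes into zero bytes; the classic "has zero byte" bit
// trick then tells whether at least one zero byte exists in the word.
inline bool HasQuote8(const char* p) {
  std::uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  v ^= 0x2222222222222222ULL;
  return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Hypothetical sketch: handle 8 bytes at a time. Quote-free blocks are
// copied with memcpy, blocks containing a quote fall back to the slow path.
std::string EscapeBlock8(std::string_view in) {
  std::string out(in.size() * 2, '\0');  // worst case: every char is a quote
  const char* src = in.data();
  const char* end = src + in.size();
  char* dst = out.data();
  while (end - src >= 8) {
    if (!HasQuote8(src)) {
      std::memcpy(dst, src, 8);
      dst += 8;
    } else {
      for (int i = 0; i < 8; ++i) {
        if (src[i] == '"') *dst++ = '"';
        *dst++ = src[i];
      }
    }
    src += 8;
  }
  // Tail shorter than 8 bytes: char by char.
  for (; src < end; ++src) {
    if (*src == '"') *dst++ = '"';
    *dst++ = *src;
  }
  out.resize(static_cast<std::size_t>(dst - out.data()));
  return out;
}
```

Because a block with any quote takes the same per-character path as the baseline, the all-quotes case would be expected to stay close to approach0, which matches the numbers above.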
> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
> Key: ARROW-15878
> URL: https://issues.apache.org/jira/browse/ARROW-15878
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yibo Cai
> Assignee: Yibo Cai
> Priority: Major
> Attachments:
> 0001-ARROW-15878-improve-csv-writer-for-string-with-quote.patch, wip.patch
>
>
> Escaping a string with quotes (putting an extra quote before each quote) is
> the hotspot of the csv writer [1]. This can probably be improved. Possible
> approaches:
> - Find the next quote with memchr, then memcpy blocks without quotes.
> - Check 8 or 16 bytes at a time for quotes with simd; memcpy if there are
> none, otherwise take the slow path.
> We should make sure the method doesn't decrease performance too much for
> strings with many quotes, and that performance is similar or better for short
> strings, which are the common case.
> [1]
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]