[ 
https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503359#comment-17503359
 ] 

Yibo Cai edited comment on ARROW-15878 at 3/11/22, 2:07 AM:
------------------------------------------------------------

Compared two possible optimization approaches. Not very satisfied with the 
results.
Benchmarked two scenarios:
 - WriteCsvStringWithQuote is the best case: there is only one quote, at the 
end of the string
 - WriteCsvStringAllQuotes is the worst case: the whole string is filled with 
quotes

*approach0 (baseline): naive char by char copying*
{code:bash}
WriteCsvStringWithQuote/0      938246 ns       938230 ns          745 bytes_per_second=490.029M/s null_percent=0
WriteCsvStringWithQuote/1     1014895 ns      1014890 ns          688 bytes_per_second=448.751M/s null_percent=1
WriteCsvStringWithQuote/10    1060796 ns      1060780 ns          659 bytes_per_second=393.06M/s null_percent=10
WriteCsvStringWithQuote/50     891765 ns       891760 ns          786 bytes_per_second=269.686M/s null_percent=50
WriteCsvStringAllQuotes/0     1001146 ns      1001109 ns          699 bytes_per_second=785.086M/s null_percent=0
WriteCsvStringAllQuotes/1     1053971 ns      1053956 ns          664 bytes_per_second=738.526M/s null_percent=1
WriteCsvStringAllQuotes/10    1102326 ns      1102258 ns          655 bytes_per_second=645.272M/s null_percent=10
WriteCsvStringAllQuotes/50     894888 ns       894843 ns          781 bytes_per_second=451.882M/s null_percent=50
{code}
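
For reference, the baseline can be pictured roughly as below. This is only a minimal sketch, not the actual code in writer.cc: the name `EscapeNaive` and the `std::string` output buffer are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical helper (not Arrow's real function): append `in` to `out`,
// putting an extra quote before every quote, one character at a time.
inline void EscapeNaive(const std::string& in, std::string* out) {
  // Worst case every character is a quote, so the output can double in size.
  out->reserve(out->size() + 2 * in.size());
  for (char c : in) {
    if (c == '"') out->push_back('"');  // the extra (escaping) quote
    out->push_back(c);
  }
}
```

The cost per character is the same whether or not the string contains quotes, which fits the baseline times being close for both scenarios.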
*approach1: repeatedly find the next quote, memcpy the quote-free block before it*
 - *best case: improves about 20%*
 - *worst case: drops about 70%*

{code:bash}
WriteCsvStringWithQuote/0      785568 ns       785549 ns          889 bytes_per_second=585.272M/s null_percent=0
WriteCsvStringWithQuote/1      849845 ns       849834 ns          821 bytes_per_second=535.908M/s null_percent=1
WriteCsvStringWithQuote/10     885708 ns       885696 ns          790 bytes_per_second=470.76M/s null_percent=10
WriteCsvStringWithQuote/50     840687 ns       840662 ns          832 bytes_per_second=286.079M/s null_percent=50
WriteCsvStringAllQuotes/0     3606928 ns      3606876 ns          192 bytes_per_second=217.905M/s null_percent=0
WriteCsvStringAllQuotes/1     3765233 ns      3765083 ns          186 bytes_per_second=206.735M/s null_percent=1
WriteCsvStringAllQuotes/10    3686031 ns      3685964 ns          190 bytes_per_second=192.963M/s null_percent=10
WriteCsvStringAllQuotes/50    2362894 ns      2362807 ns          295 bytes_per_second=171.137M/s null_percent=50
{code}
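
A minimal sketch of the find-next-quote idea, assuming `std::memchr` is used for the search and a `std::string` as the output buffer. The name `EscapeMemchr` is hypothetical and this is not the benchmarked patch itself.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: locate each quote with memchr, bulk-copy the
// quote-free block in front of it, then emit the quote doubled.
inline void EscapeMemchr(const std::string& in, std::string* out) {
  const char* p = in.data();
  const char* end = p + in.size();
  while (p < end) {
    const void* hit = std::memchr(p, '"', static_cast<std::size_t>(end - p));
    if (hit == nullptr) {
      // No more quotes: copy the remainder in one go.
      out->append(p, static_cast<std::size_t>(end - p));
      break;
    }
    const char* q = static_cast<const char*>(hit);
    out->append(p, static_cast<std::size_t>(q - p));  // block without quotes
    out->append(2, '"');                              // escaped quote
    p = q + 1;
  }
}
```

With a string made only of quotes, this loop issues one `memchr` call and one zero-length block copy per character, which is a plausible reason for the slowdown seen in WriteCsvStringAllQuotes.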
*approach2: check 8 chars at a time, memcpy if no quote, otherwise copy char by char*
 - *best case: improves about 10%*
 - *worst case: no difference*

{code:bash}
WriteCsvStringWithQuote/0      862995 ns       862991 ns          809 bytes_per_second=532.751M/s null_percent=0
WriteCsvStringWithQuote/1      900671 ns       900650 ns          774 bytes_per_second=505.671M/s null_percent=1
WriteCsvStringWithQuote/10     896087 ns       896066 ns          779 bytes_per_second=465.312M/s null_percent=10
WriteCsvStringWithQuote/50     805413 ns       805363 ns          870 bytes_per_second=298.618M/s null_percent=50
WriteCsvStringAllQuotes/0      993539 ns       993503 ns          702 bytes_per_second=791.097M/s null_percent=0
WriteCsvStringAllQuotes/1     1043675 ns      1043650 ns          671 bytes_per_second=745.819M/s null_percent=1
WriteCsvStringAllQuotes/10    1041745 ns      1041702 ns          646 bytes_per_second=682.782M/s null_percent=10
WriteCsvStringAllQuotes/50     889888 ns       889870 ns          786 bytes_per_second=454.407M/s null_percent=50
{code}
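
The 8-chars-at-a-time check could look something like the sketch below. How the benchmarked code actually tests a block is not stated here, so the SWAR zero-byte trick in `HasQuote8` is an assumption, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// True if any of the 8 bytes in `word` equals '"' (0x22).
// Assumed technique: XOR turns matching bytes into zero, then the
// classic "has zero byte" bit test is applied.
inline bool HasQuote8(std::uint64_t word) {
  const std::uint64_t kOnes = 0x0101010101010101ULL;
  const std::uint64_t kHighs = 0x8080808080808080ULL;
  const std::uint64_t x = word ^ (kOnes * 0x22);
  return ((x - kOnes) & ~x & kHighs) != 0;
}

// Hypothetical helper: bulk-copy 8-byte blocks that contain no quote,
// fall back to char by char copying for blocks that do, and for the tail.
inline void EscapeBlock8(const std::string& in, std::string* out) {
  const char* p = in.data();
  const char* end = p + in.size();
  while (end - p >= 8) {
    std::uint64_t word;
    std::memcpy(&word, p, 8);  // safe for unaligned input
    if (!HasQuote8(word)) {
      out->append(p, 8);  // fast path: no quote in this block
    } else {
      for (int i = 0; i < 8; ++i) {  // slow path
        if (p[i] == '"') out->push_back('"');
        out->push_back(p[i]);
      }
    }
    p += 8;
  }
  for (; p < end; ++p) {  // fewer than 8 chars left
    if (*p == '"') out->push_back('"');
    out->push_back(*p);
  }
}
```

When every block contains a quote, the fast path is never taken and the work is essentially the baseline loop plus one cheap test per 8 chars, in line with the "no difference" worst-case result.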



> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
>                 Key: ARROW-15878
>                 URL: https://issues.apache.org/jira/browse/ARROW-15878
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>         Attachments: 
> 0001-ARROW-15878-improve-csv-writer-for-string-with-quote.patch, wip.patch
>
>
> Escaping a string with quotes (putting an extra quote before each quote) is 
> the hotspot of the csv writer [1]. This can probably be improved. Possible 
> approaches:
>  - Find the next quote with memchr, then memcpy blocks without quotes.
>  - Check for quotes with simd in 8 bytes or 16 bytes; do memcpy if there are 
> none, otherwise go the slow path.
> We should make sure the method doesn't decrease performance too much for 
> strings with many quotes. It should also give similar or better performance 
> for short strings, which is the common case.
> [1] 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
