[ 
https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-15878:
-----------------------------
    Description: 
Escaping quotes in a string (inserting an extra quote before each quote) is the 
hotspot of the csv writer [1]. This can probably be improved. Possible approaches:
 - Find the next quote with memchr, then memcpy the quote-free blocks.
 - Check 8 or 16 bytes at a time for quotes with SIMD; memcpy the block if there 
are none, otherwise take the slow path.
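
A minimal sketch of the first approach, for illustration only. The helper name 
EscapeQuotesMemchr is hypothetical and this is not the code in writer.cc; it 
assumes the output is a std::string and uses std::string::append for the block 
copy in place of a raw memcpy.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper (not the Arrow implementation): appends `in` to `out`,
// doubling every '"' character.
inline void EscapeQuotesMemchr(std::string_view in, std::string* out) {
  const char* pos = in.data();
  const char* end = pos + in.size();
  while (pos < end) {
    // Locate the next quote in the remaining input.
    const void* hit = std::memchr(pos, '"', static_cast<std::size_t>(end - pos));
    if (hit == nullptr) {
      // No quote left: copy the rest in one block.
      out->append(pos, static_cast<std::size_t>(end - pos));
      return;
    }
    const char* quote = static_cast<const char*>(hit);
    // Copy the quote-free block together with the quote itself,
    // then emit the extra quote that escapes it.
    out->append(pos, static_cast<std::size_t>(quote - pos) + 1);
    out->push_back('"');
    pos = quote + 1;
  }
}
```

For strings with many quotes this calls memchr once per quote, so the per-call 
overhead is the cost to watch.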

The chosen method should not decrease performance too much for strings 
with many quotes, and it should perform similarly or better for short 
strings, which are the common case.
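
A minimal sketch of the second approach for 8-byte blocks, for illustration 
only. It uses a portable word-at-a-time bit trick instead of SIMD intrinsics, 
which is a substitution made here, not something this issue specifies. The names 
HasQuoteIn8Bytes and EscapeQuotesBlockwise are hypothetical. Strings shorter 
than 8 bytes go straight to the byte loop, so short inputs pay no block 
overhead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper: true if any of the 8 bytes starting at `p` is '"'.
inline bool HasQuoteIn8Bytes(const char* p) {
  std::uint64_t block;
  std::memcpy(&block, p, sizeof(block));  // safe for unaligned input
  const std::uint64_t kOnes = 0x0101010101010101ULL;
  const std::uint64_t kHighs = 0x8080808080808080ULL;
  // XOR maps every quote byte to zero; the expression below is non-zero
  // exactly when some byte of `x` is zero.
  const std::uint64_t x = block ^ (kOnes * static_cast<std::uint8_t>('"'));
  return ((x - kOnes) & ~x & kHighs) != 0;
}

// Hypothetical helper: appends `in` to `out`, doubling every '"' character.
inline void EscapeQuotesBlockwise(std::string_view in, std::string* out) {
  auto append_escaped = [out](char c) {
    if (c == '"') out->push_back('"');
    out->push_back(c);
  };
  std::size_t i = 0;
  for (; i + 8 <= in.size(); i += 8) {
    if (!HasQuoteIn8Bytes(in.data() + i)) {
      out->append(in.data() + i, 8);  // fast path: block has no quotes
    } else {
      for (std::size_t j = i; j < i + 8; ++j) append_escaped(in[j]);  // slow path
    }
  }
  for (; i < in.size(); ++i) append_escaped(in[i]);  // tail shorter than 8 bytes
}
```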

[1] 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]

  was:
Escaping quotes in a string (inserting an extra quote before each quote) is the 
hotspot of the csv writer [1]. This can probably be improved. Possible approaches:
 - Find the next quote with memchr, then memcpy the quote-free blocks.
 - Check 8 or 16 bytes at a time for quotes with SIMD; memcpy the block if there 
are none, otherwise take the slow path.

The chosen method should not decrease performance too much for strings 
with many quotes, and it should perform similarly for short strings, which are 
the common case.

[1] 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]


> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
>                 Key: ARROW-15878
>                 URL: https://issues.apache.org/jira/browse/ARROW-15878
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>
> Escaping quotes in a string (inserting an extra quote before each quote) is the 
> hotspot of the csv writer [1]. This can probably be improved. Possible approaches:
>  - Find the next quote with memchr, then memcpy the quote-free blocks.
>  - Check 8 or 16 bytes at a time for quotes with SIMD; memcpy the block if 
> there are none, otherwise take the slow path.
> The chosen method should not decrease performance too much for strings 
> with many quotes, and it should perform similarly or better for short 
> strings, which are the common case.
> [1] 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
