New submission from Piotr Tokarski <[email protected]>:
Consider the following CSV content: "a|b\nc| 'd\ne|' f". The real
delimiter here is the '|' character, but the sniffer detects ' ' instead.
A verbose example is attached.
The problem lies in the following code in csv.py:
```
matches = []
for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',  # ,".*?",
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',    #  ".*?",
              r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',    # ,".*?"
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                             #  ".*?" (no delim, no space)
    regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
    matches = regexp.findall(data)
    if matches:
        break
```
This makes matches non-empty, so further processing continues with the
delimiter falsely set to ' '.
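
As a minimal sketch of the effect, the first pattern from the excerpt above can be run on the sample data in isolation. This uses only the re module and the pattern as quoted. It does not call csv.Sniffer, so it shows only the regex step, not the full sniffing logic. The space after '|' is captured as the delimiter because the quoted span 'd\ne|' is followed by another space:

```python
import re

# Sample CSV content from the report; the intended delimiter is '|'.
data = "a|b\nc| 'd\ne|' f"

# First pattern from the excerpt: delimiter, optional space, quoted text,
# then the same delimiter again.
restr = r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)'
regexp = re.compile(restr, re.DOTALL | re.MULTILINE)

# findall returns one (delim, space, quote) tuple per match.
matches = regexp.findall(data)
print(matches)  # [(' ', '', "'")]
```

Since this first pattern already yields a match, the loop breaks before the remaining patterns are tried.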
----------
components: Library (Lib)
messages: 397821
nosy: pt12lol
priority: normal
severity: normal
status: open
title: CSV sniffing falsely detects space as a delimiter
type: behavior
versions: Python 3.8
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue44677>
_______________________________________