Kai Fricke created ARROW-17641:
----------------------------------

             Summary: [python] Deserializing ParseOptions does not set up invalid row handler correctly
                 Key: ARROW-17641
                 URL: https://issues.apache.org/jira/browse/ARROW-17641
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 9.0.0
            Reporter: Kai Fricke


Serializing and deserializing a {{csv.ParseOptions}} object that has an 
{{invalid_row_handler}} set leaves the handler unusable after the round trip. 
This is likely because the {{__setstate__}} method does not call the property 
setter correctly when it restores the handler.

Reproduction script:

 
{code:python}
import cloudpickle
from pyarrow import csv


invalid_csv = """f1,f2
3,4
5,6
\x00\x00
7,8"""

source = "test.csv"
with open(source, "w") as f:
    f.write(invalid_csv)


def read_file(path, parse_options):
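    # Workaround: uncomment the next line to re-assign the handler to itself,
    # which runs the property setter again on the deserialized object.
    # parse_options.invalid_row_handler = parse_options.invalid_row_handler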

    with open(path, "rb") as f:
        return csv.read_csv(f, parse_options=parse_options)


parse_options = csv.ParseOptions(
    delimiter=",", invalid_row_handler=lambda i: "skip"
)

# Will succeed
print(read_file(source, parse_options=parse_options))

parse_options = cloudpickle.loads(cloudpickle.dumps(parse_options))

# Will fail
print(read_file(source, parse_options=parse_options))


{code}
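
The suspected cause can be illustrated with a minimal pure-Python sketch. It is not pyarrow's actual implementation: the {{Options}} class, its {{handler}} property, the {{_active_handler}} attribute and the {{roundtrip}} helper are all hypothetical names chosen for illustration. The sketch assumes the property setter does extra wiring beyond storing the callable, and that {{__setstate__}} restores the stored value without going through that setter. The round trip is done by hand (the same {{__getstate__}} / {{__setstate__}} calls that pickle makes) so the example does not depend on how it is imported.

```python
class Options:
    """Hypothetical stand-in for an options object; NOT pyarrow's code."""

    def __init__(self, handler=None):
        self._handler = None
        self._active_handler = None  # what the "reader" actually calls
        self.handler = handler       # goes through the property setter

    @property
    def handler(self):
        return self._handler

    @handler.setter
    def handler(self, value):
        # The setter stores the callable AND wires it up for use.
        self._handler = value
        self._active_handler = value

    def __getstate__(self):
        return {"handler": self._handler}

    def __setstate__(self, state):
        # Assumed bug pattern: the value is restored directly, so the
        # wiring normally done by the setter never happens.
        self._handler = state["handler"]
        self._active_handler = None

    def handle(self, row):
        if self._active_handler is None:
            raise ValueError(f"invalid row: {row!r}")
        return self._active_handler(row)


def roundtrip(obj):
    """Mimic a pickle round trip using the object's own state hooks."""
    clone = type(obj).__new__(type(obj))
    clone.__setstate__(obj.__getstate__())
    return clone


def demo():
    original = Options(handler=lambda row: "skip")
    before = original.handle("bad row")

    restored = roundtrip(original)
    try:
        restored.handle("bad row")
        failed_after_roundtrip = False
    except ValueError:
        failed_after_roundtrip = True

    # Same idea as the commented-out workaround in the script above:
    # re-assigning the property runs the setter again.
    restored.handler = restored.handler
    after_workaround = restored.handle("bad row")
    return before, failed_after_roundtrip, after_workaround
```

In this sketch the getter still returns the original callable after the round trip, which is why re-assigning the property to itself is enough to make the handler work again. If {{ParseOptions}} behaves similarly, that would be consistent with the workaround noted in the reproduction script.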
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
