[ https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-25:
------------------------------
    Summary: [C++] Implement delimited file scanner / CSV reader  (was: C++: Implement delimited file scanner / CSV reader)

> [C++] Implement delimited file scanner / CSV reader
> ---------------------------------------------------
>
> Key: ARROW-25
> URL: https://issues.apache.org/jira/browse/ARROW-25
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> Like Parquet and binary file formats, text files will be an important data
> medium for converting to and from in-memory Arrow data.
> pandas has some (Apache-compatible) business logic we can learn from here (as
> one of the gold-standard CSV readers in production use):
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
> https://github.com/pydata/pandas/blob/master/pandas/parser.pyx
> While the pandas parser is very fast, the Arrow reader should be largely
> written from scratch to target the Arrow memory layout. We can still reuse
> certain aspects, like the tokenizer DFA (which originally came from the
> Python interpreter csv module implementation):
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)