[ https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-25:
------------------------------
    Fix Version/s: 0.11.0

> [C++] Implement delimited file scanner / CSV reader
> ---------------------------------------------------
>
>                 Key: ARROW-25
>                 URL: https://issues.apache.org/jira/browse/ARROW-25
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: csv, pull-request-available
>             Fix For: 0.11.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Like Parquet and binary file formats, text files will be an important data
> medium for converting to and from in-memory Arrow data.
> pandas has some (Apache-compatible) business logic we can learn from here (as
> one of the gold-standard CSV readers in production use):
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
> https://github.com/pydata/pandas/blob/master/pandas/parser.pyx
> While that code is very fast, this should be largely written from scratch to
> target the Arrow memory layout, but we can reuse certain aspects like the
> tokenizer DFA (which originally came from the Python interpreter csv module
> implementation):
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
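
For readers unfamiliar with the "tokenizer DFA" mentioned in the description, the idea can be sketched as a small state machine that reads one character at a time and decides whether it ends a field, ends a row, or belongs to the current field. The sketch below is illustrative only and is written in Python for brevity: the state names and the `tokenize_csv` function are hypothetical, it is not the pandas tokenizer or the Arrow implementation, and it assumes `\n` line endings and `""` as the escape for a quote inside a quoted field.

```python
from enum import Enum, auto


class State(Enum):
    START_FIELD = auto()       # at the beginning of a field
    IN_FIELD = auto()          # inside an unquoted field
    IN_QUOTED_FIELD = auto()   # inside a quoted field
    QUOTE_IN_QUOTED = auto()   # just saw a quote while inside a quoted field


def tokenize_csv(text, delimiter=",", quote='"'):
    """Split text into rows of string fields using a small state machine."""
    rows, row, field = [], [], []
    state = State.START_FIELD

    def end_field():
        row.append("".join(field))
        field.clear()

    def end_row():
        end_field()
        rows.append(list(row))
        row.clear()

    for ch in text:
        if state is State.START_FIELD:
            if ch == quote:
                state = State.IN_QUOTED_FIELD
            elif ch == delimiter:
                end_field()              # empty field
            elif ch == "\n":
                end_row()
            else:
                field.append(ch)
                state = State.IN_FIELD
        elif state is State.IN_FIELD:
            if ch == delimiter:
                end_field()
                state = State.START_FIELD
            elif ch == "\n":
                end_row()
                state = State.START_FIELD
            else:
                field.append(ch)
        elif state is State.IN_QUOTED_FIELD:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)         # delimiters and newlines are literal here
        else:  # State.QUOTE_IN_QUOTED
            if ch == quote:
                field.append(quote)      # doubled quote is a literal quote
                state = State.IN_QUOTED_FIELD
            elif ch == delimiter:
                end_field()
                state = State.START_FIELD
            elif ch == "\n":
                end_row()
                state = State.START_FIELD
            else:
                field.append(ch)         # lenient: accept text after a closing quote
                state = State.IN_FIELD

    # Flush a final row that has no trailing newline.
    if state is not State.START_FIELD or row or field:
        end_row()
    return rows
```

Calling `tokenize_csv('a,"b,c",d\n')` yields one row with the fields `a`, `b,c`, and `d`. A reader targeting the Arrow memory layout would presumably append field bytes to columnar buffers rather than build Python lists; the sketch only shows the state transitions.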