[ https://issues.apache.org/jira/browse/BEAM-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saurabh Joshi updated BEAM-10030:
---------------------------------
    Component/s: sdk-java-core

> Add CSVIO for Java SDK
> ----------------------
>
>                 Key: BEAM-10030
>                 URL: https://issues.apache.org/jira/browse/BEAM-10030
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas, sdk-java-core
>            Reporter: Saurabh Joshi
>            Priority: P2
>         Attachments: CSVSource.java
>
>
> Apache Beam has a TextIO class that reads text-based files line by line, delimited by a carriage return, a newline, or a carriage return followed by a newline. This approach does not support CSV files whose records span multiple lines, which happens when a field contains a newline inside double quotes.
> This Stack Overflow question is relevant to the feature proposed here:
> [https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam]
> I can think of two libraries we could use for handling CSV files. The first is the Apache Commons CSV library. Here is some example code that could use its CSVRecord class for reading and writing CSV records:
> {code:java}
> PipelineOptions options = PipelineOptionsFactory.create();
> Pipeline pipeline = Pipeline.create(options);
> PCollection<CSVRecord> records = pipeline.apply("ReadCSV", CSVIO.read().from("input.csv"));
> records.apply("WriteCSV", CSVIO.write().to("output.csv"));
> {code}
> Another library we could use is Jackson CSV, which lets users specify schemas for the columns:
> [https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv]
> The crux of the problem is this: can we read and write large CSV files in parallel, splitting the records and distributing them across many workers? If so, would it be good for Apache Beam to support reading and writing CSV files?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
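The multi-line-record problem described above can be sketched in plain Java. This is not Beam or Commons CSV code; `QuotedCsvSketch` and `splitRecords` are hypothetical names used only to illustrate why a quote-aware reader, unlike a newline-delimited one, keeps a record together when a quoted field contains a newline:

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvSketch {

    // Splits raw CSV text into logical records, honoring RFC 4180 quoting:
    // a newline inside double quotes belongs to the field, not a record break.
    // (Toggling on every quote also handles escaped "" pairs correctly for the
    // purpose of newline detection, since they toggle the state twice.)
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"line one\nline two\"\n2,plain";
        // A naive split on '\n' would produce 4 lines; quote-aware
        // splitting produces the 3 logical records.
        List<String> records = splitRecords(csv);
        System.out.println(records.size());
    }
}
```

This single-pass, stateful scan is also what makes parallel splitting hard: a worker starting at an arbitrary byte offset cannot tell whether it is inside a quoted field without context from earlier in the file, which is the core question the issue raises.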