[ https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545921#comment-17545921 ]
Danny McCormick commented on BEAM-51:
-------------------------------------

This isn't actually real, but this issue has been migrated to https://github.com/apache/beam/issues/17832

> Implement a CSV file reader
> ---------------------------
>
>                 Key: BEAM-51
>                 URL: https://issues.apache.org/jira/browse/BEAM-51
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas
>            Reporter: Dan Halperin
>            Priority: P3
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery.
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats
> These options are:
> fieldDelimiter: allows a custom delimiter (CSV vs. TSV, etc.). My guess is
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to
> set it to something else, or, perhaps more commonly, remove it entirely (by
> setting it to the empty string). For example, tab-separated files generally
> don't need quotes.
> allowQuotedNewlines: whether you can quote newlines. In the official CSV RFC,
> newlines can be quoted; that is, you can have "a", "b\n", "c" in a single
> line. This makes splitting of large CSV files impossible, so we should
> disallow quoted newlines by default unless the user really wants them (in
> which case, they'll get worse performance).
> allowJaggedRows: This allows inferring null if not enough columns are
> specified. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a
> user has _too_ many values for the schema, we will ignore the ones we don't
> recognize, rather than reporting an error for the row.
> skipHeaderRows: How many header lines are in the file.
> encoding: UTF-8 vs. latin1, etc.
> compression: gzip, bzip, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
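
For illustration only, here is a minimal Python sketch of how some of the options listed in the issue (fieldDelimiter, quote, allowJaggedRows, ignoreUnknownValues, skipHeaderRows) might behave on an in-memory string. It is not Beam code and not the proposed implementation: the function name `read_csv_rows`, its snake_case parameters, and the `schema` argument are hypothetical, and it relies on Python's standard `csv` module. allowQuotedNewlines, encoding, and compression are not modeled.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional


def read_csv_rows(
    text: str,
    schema: List[str],
    field_delimiter: str = ",",
    quote: str = '"',
    allow_jagged_rows: bool = False,
    ignore_unknown_values: bool = False,
    skip_header_rows: int = 0,
) -> Iterator[Dict[str, Optional[str]]]:
    """Hypothetical helper: yield one dict per data row, keyed by schema name."""
    source = io.StringIO(text)
    if quote:
        reader = csv.reader(source, delimiter=field_delimiter, quotechar=quote)
    else:
        # An empty quote string is taken to mean "no quoting at all".
        reader = csv.reader(
            source, delimiter=field_delimiter, quoting=csv.QUOTE_NONE
        )

    for index, row in enumerate(reader):
        if index < skip_header_rows:
            continue  # skip the leading header rows
        values: List[Optional[str]] = list(row)
        if len(values) < len(schema):
            if not allow_jagged_rows:
                raise ValueError(f"row {index}: too few columns")
            # Pad missing trailing columns with None (the inferred null).
            values += [None] * (len(schema) - len(values))
        elif len(values) > len(schema):
            if not ignore_unknown_values:
                raise ValueError(f"row {index}: too many columns")
            # Drop the values the schema does not know about.
            values = values[: len(schema)]
        yield dict(zip(schema, values))
```

For example, `read_csv_rows("id\tname\n1\tann\n2\n", ["id", "name"], field_delimiter="\t", quote="", allow_jagged_rows=True, skip_header_rows=1)` would treat the input as an unquoted tab-separated file, skip the header row, and pad the short second row with a null.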