[ https://issues.apache.org/jira/browse/BEAM-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340207#comment-16340207 ]
Eugene Kirpichov commented on BEAM-2776:
----------------------------------------

Reducing priority: This is easy to do manually using FileIO.match() + readMatches() and possibly doesn't warrant changes to TextIO, unless someone has a compelling argument to the contrary.

> TextIO should support reading header lines
> ------------------------------------------
>
> Key: BEAM-2776
> URL: https://issues.apache.org/jira/browse/BEAM-2776
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core, sdk-py-core
> Reporter: Eugene Kirpichov
> Priority: Minor
>
> Users frequently request the ability to skip some header rows when reading
> text files.
> https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
> https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
> https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
> https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
> https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow
> This is also relevant for reading file formats such as VCF, see thread
> https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E
> Python supports this partially https://github.com/apache/beam/pull/1771/files
> via skip_header_lines, but the header lines can have useful content, and the
> number of header lines is not fixed (in VCF).
> We should figure out a good API for this and support this natively in TextIO.
> The API decisions would be:
> - How do we specify how much of the beginning of each file is the header:
> options could be e.g. a certain number of lines; or lines that start with a
> certain character; or a custom predicate.
> - How do we make the header contents accessible to a user of TextIO. Since
> the header can be different in each file, we can't return it as a
> PCollectionView<List<String>>. Instead I suppose, when you use a header,
> you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or
> something like that for parsing (header, line) -> user type. Note that
> currently TextIO.Read does not support returning a user type anyway, so
> that'd need to be done too.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
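
For illustration only, here is a rough plain-Python sketch of the per-file logic the issue describes: deciding where the header ends (a fixed number of lines, a prefix character, or a custom predicate) and then parsing each remaining line together with that file's header. This is not Beam or TextIO API. The names split_header, first_n_lines, starts_with and read_with_header are hypothetical, and the sample data is made up. Something along these lines could presumably run once per file inside whatever processes the output of FileIO.match() + readMatches(), but that wiring is an assumption and is not shown.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical predicate type: (line index, line text) -> "is this a header line?"
HeaderPredicate = Callable[[int, str], bool]


def first_n_lines(n: int) -> HeaderPredicate:
    """Header is a fixed number of leading lines."""
    return lambda index, line: index < n


def starts_with(prefix: str) -> HeaderPredicate:
    """Header is every leading line that starts with the given prefix."""
    return lambda index, line: line.startswith(prefix)


def split_header(
    lines: Iterable[str], is_header: HeaderPredicate
) -> Tuple[List[str], Iterator[str]]:
    """Consume leading header lines; return them plus an iterator over the rest."""
    it = iter(lines)
    header: List[str] = []
    for index, line in enumerate(it):
        if not is_header(index, line):
            # First non-header line: put it back in front of the remaining lines.
            return header, chain([line], it)
        header.append(line)
    return header, iter(())  # the whole file was header


def read_with_header(
    lines: Iterable[str],
    is_header: HeaderPredicate,
    parse: Callable[[List[str], str], T],
) -> Iterator[T]:
    """Apply parse(header, line) to every non-header line of one file."""
    header, body = split_header(lines, is_header)
    for line in body:
        yield parse(header, line)


# CSV-like file with exactly one header line (made-up data).
csv_lines = ["name,age", "ada,36", "alan,41"]
csv_records = list(
    read_with_header(
        csv_lines,
        first_n_lines(1),
        lambda header, line: dict(zip(header[0].split(","), line.split(","))),
    )
)

# VCF-like file: a variable number of '#' header lines, the last one naming columns.
vcf_lines = ["##fileformat=VCFv4.2", "#CHROM\tPOS", "1\t100"]
vcf_records = list(
    read_with_header(
        vcf_lines,
        starts_with("#"),
        lambda header, line: dict(
            zip(header[-1].lstrip("#").split("\t"), line.split("\t"))
        ),
    )
)
```

Because the header is passed to the parse function alongside each line, every file can carry its own header, which is the reason given above for not exposing it as a single side input.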