[ https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145054#comment-16145054 ]
Ryan Skraba commented on BEAM-2802:
-----------------------------------

Hello! Just to chime in -- I'm part of the team that asked Étienne to investigate this.

We have some experience with data formats used by customers to contain tabular data in text files. I couldn't pass judgement on whether they're "good ideas", just that this is a valid use case! There's _probably_ a satisfactory standard for CSV-like data somewhere, but it's definitely not universal. In any case, a lot of CSV formats just aren't appropriate for splitting or big data (looking at you, RFC 4180).

The crux of the problem is having newlines inside the record value (or, more generally, having the record delimiter inside a field). We've encountered solutions like using {{\000}} for record delimiters, or control characters that do not appear in ordinary ASCII text (like the {{^B}} above, used to distinguish *real* newline record delimiters from newlines in the record). We've encountered record separators like {{\n\-\-\n}} to separate records on different lines, or just {{\-\-}} for a stream of whitespace-free data. All of these are human-readable without much difficulty, and (unfortunately) easy enough to have been invented and implemented in existing tools and systems.

I've mentioned tabular and CSV-like data -- we're only interested in having TextIO extract each record correctly here. Splitting the record into fields can and should occur downstream in a ParDo.

All of the existing features of TextIO (such as compression, watching, dynamic destinations) apply to text files that use a custom delimiter, so it seems like a natural place to add this functionality. Custom record delimiters are a common option in Unix command-line tools, and are configurable in Hadoop's TextInputFormat, so it shouldn't be unexpected or confusing for the user to have this option in TextIO. The performance impact should be negligible with Étienne's proposal above.
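
To make the record-extraction step concrete, here is a minimal sketch of splitting a byte stream on a fixed, multi-byte delimiter. It is plain Python, not Beam code and not Étienne's proposal; the name {{read_records}}, the {{chunk_size}} parameter, and the sample delimiter ({{^B}} followed by a newline) are assumptions made up for illustration. It only shows that a delimiter straddling two read buffers has to be handled, and that newlines inside a record are left alone.

```python
import io
from typing import BinaryIO, Iterator


def read_records(stream: BinaryIO, delimiter: bytes,
                 chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield records from `stream`, split on a fixed multi-byte `delimiter`.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Every piece before the last one is a complete record. The last
        # piece may be a partial record (or end with half a delimiter), so
        # it stays in the buffer until more data arrives.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    # Data after the final delimiter, if any, is the last record.
    if buffer:
        yield buffer


# Assumed sample: records end with ^B (0x02) followed by a newline, so the
# plain newlines inside each record are kept as part of the record.
data = b"a\nb\x02\nc\nd\x02\n"
records = list(read_records(io.BytesIO(data), b"\x02\n", chunk_size=3))
```

The tiny {{chunk_size}} in the sample forces the delimiter to fall across a buffer boundary. Splitting each record into fields would still happen downstream.
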
I doubt there would be any measurable performance impact when no custom delimiter is used (although this can be checked).

**TL;DR:** These formats are found "in the wild", and a fixed, multi-byte custom delimiter is probably the single best step to connecting a lot of these formats into a Beam job.

> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO uses {{\r}}, {{\n}}, {{\r\n}}, or a mix of these to split a
> text file into PCollection elements. It might happen that a record is spread
> across more than one line. In that case we should be able to specify a custom
> record delimiter to be used in place of the default ones.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)