[ 
https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145054#comment-16145054
 ] 

Ryan Skraba commented on BEAM-2802:
-----------------------------------

Hello!  Just to chime in -- I'm part of the team that asked Étienne to 
investigate this.  We have some experience with data formats used by customers 
to contain tabular data in text files.   

I can't pass judgement on whether they're "good ideas", only say that this is a 
valid use case!  There's _probably_ a satisfactory standard for CSV-like data 
somewhere, but it's definitely not universal.  In any case, a lot of CSV 
formats just aren't appropriate for splitting or big data (looking at you, 
RFC 4180).

The crux of the problem is having newlines inside the record value (or, more 
generally, having the record delimiter inside the field).  We've encountered 
solutions like using {{\000}} for record delimiters, or control characters 
outside of ASCII data (like the {{^B}} above used to distinguish *real* newline 
record delimiters from newlines in the record).  We've encountered record 
separators like {{\n\-\-\n}} to separate records on different lines, or just 
{{\-\-}} for a stream of whitespace-free data.  All of these are human-readable 
without much difficulty, and (unfortunately) easy enough to have been invented 
and implemented in existing tools and systems.
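
As a rough illustration of the record extraction described above, here is a 
minimal Python sketch (not Beam code, and not Étienne's proposal) of reading 
records separated by a fixed, multi-byte delimiter such as {{\n\-\-\n}}.  The 
function name {{read_records}} and the {{chunk_size}} parameter are 
hypothetical.  The point it tries to show is that a delimiter may be cut in 
half by a read boundary, so the unfinished tail has to stay buffered.

```python
import io
from typing import BinaryIO, Iterator


def read_records(stream: BinaryIO, delimiter: bytes,
                 chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield records split on a fixed, possibly multi-byte, delimiter.

    Hypothetical helper for illustration only; newlines inside a record
    are preserved because only `delimiter` ends a record.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete. The tail may be
        # a partial record, or may end with half a delimiter, so keep it.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    if buffer:
        # Final record that was not followed by a delimiter.
        yield buffer


# Two records, the first of which contains an embedded newline.
data = b"line one\nstill record 1\n--\nrecord 2\n--\n"
records = list(read_records(io.BytesIO(data), b"\n--\n", chunk_size=3))
```

The tiny {{chunk_size}} in the usage lines is only there to force delimiters to 
straddle read boundaries.  Splitting a file across workers would additionally 
need a rule for which split owns a record that crosses a split boundary, which 
this sketch does not attempt.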

I've mentioned tabular and CSV-like data -- we're only interested in having 
TextIO extract the record correctly here.  Splitting the record into fields 
can and should occur downstream in a ParDo.
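
To make that separation of concerns concrete, here is a small Python sketch of 
the kind of per-record function that could sit in the downstream step.  The 
name {{split_fields}}, the {{|}} field separator and the UTF-8 encoding are 
assumptions for illustration; the real field format depends on the customer 
data.

```python
from typing import List


def split_fields(record: bytes, field_separator: bytes = b"|") -> List[str]:
    """Split one already-extracted record into decoded fields.

    Hypothetical helper: the reader has already decided where the record
    ends, so a newline inside a field is just ordinary data here.
    """
    return [field.decode("utf-8") for field in record.split(field_separator)]


# A single record whose second field contains a newline.
fields = split_fields(b"id-1|first line\nsecond line|42")
```

Because the record boundary was settled upstream, this step never needs to 
look at more than one element at a time.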

All of the existing features of TextIO (such as compression, watching, dynamic 
destinations) apply to text files that use a custom delimiter, so it seems like 
a natural place to add this functionality.  Custom record delimiters are a 
common option in unix command line tools, and are configurable in the Hadoop 
TextInputFormat, so it shouldn't be unexpected or confusing for the user to 
have this option in TextIO.

The performance impact should be negligible with Étienne's proposal above.  I 
doubt there would be any measurable impact if you aren't using a custom 
delimiter (although this can be checked).

**TL;DR:** These formats are found "in the wild", and a fixed, multi-byte 
custom delimiter is probably the single best step toward connecting a lot of 
them to a Beam job.


> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO uses {{\r}}, {{\n}} or {{\r\n}}, or a mix of these, to split a 
> text file into PCollection elements. It might happen that a record is spread 
> across more than one line. In that case we should be able to specify a custom 
> record delimiter to be used in place of the default ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
