[ 
https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144944#comment-16144944
 ] 

Etienne Chauchot commented on BEAM-2802:
----------------------------------------

I agree the {{\002\n}} example seems odd. But IMHO we should let users specify 
whatever delimiter they want, and not restrict them as long as this freedom 
has a limited performance / maintenance cost. 

For the text file definition: I think that as long as a file does not contain 
binary data and is not in a common structured format like XML or JSON, it can 
be considered a plain text file, whether or not it is easily readable by a 
human or uses newline delimiters.

I also think that files with custom delimiters are quite a common user need 
(we have several client use cases, and the other related PR also tends to 
show the need), so IMHO it merits being included in the Beam SDK.

I would be reluctant to ship a specific TextIO to clients, for maintenance 
reasons.

I could create another IO, but it seems a pity to duplicate the whole 
structure of an IO when the change is 15 lines of code in 
TextIO#findSeparatorBounds(). Maybe I can submit the PR that updates TextIO, 
and we can discuss the maintenance, performance and code location aspects in 
the PR.
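
To make the discussion more concrete, here is a rough, standalone sketch of 
the behaviour being proposed. It is not the actual TextIO code: the class 
{{DelimiterSketch}} and its methods are hypothetical names used only for 
illustration, and it works on an in-memory byte array rather than on a 
buffered file source. With no delimiter given it falls back to splitting on 
{{\r}}, {{\n}} or {{\r\n}}; with a custom delimiter, newlines inside a record 
are kept.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Beam's TextIO implementation. */
final class DelimiterSketch {

  /** Index of the first occurrence of delimiter at or after from, or -1 if absent. */
  static int indexOf(byte[] data, byte[] delimiter, int from) {
    for (int i = from; i + delimiter.length <= data.length; i++) {
      int j = 0;
      while (j < delimiter.length && data[i + j] == delimiter[j]) {
        j++;
      }
      if (j == delimiter.length) {
        return i;
      }
    }
    return -1;
  }

  /** Splits data into records; a null delimiter means \r, \n or \r\n. */
  static List<String> split(byte[] data, byte[] delimiter) {
    if (delimiter != null && delimiter.length == 0) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    List<String> records = new ArrayList<>();
    int start = 0;
    while (start < data.length) {
      int end;  // exclusive end of the current record
      int next; // start of the following record
      if (delimiter == null) {
        // Default behaviour: stop at the first \r or \n.
        end = start;
        while (end < data.length && data[end] != '\r' && data[end] != '\n') {
          end++;
        }
        next = end;
        if (next < data.length) {
          boolean crlf =
              data[next] == '\r' && next + 1 < data.length && data[next + 1] == '\n';
          next += crlf ? 2 : 1;
        }
      } else {
        // Custom behaviour: stop only at the full delimiter sequence.
        int hit = indexOf(data, delimiter, start);
        end = hit < 0 ? data.length : hit;
        next = hit < 0 ? data.length : hit + delimiter.length;
      }
      records.add(new String(data, start, end - start, StandardCharsets.UTF_8));
      start = next;
    }
    return records;
  }
}
```

For example, calling {{split}} with the delimiter bytes {{\002\n}} keeps a 
plain {{\n}} inside a record, which is the multi-line record case described 
in the issue. A real implementation would also have to handle a delimiter 
that straddles two read buffers, which this sketch ignores.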


> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO uses {{\r}}, {{\n}} or {{\r\n}}, or a mix of them, to split a 
> text file into PCollection elements. It might happen that a record is spread 
> across more than one line. In that case we should be able to specify a custom 
> record delimiter to be used in place of the default ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
