[ https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141566#comment-16141566 ]
Etienne Chauchot commented on BEAM-2802:
----------------------------------------

I have read the PR above and I agree with your comments, in particular the ones on performance and maintenance (complex generic code). Compared to the above PR, here is what I have started coding:
- performance: I only add one loop in {{findSeparatorBounds()}} to iterate over the bytes of the custom delimiter if it is set. If it is not set, the current code ({{\r\n}}) is called.
- maintenance: I decided not to make the code generic to support parsing both {{\r\n}} and the custom delimiter, but rather to keep the code path as simple as possible with a simple {{if (customDelimiter != null)}} (see above).
- no set of delimiters, but rather a single {{byte[]}} delimiter that is either set or not set.
- there is only one delimiter at a time, either {{\r\n}} or the custom delimiter, because we do not want to split twice if there are both new lines and the custom separator in the text file, in particular if a record is spread across more than one line.

One could argue that the custom delimiter is not needed because we could split using new lines and then use a {{ParDo}} to split once again using the custom delimiter. But if a record is spread across multiple lines, the newline split has already cut it apart, so this approach will generate 2 records in the output PCollection instead of one.

> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO uses {{\r}}, {{\n}}, {{\r\n}} or a mix of these to split a
> text file into PCollection elements. It might happen that a record is spread
> across more than one line. In that case we should be able to specify a custom
> record delimiter to be used in place of the default ones.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
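
For illustration, the single-delimiter approach described in the comment could be sketched roughly as below. This is a minimal, hypothetical sketch and not the actual TextIO code: the class {{DelimiterSplitSketch}} and the methods {{split()}} and {{matchesAt()}} are invented names, the sketch works on an in-memory {{byte[]}} rather than on a buffered file stream, and it ignores source split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "one delimiter at a time" record splitting. */
final class DelimiterSplitSketch {

  /**
   * Splits data into records. If customDelimiter is non-null, only that byte
   * sequence separates records; otherwise \r, \n or \r\n are used.
   */
  static List<String> split(byte[] data, byte[] customDelimiter) {
    List<String> records = new ArrayList<>();
    int start = 0; // first byte of the current record
    int i = 0;
    while (i < data.length) {
      int sepLength = 0;
      if (customDelimiter != null) {
        // Custom path: the only extra work is one loop over the delimiter bytes.
        if (matchesAt(data, i, customDelimiter)) {
          sepLength = customDelimiter.length;
        }
      } else if (data[i] == '\n') {
        sepLength = 1;
      } else if (data[i] == '\r') {
        // Treat \r\n as a single separator.
        sepLength = (i + 1 < data.length && data[i + 1] == '\n') ? 2 : 1;
      }
      if (sepLength > 0) {
        records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
        i += sepLength;
        start = i;
      } else {
        i++;
      }
    }
    // Emit a trailing record that is not terminated by a separator.
    if (start < data.length) {
      records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
    }
    return records;
  }

  /** Returns true if delimiter occurs in data at offset; an empty delimiter never matches. */
  static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
    if (delimiter.length == 0 || offset + delimiter.length > data.length) {
      return false;
    }
    for (int j = 0; j < delimiter.length; j++) {
      if (data[offset + j] != delimiter[j]) {
        return false;
      }
    }
    return true;
  }
}
```

With the delimiter {{|}}, the input {{a\nb|c\nd|}} yields the two records {{a\nb}} and {{c\nd}}, so a record spread across two lines stays in one piece. With a null delimiter the same input is split on the new lines only.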