[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fokko Driesprong updated FLINK-9749:
    Description:
The BucketingSink has a series of deficits at the moment.

Due to the long list of issues, I would suggest adding a new StreamingFileSink with a new and cleaner design.

h3. Encoders, Parquet, ORC
 - It only efficiently supports row-wise data formats (avro, json, sequence files).
 - Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
 - The encoders are part of the {{flink-connector-filesystem}} project, rather than in orthogonal formats projects. This blows up the dependencies of the {{flink-connector-filesystem}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.

h3. Use of FileSystems
 - The BucketingSink works only on Hadoop's FileSystem abstraction; it does not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems.
 - The sink hence needs Hadoop as a dependency.
 - The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
 - The sink relies on enumerating and counting files, rather than maintaining its own state, making it less efficient.

h3. Correctness and Efficiency on S3
 - The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
 - The BucketingSink relies on persisting streams at intermediate points. This does not work properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
 - The valid length file makes things hard for consumers of the data and should be dropped.

We track this design in a series of sub issues.

  was:
The BucketingSink has a series of deficits at the moment.

Due to the long list of issues, I would suggest to add a new StreamingFileSink with a new and cleaner design

h3. Encoders, Parquet, ORC
 - It only efficiently supports row-wise data formats (avro, jso, sequence files.
 - Efforts to add (columnar) compression for blocks of data is inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
 - The encoders are part of the \{{flink-connector-filesystem project}}, rather than in orthogonal formats projects. This blows up the dependencies of the \{{flink-connector-filesystem project}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.

h3. Use of FileSystems
 - The BucketingSink works only on Hadoop's FileSystem abstraction not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems
 - The sink hence needs Hadoop as a dependency
 - The sink relies on "trying out" whether truncation works, which requires write access to the users working directory
 - The sink relies on enumerating and counting files, rather than maintaining its own state, making less efficient

h3. Correctness and Efficiency on S3
 - The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
 - The BucketingSink relies on persisting streams at intermediate points. This is not working properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
 - The valid length file makes it hard for consumers of the data and should be dropped

We track this design in a series of sub issues.

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Metzger updated FLINK-9749:
----------------------------------
    Component/s:     (was: Connectors / Common)
                 Connectors / FileSystem

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
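The issue description notes that the sink "relies on enumerating and counting files, rather than maintaining its own state". As a rough illustration of that contrast only, here is a minimal Python sketch. It is not Flink code: `next_part_by_listing`, `PartCounter`, the `part-<n>` file naming, and `demo` are all hypothetical names chosen for this example. The first function derives the next part index from a directory listing, which costs one listing per call and is only correct if the listing is strongly consistent. The class keeps the index as explicit state that could be stored in a checkpoint and restored.

```python
import os
import tempfile


def next_part_by_listing(bucket_dir: str, prefix: str = "part-") -> int:
    """Derive the next part index by enumerating the bucket directory.

    Hypothetical helper: needs one directory listing per call and is only
    correct when the listing reflects every file written so far.
    """
    indices = [
        int(name[len(prefix):])
        for name in os.listdir(bucket_dir)
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    ]
    return max(indices, default=-1) + 1


class PartCounter:
    """Hypothetical helper that keeps the next part index as explicit state."""

    def __init__(self, next_index: int = 0) -> None:
        self.next_index = next_index

    def allocate(self) -> int:
        # Hand out the current index and advance; no file system access.
        index = self.next_index
        self.next_index += 1
        return index

    def snapshot(self) -> int:
        # The value that would be stored in a checkpoint.
        return self.next_index

    @classmethod
    def restore(cls, snapshot: int) -> "PartCounter":
        # Rebuild the counter from a checkpointed value.
        return cls(snapshot)


def demo():
    """Write three empty part files, then compare both approaches."""
    with tempfile.TemporaryDirectory() as bucket:
        counter = PartCounter()
        for _ in range(3):
            index = counter.allocate()
            open(os.path.join(bucket, f"part-{index}"), "w").close()
        listed = next_part_by_listing(bucket)
        restored = PartCounter.restore(counter.snapshot())
        return listed, restored.allocate()
```

On a local, strongly consistent file system both approaches agree on the next index. The state-based counter reaches it without listing the directory, so it would not depend on the consistency of file enumeration.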
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tzu-Li (Gordon) Tai updated FLINK-9749:
---------------------------------------
    Fix Version/s:     (was: 1.6.3)

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9749:
---------------------------------
    Fix Version/s:     (was: 1.6.1)

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.2
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9749:
---------------------------------
    Fix Version/s: 1.6.2

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.2
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler updated FLINK-9749:
------------------------------------
    Fix Version/s:     (was: 1.6.0)
                   1.6.1

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.1
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9749:
---------------------------------
    Fix Version/s:     (was: 1.7.0)
                   1.6.0

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.0
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9749:
---------------------------------
    Fix Version/s:     (was: 1.6.0)
                   1.7.0

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.7.0
[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Ewen updated FLINK-9749:
    Description:
The BucketingSink has a series of deficits at the moment.

Due to the long list of issues, I would suggest to add a new StreamingFileSink with a new and cleaner design

h3. Encoders, Parquet, ORC
 - It only efficiently supports row-wise data formats (avro, jso, sequence files.
 - Efforts to add (columnar) compression for blocks of data is inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
 - The encoders are part of the \{{flink-connector-filesystem project}}, rather than in orthogonal formats projects. This blows up the dependencies of the \{{flink-connector-filesystem project}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.

h3. Use of FileSystems
 - The BucketingSink works only on Hadoop's FileSystem abstraction not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems
 - The sink hence needs Hadoop as a dependency
 - The sink relies on "trying out" whether truncation works, which requires write access to the users working directory
 - The sink relies on enumerating and counting files, rather than maintaining its own state, making less efficient

h3. Correctness and Efficiency on S3
 - The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
 - The BucketingSink relies on persisting streams at intermediate points. This is not working properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
 - The valid length file makes it hard for consumers of the data and should be dropped

We track this design in a series of sub issues.

  was:
The BucketingSink has a series of deficits at the moment.

Due to the long list of issues, I would suggest to add a new StreamingFileSink with a new and cleaner design

h3. Encoders, Parquet, ORC
 - It only efficiently supports row-wise data formats (avro, jso, sequence files.
 - Efforts to add (columnar) compression for blocks of data is inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
 - The encoders are part of the \{{flink-connector-filesystem project}}, rather than in orthogonal formats projects. This blows up the dependencies of the \{{flink-connector-filesystem project}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.

h3. Use of FileSystems
 - The BucketingSink works only on Hadoop's FileSystem abstraction not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems
 - The sink hence needs Hadoop as a dependency
 - The sink relies on "trying out" whether truncation works, which requires write access to the users working directory
 - The sink relies on enumerating and counting files, rather than maintaining its own state, making less efficient

h3. Correctness and Efficiency on S3
 - The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
 - The BucketingSink relies on persisting streams at intermediate points. This is not working properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
 - The valid length file makes it hard for consumers of the data and should be dropped

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.0