Daniel Halperin created BEAM-60:
-----------------------------------

             Summary: FileBasedSource/IOChannelFactory: Custom glob expansion
                 Key: BEAM-60
                 URL: https://issues.apache.org/jira/browse/BEAM-60
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-core
            Reporter: Daniel Halperin
            Assignee: Davor Bonaci


Many cloud and distributed filesystems are eventually consistent, for instance 
Amazon S3 and Google Cloud Storage.

To work around this, many systems that produce files, such as Beam's 
FileBasedSink or Google BigQuery, provide ways to determine the number and 
set of files produced. E.g.,

* Beam FileBasedSink uses -00000-of-NNNNN
* BigQuery export jobs use -000000, -000001, -000002, ... until an empty file 
is produced
* Another system may produce a file with a .filelist suffix that lists all 
files.
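
To illustrate the first convention, here is a minimal sketch (not existing 
Beam code) of how the full set of shard names could be derived from a single 
visible shard. The class name ShardNames and the method expectedShards are 
hypothetical, and the sketch assumes names of the form 
<prefix>-<index>-of-<count><suffix>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of the Beam SDK.
final class ShardNames {
    // Assumed name layout: <prefix>-<index>-of-<count><suffix>
    private static final Pattern SHARD =
        Pattern.compile("^(.*)-(\\d+)-of-(\\d+)(.*)$");

    /** Returns every shard name implied by one observed shard name. */
    static List<String> expectedShards(String observed) {
        Matcher m = SHARD.matcher(observed);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a sharded name: " + observed);
        }
        String prefix = m.group(1);
        int width = m.group(2).length();      // zero-padding width of the index
        String countText = m.group(3);        // kept as text to preserve padding
        int count = Integer.parseInt(countText);
        String suffix = m.group(4);

        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(String.format(
                Locale.ROOT, "%s-%0" + width + "d-of-%s%s",
                prefix, i, countText, suffix));
        }
        return names;
    }
}
```

With this, seeing only out-00001-of-00003.txt in a stale listing is enough to 
know that out-00000-of-00003.txt and out-00002-of-00003.txt must also exist.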

Users should be able to supply a glob to FileBasedSource and, additionally, a 
"glob expander" that provides a custom implementation of file expansion. That 
way, e.g., Beam pipelines can be run back to back, each consuming the output 
of the previous one, on an eventually consistent filesystem without data 
loss.
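
One possible shape for such an extension point, as a rough sketch only: the 
GlobExpander interface, the SequentialExpander class, and the size-lookup 
callback below are all hypothetical names and not an existing Beam API. The 
example follows the BigQuery-style convention above and assumes the empty 
file is a terminating sentinel that need not be read. It probes each expected 
name directly instead of trusting a directory listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.ToLongFunction;

// Hypothetical sketch of a pluggable glob expander; not part of the Beam SDK.
final class GlobExpanderSketch {

    /** Turns a user-supplied glob into the exact list of files to read. */
    public interface GlobExpander {
        List<String> expand(String glob);
    }

    /**
     * Expands "prefix-*" into prefix-000000, prefix-000001, ... and stops at
     * the first empty file, which is assumed to be a sentinel.
     */
    public static final class SequentialExpander implements GlobExpander {
        // Assumed callback: file size in bytes, or a negative value if the
        // file is not (yet) visible.
        private final ToLongFunction<String> sizeLookup;

        public SequentialExpander(ToLongFunction<String> sizeLookup) {
            this.sizeLookup = sizeLookup;
        }

        @Override
        public List<String> expand(String glob) {
            if (!glob.endsWith("-*")) {
                throw new IllegalArgumentException("Expected a glob ending in -*: " + glob);
            }
            // Drop the trailing '*' but keep the '-'.
            String prefix = glob.substring(0, glob.length() - 1);
            List<String> files = new ArrayList<>();
            for (int i = 0; ; i++) {
                String name = prefix + String.format(Locale.ROOT, "%06d", i);
                long size = sizeLookup.applyAsLong(name);
                if (size < 0) {
                    // The output is incomplete or not yet visible; fail
                    // rather than silently dropping data.
                    throw new IllegalStateException("File not visible yet: " + name);
                }
                if (size == 0) {
                    return files; // sentinel reached, the set is complete
                }
                files.add(name);
            }
        }
    }
}
```

A caller would pass the expander alongside the glob. Failing on a missing 
file (rather than returning a partial list) is one design choice that turns 
listing inconsistency into a retryable error instead of data loss.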



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
