Thanks for the thorough answers. It all sounds good to me.

On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:

> Thanks Kenn for the feedback and questions.
>
> I responded inline.
>
> On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > I really like this document. It is easy to read and informative. Three
> > things not addressed by the document:
> >
> > 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> > outlined in terms of the new API with pseudocode.
>
>
> (I am writing the pseudocode directly against the FileSystem interface to
> demonstrate. However, clients will use the FileSystems utility. This gives
> us a layer between the file system providers' interface and the client
> interface. We can add utility functions to FileSystems for common usage
> patterns as needed.)
>
> The major Beam use cases are the following:
> A. FileBasedSource:
> // a. Get input URIs and file sizes from user-provided specs.
> // Note: I updated match() to be a bulk operation after I sent my last
> // email.
> List<MatchResult> results = match(specList);
> List<Metadata> inputMetadataList = FluentIterable.from(results)
>     .transformAndConcat(
>         new Function<MatchResult, Iterable<Metadata>>() {
>           @Override
>           public Iterable<Metadata> apply(MatchResult result) {
>             return Arrays.asList(result.metadata());
>           }
>         })
>     .toList();
>
> // b. Read from a start offset to support the source splitting.
> SeekableByteChannel seekChannel = open(fileUri);
> seekChannel.position(source.getStartOffset());
> seekChannel.read(...);
>
> B. FileBasedSink:
> // bulk rename temporary files to output files
> rename(tempUris, outputUris);
>
> C. General file operations:
> a. resolve paths
> b. create file to write, open file to read (for example in tests).
> c. bulk delete files/directories
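
Use case A.b above (reading from a start offset) can be sketched with plain java.nio, which is where SeekableByteChannel comes from. This is only an illustration under assumptions: the class name OffsetReadSketch and the helper readRange() are hypothetical, and a local temp file stands in for whatever channel a file system provider would return from open().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the proposed API.
final class OffsetReadSketch {

  /** Reads the bytes in [startOffset, endOffset), or fewer if the file ends early. */
  static byte[] readRange(Path file, long startOffset, long endOffset) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
      // Seek to the start of this split before reading.
      channel.position(startOffset);
      ByteBuffer buffer = ByteBuffer.allocate((int) (endOffset - startOffset));
      // read() may return fewer bytes than requested, so loop until full or EOF.
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {
        // keep reading
      }
      buffer.flip();
      byte[] out = new byte[buffer.remaining()];
      buffer.get(out);
      return out;
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("offset-read", ".txt");
    try {
      Files.write(file, "0123456789".getBytes(StandardCharsets.UTF_8));
      // Prints "3456".
      System.out.println(new String(readRange(file, 3, 7), StandardCharsets.UTF_8));
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

A split that extends past the end of the file simply returns the bytes that exist, which is the behavior a source would want for its last split.
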
>
>
>
> 2. Related work. How does this differ from other filesystem APIs and why?
>
> We need three sets of functionality:
> 1. resolve paths.
> 2. read and write channels.
> 3. bulk file management operations (bulk delete/rename/match).
>
> They are available from Java nio, the hadoop FileSystem APIs, and other
> standard libraries such as java.net.URI.
>
> The current IOChannelFactory interface uses Java nio for (1) and (2), and
> defines its own interface for (3).
>
> In my redesign, I made the following choices:
> For (1), I replaced Java nio with URI, because it is standardized and
> precise and doesn't require additional implementation of a Path interface
> from file system providers.
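
To illustrate choice (1), here is a small sketch of resolving a child path with java.net.URI alone, with no Path implementation needed from a provider. The class name UriResolveSketch, the helper resolveChild(), and the gs://bucket/... locations are made up for the example.

```java
import java.net.URI;

// Hypothetical helper, not part of the proposed API.
final class UriResolveSketch {

  /** Resolves a relative child name against a directory URI. */
  static URI resolveChild(URI directory, String child) {
    return directory.resolve(child);
  }

  public static void main(String[] args) {
    // A trailing slash marks the base as a directory, so the child is appended.
    // Prints "gs://bucket/output/part-00000".
    System.out.println(resolveChild(URI.create("gs://bucket/output/"), "part-00000"));

    // Without the trailing slash, the last segment is replaced instead.
    // Prints "gs://bucket/part-00000".
    System.out.println(resolveChild(URI.create("gs://bucket/output"), "part-00000"));
  }
}
```

The second call shows the one thing callers need to watch for: standard URI resolution treats a base without a trailing slash as a file, not a directory.
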
>
> For (2), I kept the use of Java nio (Writable/SeekableByteChannel), since
> I don't see anything that needs improving and I don't see any better
> alternatives (hadoop's FSDataInput/OutputStream provide the same
> functionality, but require additional dependencies).
>
> For (3), the reasons I didn't choose Java nio or hadoop are:
> 1. Beam needs a bulk operations API for better performance, whereas the
> Java nio and hadoop FileSystem APIs are single-file based.
> 2. The APIs should be file system agnostic. For example, we can use URI
> instead of Path.
> 3. The APIs should be minimal and easy for file system providers to
> implement.
> 4. It introduces fewer dependencies.
> 5. It is easy to build an adaptor based on the Java nio or hadoop
> interfaces.
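
Point 5 can be sketched as follows: a bulk rename/delete surface that takes URIs and is implemented by looping over the single-file java.nio calls. The class NioBulkAdaptorSketch and its method shapes are assumptions for illustration, not the signatures from the design doc, and the sketch only handles URIs that java.nio can map to a Path (for example file: URIs).

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.List;

// Hypothetical adaptor, not part of the proposed API.
final class NioBulkAdaptorSketch {

  /** Renames sources.get(i) to destinations.get(i) for every i. */
  static void rename(List<URI> sources, List<URI> destinations) throws IOException {
    if (sources.size() != destinations.size()) {
      throw new IllegalArgumentException("sources and destinations must have the same size");
    }
    // Java nio has no bulk move, so the adaptor issues one call per file.
    for (int i = 0; i < sources.size(); i++) {
      Files.move(Paths.get(sources.get(i)), Paths.get(destinations.get(i)));
    }
  }

  /** Deletes every URI; Files.delete throws NoSuchFileException for a missing file. */
  static void delete(Collection<URI> uris) throws IOException {
    for (URI uri : uris) {
      Files.delete(Paths.get(uri));
    }
  }
}
```

An adaptor like this gets the interface but not the performance: a provider with a native batch call would implement the bulk methods directly rather than loop.
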
>
> 3. Discussion of non-Java languages. It would be good to know what classes
> > in e.g. Python we might use in place of URI, SeekableByteChannel, etc.
>
> I don't want to mislead people here without a thorough investigation. As
> you can see from your second question, that would require iterations on
> design and prototyping.
>
> I didn't introduce any Java-specific requirements in the redesign.
> Resolving paths, seeking with channels or streams, and file management
> operations are language independent. And I am pretty sure there are python
> libraries for them.
>
> However, I am happy to hear thoughts and get help from people working on
> the python sdk.
>
>
> > On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
> >
> > > I have received a lot of comments in "Part 1: IOChannelFactory
> > > Redesign" [1]. And, I have updated the design based on the feedback.
> > >
> > > > Now, I feel it is close to being ready for implementation, and I would
> > > > like to summarize the changes:
> > > > 1. Replaced FilePath with URI for resolving file paths.
> > > > 2. Required match(String spec) to handle ambiguities in user-provided
> > > > strings (see the match() javadoc in the design doc for details).
> > > > 3. Changed Metadata to use the Future.get() paradigm, and removed
> > > > exception().
> > > > 4. Changed methods on FileSystem interface to be protected (visible for
> > > > implementors), and created FileSystems utility (visible for callers).
> > > > 5. Simplified the FileSystem interface by moving operation options, such
> > > > as DeleteOptions and MatchOptions, to the FileSystems utility.
> > > > 6. Simplified the FileSystem interface by requiring certain behaviors,
> > > > such as creating recursively and throwing for missing files.
> > >
> > > Any thoughts / feedback?
> > > --
> > > Pei
> > >
> > > [1]
> > > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> > > XJsVG3qel2lhdKTknmZ_7M/edit#
> > >
> > > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
> > >
> > > > Thanks JB for the feedback.
> > > >
> > > > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it
> > > > > will make a range of file systems available in Beam.
> > > > >
> > > > > And, people can choose to implement BeamFileSystem directly to get the
> > > > > best performance (for example, by providing bulk operations).
> > > >
> > > > --
> > > > Pei
> > > >
> > > >
> > > >
> > > > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > > wrote:
> > > >
> > > >> Hi Pei,
> > > >>
> > > > >> Thinking about that again, I understand that the purpose of the Beam
> > > > >> filesystem is to avoid bringing a bunch of dependencies into the core.
> > > > >> That makes perfect sense.
> > > > >>
> > > > >> So, I agree that a Beam filesystem abstraction is fine.
> > > > >>
> > > > >> My point is that we should provide a HadoopFilesystem extension/plugin
> > > > >> for Beam filesystem asap: that would help us to support a good range of
> > > > >> filesystems quickly.
> > > >>
> > > >> Just my $0.01 ;)
> > > >>
> > > >> Regards
> > > >> JB
> > > >>
> > > >>
> > > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > > >>
> > > >>> Hi JB,
> > > > >>> My proposals are based on the current IOChannelFactory, and how it is
> > > > >>> used in FileBasedSink.
> > > > >>>
> > > > >>> Let me spend more time investigating the Hadoop FileSystem interface.
> > > >>> --
> > > >>> Pei
> > > >>>
> > > > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > >>> wrote:
> > > >>>
> > > > >>> By the way, Pei, for the record: why introduce BeamFileSystem and not
> > > > >>>> use the Hadoop FileSystem interface?
> > > >>>>
> > > >>>> Thanks
> > > >>>> Regards
> > > >>>> JB
> > > >>>>
> > > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > > >>>>
> > > >>>> Hi,
> > > >>>>>
> > > > >>>>> I am working on BEAM-59
> > > > >>>>> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
> > > > >>>>> redesign". The goals are:
> > > > >>>>>
> > > > >>>>> 1. Support file-based IOs (TextIO, AvroIO) with user-defined file
> > > > >>>>> systems.
> > > > >>>>>
> > > > >>>>> 2. Support configuring any user-defined file system.
> > > > >>>>>
> > > > >>>>> And, I drafted the design proposal in two parts to address them in
> > > > >>>>> order:
> > > >>>>>
> > > > >>>>> Part 1: IOChannelFactory Redesign
> > > > >>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>
> > > >>>>>
> > > >>>>> Summary:
> > > >>>>>
> > > > >>>>> Old API: WritableByteChannel create(String spec, String mimeType);
> > > > >>>>>
> > > > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions options);
> > > >>>>>
> > > > >>>>> Noticeable proposed changes:
> > > > >>>>>
> > > > >>>>>    1. Include the options parameter in most methods to specify
> > > > >>>>>       behaviors.
> > > > >>>>>    2. Replace String with URI to include the scheme for file/directory
> > > > >>>>>       locations.
> > > > >>>>>    3. Require file systems to provide a SeekableByteChannel for reads.
> > > > >>>>>    4. Add methods such as getMetadata(), rename(), etc.
> > > >>>>>
> > > >>>>>
> > > > >>>>> Part 2: Configurable BeamFileSystem
> > > > >>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > > >>>>>
> > > >>>>> Summary:
> > > >>>>>
> > > > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > > > >>>>>
> > > > >>>>> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
> > > >>>>>
> > > >>>>>
> > > >>>>> Looking for comments and feedback.
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>> --
> > > >>>>>
> > > >>>>> Pei
> > > >>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>> Jean-Baptiste Onofré
> > > >>>> jbono...@apache.org
> > > >>>> http://blog.nanthrax.net
> > > >>>> Talend - http://www.talend.com
> > > >>>>
> > > >>>>
> > > >>>
> > > >> --
> > > >> Jean-Baptiste Onofré
> > > >> jbono...@apache.org
> > > >> http://blog.nanthrax.net
> > > >> Talend - http://www.talend.com
> > > >>
> > > >
> > > >
> > >
> >
>
