Re: I'm back and ready to help grow our community!

2018-05-17 Thread Dmitry Demeshchuk
While this may be a bit off topic, I still want to say this.

Congratulations on your graduation, Gris!

On Thu, May 17, 2018 at 2:19 PM, Griselda Cuevas <g...@google.com> wrote:

> Hi Everyone,
>
>
> I was absent from the mailing list, Slack channel and our Beam community
> for the past six weeks; the reason was that I took a leave to focus on
> finishing my Master's degree, which I finally did on May 15th.
>
>
> I graduated with a Master of Engineering in Operations Research with a
> concentration in Data Science from UC Berkeley. I'm glad to be part of this
> community and I'd like to share this accomplishment with you so I'm adding
> two pictures of that day :)
>
>
> Given that I've seen so many new folks around, I'd like to use this
> opportunity to re-introduce myself. I'm Gris Cuevas and I work at Google.
> Now that I'm back, I'll continue to work on supporting our community in two
> main streams: Contribution Experience & Events, Meetups, and Conferences.
>
>
> It's good to be back and I look forward to collaborating with you.
>
>
> Cheers,
>
> Gris
>



-- 
Best regards,
Dmitry Demeshchuk.


Re: Strata Conference this March 6-8

2018-01-16 Thread Dmitry Demeshchuk
Probably won't be attending the conference, but totally down for a BoF.

On Tue, Jan 16, 2018 at 4:58 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Do interested folks have any timing constraints around a BoF?
>
> On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson <je...@bigdatainstitute.io
> > wrote:
>
>> +1 to BoF. I don't know if any Beam talks will be on the schedule.
>>
>> > We could do an informal BoF at the Philz nearby or similar?
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Best regards,
Dmitry Demeshchuk.


Re: Streaming support available on Beam Python DirectRunner

2017-07-12 Thread Dmitry Demeshchuk
Awesome work!

Please sign me up for the Python Streaming Alpha; we've been looking forward
to SplittableDoFn support landing.

In fact, we would be glad to assist in completing it faster, if there's
some grunt work you can hand off.
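
For anyone else who wants to kick the tires, here is roughly what I expect
trying the new streaming mode locally to look like, based on the description
below. This is an untested sketch: the topic name is made up, and the exact
Pub/Sub transform name may differ between SDK versions.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
# Equivalent to passing --streaming on the command line; with no runner
# specified this should run on the local DirectRunner.
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | 'decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     | 'split' >> beam.FlatMap(str.split)
     | 'pair' >> beam.Map(lambda word: (word, 1))
     | 'window' >> beam.WindowInto(FixedWindows(15))
     | 'count' >> beam.CombinePerKey(sum)
     | 'print' >> beam.Map(print))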

On Wed, Jul 12, 2017 at 10:07 AM, Charles Chen <c...@google.com.invalid>
wrote:

> We recently checked in the last few changes needed to support streaming
> pipelines on the Beam Python DirectRunner (BEAM-1265
> <https://issues.apache.org/jira/browse/BEAM-1265>).  As of HEAD (1-2 weeks
> ago) and the 2.1.0 RC, Python SDK users can now write their pipelines in
> streaming mode and run them locally on their own machine.
>
> Check out the streaming wordcount example here (streaming_wordcount.py
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py>)
> and please kick the tires, try out the new functionality and report any
> bugs you may encounter.  Use the "--streaming" PipelineOption to enable
> this new functionality.
>
> Currently, the I/Os supported are the TestStream
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/test_stream.py>
> and
> Google Cloud PubSub I/O.  Chamikara is working on implementing
> SplittableDoFn as the Python streaming source API so that it will be easy
> to write new streaming sources.  Python streaming support for other runners
> like Cloud Dataflow and Flink will be provided through the FnAPI (please
> contact me if you would be interested in joining the Python Streaming Alpha
> for Google Cloud Dataflow).
>
> For reference, here are some of the relevant PRs checked in for this
> effort:
>
> https://github.com/apache/beam/pull/3318
> https://github.com/apache/beam/pull/3362
> https://github.com/apache/beam/pull/3370
> https://github.com/apache/beam/pull/3405
> https://github.com/apache/beam/pull/3409
> https://github.com/apache/beam/pull/3440
> https://github.com/apache/beam/pull/3444
> https://github.com/apache/beam/pull/3499
>
> Best,
> Charles
>



-- 
Best regards,
Dmitry Demeshchuk.


Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Dmitry Demeshchuk
Yeah, I think the original point about ValueProviders was to raise my
awareness of separation between pipeline build time and run time. Indeed,
whether we use ValueProviders or not, we would still need to figure out a
way to get the actual credentials values into the FileSystem object.

This may also be tangential, but it seems like I might be raising the wrong
problem here: instead of discussing pipeline options visibility, we should be
discussing the problem of accessing a specific cloud provider, maybe just from
the filesystem perspective, maybe from both the filesystem and other
sources/sinks perspectives. And the ways of configuring global options for
PTransforms may not have much to do with it.
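
To make the shape of the problem concrete, here is the kind of thing I have
in mind. This is purely an illustrative sketch: nothing like it exists in the
Python SDK today, and the AwsOptions/S3FileSystem names and flags are made up.

from apache_beam.options.pipeline_options import PipelineOptions

class AwsOptions(PipelineOptions):
    # Hypothetical options view: a provider's credentials declared once.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--aws_access_key_id', default=None)
        parser.add_argument('--aws_secret_access_key', default=None)

class S3FileSystem(object):
    # Hypothetical filesystem, not Beam's FileSystem API. The open question
    # is how an object like this would ever be handed the pipeline options.
    def __init__(self, pipeline_options):
        aws = pipeline_options.view_as(AwsOptions)
        self._access_key = aws.aws_access_key_id
        self._secret_key = aws.aws_secret_access_key

opts = PipelineOptions(['--aws_access_key_id=AKIA...',
                        '--aws_secret_access_key=...'])
fs = S3FileSystem(opts)

Whether the runner should construct filesystems with the options, or whether
each IO transform should accept credentials explicitly, is exactly the part
I'm unsure about.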

On Tue, Jul 11, 2017 at 5:01 PM, Sourabh Bajaj <
sourabhba...@google.com.invalid> wrote:

> I'm not sure ValueProviders address the issue of getting credentials to
> underlying libraries or FileSystem though as they are only exposed at the
> PTransform level.
>
> Eg. If I was using Flink on AWS and reading data from GCS we currently
> don't have a way for TextIO to get credentials it can use to read from GCS.
> We just rely on other libraries for doing that work, and they assume you have
> the gcloud tool installed. This is partially caused by TextIO not exposing
> an option to pass an extra credential object when accessing the FileSystem.
>
> On a tangential note we currently rely on credentials being passed as part
> of the serialized object such as in the JdbcIO; the password is just part
> of the connection string and then serialized with the DoFn itself. It might
> be worth considering exposing a credential provider system similar to value
> providers (or a type of value provider) where one could use a KMS if they
> choose to.
>
> On Tue, Jul 11, 2017 at 4:49 PM Sourabh Bajaj <sourabhba...@google.com>
> wrote:
>
> > We do the latter of treating constants as StaticValueProviders in the
> > pipeline right now.
> >
> > On Tue, Jul 11, 2017 at 4:47 PM Dmitry Demeshchuk <dmi...@postmates.com>
> > wrote:
> >
> >> Thanks a lot for the input, folks!
> >>
> >> Also, thanks for telling me about the concept of ValueProvider, Kenneth!
> >> This was a good reminder to myself that some stuff that's described in
> the
> >> Dataflow docs (I discovered
> >> https://cloud.google.com/dataflow/docs/templates/creating-templates
> after
> >> having read your reply) doesn't necessarily exist in the Beam
> >> documentation.
> >>
> >> I do agree with Thomas' (and Robert's, in the JIRA bug) point that we
> may
> >> often want to supply separate credentials for separate steps. It
> increases
> >> the verbosity, and raises a question of what to do about filesystems
> >> (ReadFromText and WriteToText), but it also has a lot of value.
> >>
> >> As for accessing pipeline options, what if PTransforms treated pipeline
> >> options as a NestedValueProvider of sorts?
> >>
> >> class MyDoFn(beam.DoFn):
> >>     def process(self, item):
> >>         # We fetch pipeline options at runtime
> >>         # or, it could look like opts = self.pipeline_options()
> >>         opts = self.pipeline_options.get()
> >>
> >> Alternatively, we could treat each individual option as a ValueProvider
> >> object, even if it's really just a constant.
> >>
> >>
> >> On Tue, Jul 11, 2017 at 4:00 PM, Robert Bradshaw <
> >> rober...@google.com.invalid> wrote:
> >>
> >> > Templates, including ValueProviders, were recently added to the Python
> >> > SDK. +1 to pursuing this train of thought (and as I mentioned on the
> >> > bug, and has been mentioned here, we don't want to add PipelineOptions
> >> > access to PTransforms/at construction time).
> >> >
> >> > On Tue, Jul 11, 2017 at 3:21 PM, Kenneth Knowles
> <k...@google.com.invalid
> >> >
> >> > wrote:
> >> > > Hi Dmitry,
> >> > >
> >> > > This is a very worthwhile discussion that has recently come up on
> >> > > StackOverflow, here: https://stackoverflow.com/a/45024542/4820657
> >> > >
> >> > > We actually recently _removed_ the PipelineOptions from
> >> Pipeline.apply in
> >> > > Java since they tend to cause transforms to have implicit changes
> that
> >> > make
> >> > > them non-portable. Baking in credentials would probably fall into
> this
> >> > > category.
> >> > >
> >> > > The other aspect to thi

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Dmitry Demeshchuk
Thanks a lot for the input, folks!

Also, thanks for telling me about the concept of ValueProvider, Kenneth!
This was a good reminder to myself that some stuff that's described in the
Dataflow docs (I discovered
https://cloud.google.com/dataflow/docs/templates/creating-templates after
having read your reply) doesn't necessarily exist in the Beam documentation.

I do agree with Thomas' (and Robert's, in the JIRA bug) point that we may
often want to supply separate credentials for separate steps. It increases
the verbosity, and raises a question of what to do about filesystems
(ReadFromText and WriteToText), but it also has a lot of value.

As for accessing pipeline options, what if PTransforms treated pipeline
options as a NestedValueProvider of sorts?

class MyDoFn(beam.DoFn):
    def process(self, item):
        # We fetch pipeline options at runtime
        # or, it could look like opts = self.pipeline_options()
        opts = self.pipeline_options.get()

Alternatively, we could treat each individual option as a ValueProvider
object, even if it's really just a constant.
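
Roughly, I'm imagining something along these lines. This is just a sketch on
top of the existing ValueProvider machinery; the option and class names are
made up for illustration.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.value_provider import StaticValueProvider

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # A deferred option: its value is only resolved at run time.
        parser.add_value_provider_argument('--api_token', type=str)

class TokenDoFn(beam.DoFn):
    def __init__(self, token):
        # token is a ValueProvider here, not a plain string.
        self._token = token

    def process(self, item):
        # .get() is only safe at run time (inside process), not at
        # pipeline construction time.
        yield (item, self._token.get())

opts = MyOptions(['--api_token', 'secret'])
do_fn = TokenDoFn(opts.api_token)

# A plain constant can be wrapped the same way:
do_fn_static = TokenDoFn(StaticValueProvider(str, 'secret'))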


On Tue, Jul 11, 2017 at 4:00 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> Templates, including ValueProviders, were recently added to the Python
> SDK. +1 to pursuing this train of thought (and as I mentioned on the
> bug, and has been mentioned here, we don't want to add PipelineOptions
> access to PTransforms/at construction time).
>
> On Tue, Jul 11, 2017 at 3:21 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
> > Hi Dmitry,
> >
> > This is a very worthwhile discussion that has recently come up on
> > StackOverflow, here: https://stackoverflow.com/a/45024542/4820657
> >
> > We actually recently _removed_ the PipelineOptions from Pipeline.apply in
> > Java since they tend to cause transforms to have implicit changes that
> make
> > them non-portable. Baking in credentials would probably fall into this
> > category.
> >
> > The other aspect to this is that we want to be able to build a pipeline
> and
> > run it later, in an environment chosen when we decide to run it. So
> > PipelineOptions are really for running, not building, a Pipeline. You can
> > still use them for arg parsing and passing specific values to transforms
> -
> > that is essentially orthogonal and just accidentally conflated.
> >
> > I can't speak to the state of Python SDK's maturity in this regard, but
> > there is a concept of a "ValueProvider" that is a deferred value that can
> > be specified by PipelineOptions when you run your pipeline. This may be
> > what you want. You build a PTransform passing some of its configuration
> > parameters as ValueProvider and at run time you set them to actual values
> > that are passed to the UDFs in your pipeline.
> >
> > Hope this helps. Despite not being deeply involved in Python, I wanted to
> > lay out the territory so someone else could comment further without
> having
> > to go into background.
> >
> > Kenn
> >
> > On Tue, Jul 11, 2017 at 3:03 PM, Dmitry Demeshchuk <dmi...@postmates.com
> >
> > wrote:
> >
> >> Hi folks,
> >>
> >> Sometimes, it would be very useful if PTransforms had access to global
> >> pipeline options, such as various credentials, settings and so on.
> >>
> >> Per conversation in https://issues.apache.org/jira/browse/BEAM-2572,
> I'd
> >> like to kick off a discussion about that.
> >>
> >> This would be beneficial for at least one major use case: support for
> >> different cloud providers (AWS, Azure, etc) and an ability to specify
> each
> >> provider's credentials just once in the pipeline options.
> >>
> >> It looks like the trickiest part is not to make the PTransform objects
> have
> >> access to pipeline options (we could possibly just modify the
> >> Pipeline.apply
> >> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L355>
> >> method), but to actually pass these options down the road, such as to
> DoFn
> >> objects and FileSystem objects.
> >>
> >> I'm still in the process of reading the code and understanding of what
> this
> >> could look like, so any input would be really appreciated.
> >>
> >> Thank you.
> >>
> >> --
> >> Best regards,
> >> Dmitry Demeshchuk.
> >>
>



-- 
Best regards,
Dmitry Demeshchuk.


Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Dmitry Demeshchuk
Hi folks,

Sometimes, it would be very useful if PTransforms had access to global
pipeline options, such as various credentials, settings and so on.

Per conversation in https://issues.apache.org/jira/browse/BEAM-2572, I'd
like to kick off a discussion about that.

This would be beneficial for at least one major use case: support for
different cloud providers (AWS, Azure, etc) and an ability to specify each
provider's credentials just once in the pipeline options.
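
To illustrate the pain point from a pipeline author's point of view, this is
roughly what we have to do today: thread credentials through every transform
by hand. The UploadToS3 DoFn below is hypothetical and made up purely for
illustration.

import apache_beam as beam

class UploadToS3(beam.DoFn):
    def __init__(self, access_key, secret_key):
        # Credentials have to be passed into every AWS-touching DoFn.
        self._access_key = access_key
        self._secret_key = secret_key

    def process(self, element):
        # ... use self._access_key / self._secret_key with boto here ...
        yield element

with beam.Pipeline() as p:
    (p
     | beam.Create(['a.csv', 'b.csv'])
     # Credentials repeated at every step that touches AWS, instead of
     # being declared once in the pipeline options:
     | beam.ParDo(UploadToS3('AKIA...', 'wJal...')))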

It looks like the trickiest part is not giving the PTransform objects access
to pipeline options (we could possibly just modify the Pipeline.apply
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L355>
method), but actually passing these options down the road, such as to DoFn
objects and FileSystem objects.

I'm still in the process of reading the code and understanding what this
could look like, so any input would be really appreciated.

Thank you.

-- 
Best regards,
Dmitry Demeshchuk.


Re: Docs/guidelines on writing filesystem sources and sinks

2017-07-06 Thread Dmitry Demeshchuk
Hi Stephen,

Thanks for the detailed reply!

Some comments inline.

On Thu, Jul 6, 2017 at 5:21 PM, Stephen Sisk <s...@google.com> wrote:

> Hi Dmitry,
>
> I'm excited to hear that you'd like to do this work. If you haven't
> already, I'd first suggest that you open a JIRA issue to make sure other
> folks know you're working on this.
>

Will do tomorrow, thanks for the suggestion. The code is currently not a
part of Beam, but I'd be more than happy to push it upstream.


>
> I was involved in working on the recent java HDFS file system
> implementation, so I'll try and share what I know - I suspect knowledge
> about this is scattered around a bit, so hopefully others will chime in as
> well.
>
> > 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
> I don't know of any guides for writing IOs. I believe folks should be
> helpful here on the mailing list for specific questions, but there aren't
> that many that are experts in file system implementations. It's not
> expected to be a frequent task, so no one has tried to document it (it also
> means your contribution will have a wide impact!) If you wanted to write up
> your notes from the process, it'd likely be highly helpful to others.
>
> https://issues.apache.org/jira/browse/BEAM-2005 documents the work that
> we did to add the java Hadoop FileSystem implementation, so that might be a
> good guide - it has links to PRs, you can find out about design questions
> that came up there, etc.. The Hadoop FileSystem is relatively new, so
> reviewing its commit history may be very informative.
>

I'll check it out, thanks! The main reason I'm looking for more concrete
guidelines is that a lot of the internal filesystem-related mechanics are not
obvious at all: for example, the fact that a temporary file is created first
and then moved to its final destination. Some of the functions in my
implementation are suboptimal or don't do anything, because they didn't seem
immediately useful, but given the complexity of the higher-level usage of
FileSystem subclasses I'm likely making some mistakes right now.
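
To illustrate, the write pattern as I understand it looks roughly like this.
This is my own simplified sketch of the idea, not actual Beam code; fs stands
in for a FileSystem-like object with create/rename methods.

import uuid

def write_atomically(fs, final_path, data):
    # Each bundle is first written to a uniquely named temporary file...
    tmp_path = '%s.tmp-%s' % (final_path, uuid.uuid4().hex)
    with fs.create(tmp_path) as handle:
        handle.write(data)
    # ...and only moved to its final destination once the write has fully
    # succeeded, so failed or retried bundles never leave half-written output.
    fs.rename([tmp_path], [final_path])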


>
> > 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.
>
> I don't know of any. If you put together a test plan, we'd be happy to
> discuss it. The tests for the java Hadoop FileSystem represent the current
> thinking, but could likely be expanded on.
>

I can try thinking of something, but, on second thought, different
filesystems have different characteristics and guarantees, so the same tests
that pass for HDFS may not necessarily pass for S3 (due to its eventual
consistency), and I'm sure Google Cloud Storage and the local filesystem will
also have their own quirks. My hope was that some kind of plan already
existed, but it looks like that's not the case, and now I can see why.

I'll try to reflect on this idea and see if I can pull together a doc with
at least some basic acceptance tests and ways to apply them to the existing
filesystems. Will start a new thread if/when I end up doing that.


>
> > 3. Are there any established ideas of how to pass AWS credentials to
> Beam for making the S3 filesystem actually work?
>
> Looks like you already found the past discussions of this on the mailing
> list, that was what I would refer you to.
>
> > I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem,
> We had a similar problem with the hadoop configuration object - inside of
> the hadoop filesystem registrar, we read the pipeline options to see if
> there is configuration info there, as well as some default hadoop
> configuration file locations. See
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
>

Thanks, that's actually the ideal approach for me! I wasn't sure if
pipeline options were accessible from inside transformations, but it looks
like they are. This makes a really good case for supporting the entire AWS
stack conveniently by providing some extra pipeline option, like
"aws_config" or something.


>
> The python folks will have to comment if that's the type of solution they
> want you to use though.
>
> I hope this helps!
>
> Stephen
>
>
> On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <dmi...@postmates.com>
> wrote:
>
>> I also stumbled upon a problem that I can't really pass additional
>> configuration to a filesystem, e.g.
>>
>> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
>> aws_config=AWSConf

Re: Docs/guidelines on writing filesystem sources and sinks

2017-07-06 Thread Dmitry Demeshchuk
I also stumbled upon a problem: I can't really pass additional
configuration to a filesystem, e.g.

lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
                                          aws_config=AWSConfig())

because the ReadFromText class relies on PTransform's constructor, which
has a pre-defined set of arguments.

This is probably becoming a cross-post to the dev list (have I added it
in the right way?)

On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <dmi...@postmates.com>
wrote:

> Hi folks,
>
> I'm working on an S3 filesystem for the Python SDK, which already works in
> case of a happy path for both reading and writing, but I feel like there
> are quite a few edge cases that I'm likely missing.
>
> So far, my approach has been: "look at the generic FileSystem
> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
> to copy their approach as much as possible, at least for getting to the
> proof of concept".
>
> That said, I'd like to know a few things:
>
> 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
>
> 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.
>
> 3. Are there any established ideas of how to pass AWS credentials to Beam
> for making the S3 filesystem actually work? I currently rely on the
> existing environment variables, which boto just picks up, but it sounds like
> setting them up in runners like Dataflow or Spark would be troublesome.
> I've seen this discussion a couple of times on the list, but couldn't tell if
> any closure was found. My personal preference would be having AWS settings
> passed in some global context (pipeline options, perhaps?), but there may
> be exceptions to that (say, people want to use different credentials for
> different AWS operations).
>
> Thanks!
>
> --
> Best regards,
> Dmitry Demeshchuk.
>



-- 
Best regards,
Dmitry Demeshchuk.