Re: Pipeline AttributeError on Python3

2019-11-21 Thread Thomas Weise
We are currently verifying the patch. Will report back tomorrow.

On Thu, Nov 21, 2019 at 8:40 AM Valentyn Tymofieiev 
wrote:

> That would be helpful, thanks a lot! It should be a straightforward patch.
> Also, thanks Guenther, for sharing your investigation on
> https://bugs.python.org/issue34572, it was very helpful.
>
> On Thu, Nov 21, 2019 at 8:25 AM Thomas Weise  wrote:
>
>> Valentyn, thanks a lot for following up on this.
>>
>> If the change can be cherry picked in isolation, we should be able to
>> verify this soon (with 2.16).
>>
>>
>> On Thu, Nov 21, 2019 at 8:12 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> To close the loop here: To my knowledge this issue affects all Python 3
>>> users of Portable Flink/Spark runners, and Dataflow Python Streaming users,
>>> including users on Python 3.7.3 and newer versions.
>>>
>>> The issue is addressed on Beam master, and we have a cherry-pick out for
>>> Beam 2.17.0.
>>>
>>> Workaround options for users on 2.16.0 and earlier SDKs:
>>>
>>> - Patch the SDK you are using with
>>> https://github.com/apache/beam/pull/10167.
>>> - Temporarily switch to Python 2 until 2.17.0. We have not seen the
>>> issue on Python 2, so it may be rare or non-existent on Python 2.
>>> - Pass --experiments worker_threads=1. This option may work for
>>> some, but not all, pipelines.
>>>
>>> See BEAM-8651 <https://issues.apache.org/jira/browse/BEAM-8651> for
>>> details on the issue.
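To illustrate the last workaround concretely, the experiment is passed on the
pipeline's command line; the script name and job endpoint below are
placeholders, not values from this thread:

```shell
# Placeholder script and endpoint -- only the --experiments flag matters here.
python my_pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --experiments worker_threads=1
```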
>>>
>>> On Wed, Nov 13, 2019 at 11:55 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
>>>> I also opened https://issues.apache.org/jira/browse/BEAM-8651 to track
>>>> this issue and any recommendation for the users that will come out of it.
>>>>
>>>> On Thu, Nov 7, 2019 at 6:25 PM Valentyn Tymofieiev 
>>>> wrote:
>>>>
>>>>>  I think we have heard of this issue from the same source:
>>>>>
>>>>> This looks exactly like a race condition that we've encountered on
>>>>>> Python 3.7.1: There's a bug in some older 3.7.x releases that breaks the
>>>>>> thread-safety of the unpickler, as concurrent unpickle threads can 
>>>>>> access a
>>>>>> module before it has been fully imported. See
>>>>>> https://bugs.python.org/issue34572 for more information.
>>>>>>
>>>>>> The traceback shows a Python 3.6 venv so this could be a different
>>>>>> issue (the unpickle bug was introduced in version 3.7). If it's the same
>>>>>> bug then upgrading to Python 3.7.3 or higher should fix that issue. One
>>>>>> potential workaround is to ensure that all of the modules get imported
>>>>>> during the initialization of the sdk_worker, as this bug only affects
>>>>>> imports done by the unpickler.
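The last workaround quoted above -- importing everything up front so the
unpickler never performs a first-time import -- can be sketched in plain
Python. The module list here is illustrative; in a real pipeline you would
enumerate the modules your pickled DoFns actually reference:

```python
import importlib
import sys

# Illustrative module list -- in practice, list the modules your pickled
# DoFns reference so none of them is first imported by the unpickler.
MODULES_TO_PRELOAD = ["json", "decimal", "logging"]

def preload(modules):
    # Eagerly import in the main thread, before any worker threads start
    # unpickling; this sidesteps the concurrent-import race of
    # https://bugs.python.org/issue34572 on affected interpreters.
    for name in modules:
        importlib.import_module(name)

preload(MODULES_TO_PRELOAD)
```

After the call, every listed module is present in sys.modules, so later
unpickling only hits already-initialized modules.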
>>>>>
>>>>>
>>>>> The symptoms do sound similar, so I would try to reproduce your issue
>>>>> on 3.7.3 and see if it is gone, or try to reproduce
>>>>> https://bugs.python.org/issue34572 in the version of interpreter you
>>>>> use. If this doesn't help, you can try to reproduce the race using your
>>>>> input.
>>>>>
>>>>> To get the output of the serialized DoFn, you could do the following:
>>>>> 1. Patch https://github.com/apache/beam/pull/10036.
>>>>> 2. Set logging level to DEBUG, see:
>>>>> https://github.com/apache/beam/blob/90d587843172143c15ed392513e396b74569a98c/sdks/python/apache_beam/examples/wordcount.py#L137
>>>>> .
>>>>> 3. Check for log output for payload of your transform, it may look
>>>>> like:
>>>>>
>>>>> transforms {
>>>>>   key: "ref_AppliedPTransform_write/Write/WriteImpl/PreFinalize_42"
>>>>>   value {
>>>>> spec {
>>>>>   urn: "beam:transform:pardo:v1"
>>>>>   payload: "\n\347\006\n\275\006\n
>>>>> beam:dofn:pickled_python_info:v1\032\230\006eNptU1tPFTEQPgqIFFT
>>>>> 
>>>>>
>>>>> Then you can extract the output of pickled fn:
>>>>>
>>>>> from apache_beam.utils import proto_utils
>>>>> from apache_beam.portability.api import beam_runner_api_pb2
>>>>> from apache_beam.internal import pickler
>>>>>
>>>>> # Sketch only -- exact proto field paths vary across Beam versions;
>>>>> # payload_bytes stands for the payload bytes logged above.
>>>>> pardo = proto_utils.parse_Bytes(payload_bytes, beam_runner_api_pb2.ParDoPayload)
>>>>> fn = pickler.loads(pardo.do_fn.spec.payload)

Re: slides?

2019-11-15 Thread Thomas Weise
It would be great to have an index for those materials.

Maybe as cwiki page, which is easy to edit and watch. Similar to:

https://cwiki.apache.org/confluence/display/BEAM/Design+Documents

Thomas


On Fri, Nov 15, 2019 at 10:52 AM Kenneth Knowles  wrote:

> We have a section for this:
> https://beam.apache.org/community/presentation-materials/.
>
> Right now "Presentation Materials" has the appearance of carefully curated
> stuff from a core team. That was probably true three years ago, but now it
> is simply out of date. A lot of the material is so old that it is actually
> incorrect. It would be good to invite more people to maintain this.
>
> Perhaps:
>
>  - "Presentation Materials" -> "Presentations"
>  - Replace the Google Drive link with a readable HTML page with a (possibly
> large) archive of events and slides, with licenses so new folks can
> copy/paste/modify to kick-start their talk
>
> This has the added benefit that there is a clear date and author on each
> piece, so you know how old the material is and can put it in context. And
> link to video of people presenting with the deck, too. That can be better
> than long written speaker notes.
>
> Kenn
>
> On Thu, Nov 14, 2019 at 10:00 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Hi Dev and User,
>>
>> Wondering if people would find a benefit from collecting slides from
>> Meetups/Talks?
>>
>> Seems that this could be appropriate on the website, for instance.  Not
>> sure whether this has been asked previously, so bringing it to the group.
>>
>> Cheers,
>> Austin
>>
>


Re: The state of external transforms in Beam

2019-11-03 Thread Thomas Weise
This thread was very helpful to find more detail in
https://jira.apache.org/jira/browse/BEAM-7870

It would be great to have the current state of cross-language support as a
top-level entry on https://beam.apache.org/roadmap/


On Mon, Sep 16, 2019 at 6:07 PM Chamikara Jayalath 
wrote:

> Thanks for the nice write up Chad.
>
> On Mon, Sep 16, 2019 at 12:17 PM Robert Bradshaw 
> wrote:
>
>> Thanks for bringing this up again. My thoughts on the open questions
>> below.
>>
>> On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova  wrote:
>> > That commit solves 2 problems:
>> >
>> > Adds the pubsub Java deps so that they’re available in our portable
>> pipeline
>> > Makes the coder for the PubsubIO message-holder type, PubsubMessage,
>> available as a standard coder. This is required because both PubsubIO.Read
>> and PubsubIO.Write expand to ParDos which pass along these PubsubMessage
>> objects, but only “standard” (i.e. portable) coders can be used, so we have
>> to hack it to make PubsubMessage appear as a standard coder.
>> >
>> > More details:
>> >
>> > There’s a similar magic commit required for Kafka external transforms
>> > The Jira issue for this problem is here:
>> https://jira.apache.org/jira/browse/BEAM-7870
>> > For problem #2 above there seems to be some consensus forming around
>> using Avro or schema/row coders to send compound types in a portable way.
>> Here’s the PR for making row coders portable
>> > https://github.com/apache/beam/pull/9188
>>
>> +1. Note that this doesn't mean that the IO itself must produce rows;
>> part of the Schema work in Java is to make it easy to automatically
>> convert from various Java classes to schemas transparently, so this
>> same logic that would allow one to apply an SQL filter directly to a
>> Kafka/PubSub read would allow cross-language. Even if that doesn't
>> work, we need not uglify the Java API; we can have an
>> option/alternative transform that appends the convert-to-Row DoFn for
>> easier use by external (though the goal of the former work is to make
>> this step unnecessary).
>>
>
> Updating all IO connectors / transforms to have a version that
> produces/consumes a PCollection is infeasible, so I agree that we need
> an automatic conversion to/from PCollection, possibly by injecting
> PTransforms during ExternalTransform expansion.
>
>>
>> > I don’t really have any ideas for problem #1
>>
>> The crux of the issue here is that the jobs API was not designed with
>> cross-language in mind, and so the artifact API ties artifacts to jobs
>> rather than to environments. To solve this we need to augment the
>> notion of environment to allow the specification of additional
>> dependencies (e.g. jar files in this specific case, or better as
>> maven/pypi/... dependencies (with version ranges) such that
>> environment merging and dependency resolution can be sanely done), and
>> a way for the expansion service to provide such dependencies.
>>
>> Max wrote up a summary of the prior discussions at
>>
>> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit#heading=h.900gc947qrw8
>>
>> In the short term, one can build a custom docker image that has all
>> the requisite dependencies installed.
>>
>> This touches on a related but separable issue that one may want to run
>> some of these transforms "natively" in the same process as the runner
>> (e.g. a Java IO in the Flink Java Runner) rather than via docker.
>> (Similarly with subprocess.) Exactly how that works with environment
>> specifications is also a bit TBD, but my proposal has been that these
>> are best viewed as runner-specific substitutions of standard
>> environments.
>>
>
> We need a permanent solution for this but for now we have a temporary
> solution where additional jar files can be specified through an experiment
> when running a Python pipeline:
> https://github.com/apache/beam/blob/9678149872de2799ea1643f834f2bec88d346af8/sdks/python/apache_beam/io/external/xlang_parquetio_test.py#L55
>
> Thanks,
> Cham
>
>
>>
>> > So the portability expansion system works, and now it’s time to sand
>> off some of the rough corners. I’d love to hear others’ thoughts on how to
>> resolve some of these remaining issues.
>>
>> +1
>>
>>
>> On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova  wrote:
>> >
>> > Hi all,
>> > There was some interest in this topic at the Beam Summit this week
>> (btw, great job to everyone involved!), so I thought I’d try to summarize
>> the current state of things.
>> > First, let me explain the idea behind external transforms for the
>> uninitiated.
>> >
>> > Problem:
>> >
>> > there’s a transform that you want to use, but it’s not available in
>> your desired language. IO connectors are a good example: there are many
>> available in the Java SDK, but not so many in Python or Go.
>> >
>> > Solution:
>> >
>> > Create a stub transform in your desired language (e.g. Python) whose
>> primary role is to serialize the parameters passed to that transform
>> > When you run 

Re: How do you write portable runner pipeline on separate python code ?

2019-09-13 Thread Thomas Weise
I agree that loopback would be preferable for this purpose. I just wasn't
aware this even works with the portable Flink runner. Is it one of the best
guarded secrets? ;-)

Kyle, can you please post the pipeline options you would use for Flink?
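
(For reference, a loopback invocation along the lines Kyle describes below
typically looks like the following; the job endpoint value is illustrative
and depends on how the Flink job server was started:)

```shell
# Illustrative flags for the portable Flink runner with the loopback environment.
python wordcount.py \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=LOOPBACK
```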


On Thu, Sep 12, 2019 at 5:57 PM Kyle Weaver  wrote:

> I prefer loopback because a) it writes output files to the local
> filesystem, as the user expects, and b) you don't have to pull or build
> docker images, or even have docker installed on your system -- which is one
> less point of failure.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Thu, Sep 12, 2019 at 5:48 PM Thomas Weise  wrote:
>
>> This should become much better with 2.16 when we have the Docker images
>> prebuilt.
>>
>> Docker is probably still the best option for Python on a JVM based runner
>> in a local environment that does not have a development setup.
>>
>>
>> On Thu, Sep 12, 2019 at 1:09 PM Kyle Weaver  wrote:
>>
>>> +dev  I think we should probably point new users
>>> of the portable Flink/Spark runners to use loopback or some other
>>> non-docker environment, as Docker adds some operational complexity that
>>> isn't really needed to run a word count example. For example, Yu's pipeline
>>> errored here because the expected Docker container wasn't built before
>>> running.
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>>
>>>
>>> On Thu, Sep 12, 2019 at 11:27 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> On this note, making local files easy to read is something we'd
>>>> definitely like to improve, as the current behavior is quite surprising.
>>>> This could be useful not just for running with docker and the portable
>>>> runner locally, but more generally when running on a distributed system
>>>> (e.g. a Flink/Spark cluster or Dataflow). It would be very convenient if we
>>>> could automatically stage local files to be read as artifacts that could be
>>>> consumed by any worker (possibly via external directory mounting in the
>>>> local docker case rather than an actual copy), and conversely copy small
>>>> outputs back to the local machine (with the similar optimization for local
>>>> docker).
>>>>
>>>> At the very least, however, we should add obvious messaging when the
>>>> local filesystem is used from within docker, which is often a
>>>> (non-obvious and hard to debug) mistake.
>>>>
>>>>
>>>> On Thu, Sep 12, 2019 at 10:34 AM Lukasz Cwik  wrote:
>>>>
>>>>> When you use a local filesystem path and a docker environment, "/tmp"
>>>>> is written inside the container. You can solve this issue by:
>>>>> * Using a "remote" filesystem such as HDFS/S3/GCS/...
>>>>> * Mounting an external directory into the container so that any
>>>>> "local" writes appear outside the container
>>>>> * Using a non-docker environment such as external or process.
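The second option above can be sketched as follows; the image name, tag, and
paths are illustrative assumptions, not the exact ones from this thread:

```shell
# Mount a host directory over /tmp so "local" writes inside the SDK container
# are visible on the host (image name and tag are assumptions).
mkdir -p /tmp/beam-output
docker run -v /tmp/beam-output:/tmp apachebeam/python3.7_sdk:2.16.0 ...
```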
>>>>>
>>>>> On Thu, Sep 12, 2019 at 4:51 AM Yu Watanabe 
>>>>> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I would like to ask for help with my sample code using portable
>>>>>> runner using apache flink.
>>>>>> I was able to work out the wordcount.py using this page.
>>>>>>
>>>>>> https://beam.apache.org/roadmap/portability/
>>>>>>
>>>>>> I got below two files under /tmp.
>>>>>>
>>>>>> -rw-r--r-- 1 ywatanabe ywatanabe 185 Sep 12 19:56
>>>>>> py-wordcount-direct-1-of-2
>>>>>> -rw-r--r-- 1 ywatanabe ywatanabe 190 Sep 12 19:56
>>>>>> py-wordcount-direct-0-of-2
>>>>>>
>>>>>> Then I wrote sample code with below steps.
>>>>>>
>>>>>> 1. Install apache_beam using pip3, separate from the source code directory.
>>>>>> 2. Wrote sample code as below and named it
>>>>>> "test-portable-runner.py".  Placed it in a separate directory from the
>>>>>> source code.
>>>>>>
>>>>>> ---
>>>>>> (python) ywatanabe@debian-09-00:~$ ls -ltr
>>>>>> total 16
>>>>>> drw

Re: How do you write portable runner pipeline on separate python code ?

2019-09-12 Thread Thomas Weise
This should become much better with 2.16 when we have the Docker images
prebuilt.

Docker is probably still the best option for Python on a JVM based runner
in a local environment that does not have a development setup.


On Thu, Sep 12, 2019 at 1:09 PM Kyle Weaver  wrote:

> +dev  I think we should probably point new users of
> the portable Flink/Spark runners to use loopback or some other non-docker
> environment, as Docker adds some operational complexity that isn't really
> needed to run a word count example. For example, Yu's pipeline errored here
> because the expected Docker container wasn't built before running.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Thu, Sep 12, 2019 at 11:27 AM Robert Bradshaw 
> wrote:
>
>> On this note, making local files easy to read is something we'd
>> definitely like to improve, as the current behavior is quite surprising.
>> This could be useful not just for running with docker and the portable
>> runner locally, but more generally when running on a distributed system
>> (e.g. a Flink/Spark cluster or Dataflow). It would be very convenient if we
>> could automatically stage local files to be read as artifacts that could be
>> consumed by any worker (possibly via external directory mounting in the
>> local docker case rather than an actual copy), and conversely copy small
>> outputs back to the local machine (with the similar optimization for local
>> docker).
>>
>> At the very least, however, we should add obvious messaging when the
>> local filesystem is used from within docker, which is often a
>> (non-obvious and hard to debug) mistake.
>>
>>
>> On Thu, Sep 12, 2019 at 10:34 AM Lukasz Cwik  wrote:
>>
>>> When you use a local filesystem path and a docker environment, "/tmp" is
>>> written inside the container. You can solve this issue by:
>>> * Using a "remote" filesystem such as HDFS/S3/GCS/...
>>> * Mounting an external directory into the container so that any "local"
>>> writes appear outside the container
>>> * Using a non-docker environment such as external or process.
>>>
>>> On Thu, Sep 12, 2019 at 4:51 AM Yu Watanabe 
>>> wrote:
>>>
 Hello.

 I would like to ask for help with my sample code using portable runner
 using apache flink.
 I was able to work out the wordcount.py using this page.

 https://beam.apache.org/roadmap/portability/

 I got below two files under /tmp.

 -rw-r--r-- 1 ywatanabe ywatanabe 185 Sep 12 19:56
 py-wordcount-direct-1-of-2
 -rw-r--r-- 1 ywatanabe ywatanabe 190 Sep 12 19:56
 py-wordcount-direct-0-of-2

 Then I wrote sample code with below steps.

 1. Install apache_beam using pip3, separate from the source code directory.
 2. Wrote sample code as below and named it "test-portable-runner.py".
 Placed it in a separate directory from the source code.

 ---
 (python) ywatanabe@debian-09-00:~$ ls -ltr
 total 16
 drwxr-xr-x 18 ywatanabe ywatanabe 4096 Sep 12 19:06 beam (<- source
 code directory)
 -rw-r--r--  1 ywatanabe ywatanabe  634 Sep 12 20:25
 test-portable-runner.py

 ---
 3. Executed the code with "python3 test-portable-runner.py"


 ==
 #!/usr/bin/env python3

 import apache_beam as beam
 from apache_beam.options.pipeline_options import PipelineOptions
 from apache_beam.io import WriteToText


 def printMsg(line):

 print("OUTPUT: {0}".format(line))

 return line

 options = PipelineOptions(["--runner=PortableRunner",
 "--job_endpoint=localhost:8099", "--shutdown_sources_on_final_watermark"])

 p = beam.Pipeline(options=options)

 output = ( p | 'create' >> beam.Create(["a", "b", "c"])
  | beam.Map(printMsg)
  )

 output | 'write' >> WriteToText('/tmp/sample.txt')

 ===

 Job seemed to go all the way to the "FINISHED" state.

 ---
 [DataSource (Impulse) (1/1)] INFO
 org.apache.flink.runtime.taskmanager.Task - Registering task at network:
 DataSource (Impulse) (1/1) (9df528f4d493d8c3b62c8be23dc889a8) [DEPLOYING].
 [DataSource (Impulse) (1/1)] INFO
 org.apache.flink.runtime.taskmanager.Task - DataSource (Impulse) (1/1)
 (9df528f4d493d8c3b62c8be23dc889a8) switched from DEPLOYING to RUNNING.
 [flink-akka.actor.default-dispatcher-3] INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource
 

Re: Beam Summits!

2019-03-01 Thread Thomas Weise
Update: organizers are looking for new dates for the Summit in SF,
currently trending towards October.

For Beam Summit Europe see:
https://twitter.com/matthiasbaetens/status/1098854758893273088

On Wed, Jan 23, 2019 at 9:09 PM Austin Bennett 
wrote:

> Hi All,
>
> PMC approval still pending for Summit in SF (so things may change), but
> wanted to get a preliminary CfP out there to start to get a sense of interest
> -- given the targeted dates are approaching.  Much of this
> delay/uncertainty is my fault; I should have done more before the holidays
> and my long vacation from the end of December through mid-January.  This CfP
> will remain open for some time, and upon/after approval will make sure to
> give notice for a CfP deadline.
>
> Please submit talks via:
>
> https://docs.google.com/forms/d/e/1FAIpQLSfD0qhoS2QrDbtK1E85gATGQCgRGKhQcLIkiiAsPW9G_7Um_Q/viewform?usp=sf_link
>
> Would very much encourage anyone that can lead
> hands-on/tutorials/workshops for full day, half-day, focused couple hours,
> etc to apply, as well as any technical talks and/or use cases.  Again,
> tentative dates(s) 3 and 4 April 2019.
>
> Thanks,
> Austin
>
>
> On Mon, Jan 21, 2019 at 7:58 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Hi All,
>>
>> Other projects/Summits like Kafka and Spark offer add-on days to summits
>> for training.  I'm wondering about the appetite/interest for hands-on
>> sessions for working with Beam, and whether we think that'd be helpful.
>> Are there people that would benefit from a "beginning with Beam" day, or a
>> more advanced/specialized session?  This was on the original agenda for
>> London, but didn't materialize; I'm seeing if there is enough interest to
>> make this worth putting together.
>>
>> Furthermore, it had been mentioned that an introduction to contributing
>> to Beam might also be beneficial.  Also curious to hear whether that would
>> be of interest to people here (or to those that people here know but who
>> aren't following these distribution channels themselves -- since
>> following dev@ or even user@ is potentially a more focused selection of
>> those with an interest in Beam).
>>
>> Thanks,
>> Austin
>>
>>
>>
>> On Wed, Dec 19, 2018 at 3:05 PM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I really enjoyed Beam Summit in London (Thanks Matthias!), and there was
>>> much enthusiasm for continuations.  We had selected that location in a
>>> large part due to the growing community there, and we have users in a
>>> variety of locations.  In our 2019 calendar,
>>> https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
>>> shared in the past weeks, 3 Summits are tentatively slotted for this year.
>>> Wanting to start running this by the group to get input.
>>>
>>> * Beam Summit NA, in San Francisco, approx 3 April 2019 (following Flink
>>> Forward).  I can organize.
>>> * Beam Summit Europe, in Stockholm, this was the runner up in voting
>>> falling behind London.  Or perhaps Berlin?  October-ish 2019
>>> * Beam Summit Asia, in Tokyo ??
>>>
>>> What are general thoughts on locations/dates?
>>>
>>> Looking forward to convening in person soon.
>>>
>>> Cheers,
>>> Austin
>>>
>>


Re: Beam Summits!

2019-01-06 Thread Thomas Weise
For the event in SF in April it would be necessary to get going soon.

For the type of information that will be required, you could take a look at
the proposal from the London event [1].  It was created with ~ 3 months
lead time. Note the draft mistakenly used "Apache" in the event name -
let's avoid that this time around. Also let's make it clear(er) who is
organizing vs. sponsoring the event.

Also see the event branding policy [2]. Note that approval for the event
will be required from the PMC and when using Apache marks also from the ASF
VP, Brand (or the PMC chair).

Thanks,
Thomas

[1]
https://docs.google.com/document/d/1h0y85vxt0AGYdz6SZCbV2jzUGs46_M-keUZTMsm2R0I/edit
[2] https://www.apache.org/foundation/marks/events





On Thu, Jan 3, 2019 at 12:39 PM Austin Bennett 
wrote:

> Hi Matthias, etc,
>
> Trying to get thoughts on formalizing a process for getting proposals
> together.  I look forward to the potential day that there are many people
> that want (rather than just willing) to host a summit in a given region in
> a given year.  Perhaps too forward looking.
>
> Also, you mentioned planning London wound up with a tight time window.  If
> shooting for April in SF, seems  the clock might be starting to tick.  Any
> advice for how much time needed?  And guidance on getting whatever formal
> needed through Apache - and does this also necessarily involve a Beam PMC
> or community vote (probably more related to the first paragraph)?
>
> Thanks,
> Austin
>
> On Thu, Dec 20, 2018, 1:09 AM Matthias Baetens  wrote:
>
>> Great stuff, thanks for the overview, Austin.
>>
>> For EU, there are things to say for both Stockholm and Berlin, but I
>> think it makes sense to do it on the back of another conference (larger
>> chance of people being in town with the same interest). I like Thomas's
>> comment - we will attract more people from the US if we don't let it
>> conflict with the big events there. +1 for doing it around the time of
>> Berlin Buzzwords.
>>
>> For Asia, I'd imagine Singapore would be an option as well. I'll reach
>> out to some people that are based there to get a grasp on the size of the
>> community there.
>>
>> Best,
>> -M
>>
>>
>>
>> On Thu, 20 Dec 2018 at 05:08, Thomas Weise  wrote:
>>
>>> I think for EU there is a proposal to have it next to Berlin Buzzwords
>>> in June. That would provide better spacing and avoid conflict with
>>> ApacheCon.
>>>
>>> Thomas
>>>
>>>
>>> On Wed, Dec 19, 2018 at 3:09 PM Suneel Marthi 
>>> wrote:
>>>
>>>> How about Beam Summit in Berlin on Sep 6 immediately following Flink
>>>> Forward Berlin on the previous 2 days.
>>>>
>>>> Same may be for Asia also following Flink Forward Asia where and
>>>> whenever it happens.
>>>>
>>>> On Wed, Dec 19, 2018 at 6:06 PM Austin Bennett <
>>>> whatwouldausti...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I really enjoyed Beam Summit in London (Thanks Matthias!), and there
>>>>> was much enthusiasm for continuations.  We had selected that location in a
>>>>> large part due to the growing community there, and we have users in a
>>>>> variety of locations.  In our 2019 calendar,
>>>>> https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
>>>>> shared in the past weeks, 3 Summits are tentatively slotted for this year.
>>>>> Wanting to start running this by the group to get input.
>>>>>
>>>>> * Beam Summit NA, in San Francisco, approx 3 April 2019 (following
>>>>> Flink Forward).  I can organize.
>>>>> * Beam Summit Europe, in Stockholm, this was the runner up in voting
>>>>> falling behind London.  Or perhaps Berlin?  October-ish 2019
>>>>> * Beam Summit Asia, in Tokyo ??
>>>>>
>>>>> What are general thoughts on locations/dates?
>>>>>
>>>>> Looking forward to convening in person soon.
>>>>>
>>>>> Cheers,
>>>>> Austin
>>>>>
>>>>


Re: Beam Summits!

2018-12-19 Thread Thomas Weise
I think for EU there is a proposal to have it next to Berlin Buzzwords in
June. That would provide better spacing and avoid conflict with ApacheCon.

Thomas


On Wed, Dec 19, 2018 at 3:09 PM Suneel Marthi  wrote:

> How about Beam Summit in Berlin on Sep 6 immediately following Flink
> Forward Berlin on the previous 2 days.
>
> Same may be for Asia also following Flink Forward Asia where and whenever
> it happens.
>
> On Wed, Dec 19, 2018 at 6:06 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Hi All,
>>
>> I really enjoyed Beam Summit in London (Thanks Matthias!), and there was
>> much enthusiasm for continuations.  We had selected that location in a
>> large part due to the growing community there, and we have users in a
>> variety of locations.  In our 2019 calendar,
>> https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
>> shared in the past weeks, 3 Summits are tentatively slotted for this year.
>> Wanting to start running this by the group to get input.
>>
>> * Beam Summit NA, in San Francisco, approx 3 April 2019 (following Flink
>> Forward).  I can organize.
>> * Beam Summit Europe, in Stockholm, this was the runner up in voting
>> falling behind London.  Or perhaps Berlin?  October-ish 2019
>> * Beam Summit Asia, in Tokyo ??
>>
>> What are general thoughts on locations/dates?
>>
>> Looking forward to convening in person soon.
>>
>> Cheers,
>> Austin
>>
>


Re: Stackoverflow Questions

2018-11-06 Thread Thomas Weise
It would be valuable to get a digest of SO questions on user@ (and dev@)
but I think this should be one-way, i.e. respective questions should be
answered on SO.

Thomas

On Mon, Nov 5, 2018 at 12:57 PM Kenneth Knowles  wrote:

> +user@
>
> I think we'd better ask user@ before we subscribe the list to a regular
> automated email. Daily might be OK for dev@ but I would guess that user@
> might prefer less frequent. It will have a predictable subject so it should
> be easy to filter if someone is not interested.
>
> Kenn
>
> On Mon, Nov 5, 2018 at 12:53 PM Ankur Goenka  wrote:
>
>> +1 for the daily/weekly digest to user@
>>
>> On Mon, Nov 5, 2018 at 10:52 AM Maximilian Michels 
>> wrote:
>>
>>> Great idea! I'd prefer a daily/weekly digest if possible.
>>>
>>> On 05.11.18 19:44, Tim Robertson wrote:
>>> > Thanks for raising this Anton
>>> >
>>> >   It would be very easy to forward new SO questions to the user@
>>> > list, or a new list if we're worried about the noise.
>>> >
>>> >
>>> > +1 (preference on user@ until there are too many)
>>> >
>>> >
>>> >
>>> > On Mon, Nov 5, 2018 at 7:18 PM Scott Wegner >> > > wrote:
>>> >
>>> > I like the idea of working to improve our presence on Q&A sites
>>> > like StackOverflow. SO is a great resource and much more
>>> > discoverable / searchable than a mail archive.
>>> >
>>> > One idea on how to improve our presence: StackOverflow supports
>>> > setting up email subscriptions [1] for particular tags. It would be
>>> > very easy to forward new SO questions to the user@ list, or a new
>>> > list if we're worried about the noise.
>>> >
>>> > [1] https://stackexchange.com/filters/new
>>> >
>>> > On Mon, Nov 5, 2018 at 9:54 AM Jean-Baptiste Onofré <
>>> j...@nanthrax.net
>>> > > wrote:
>>> >
>>> > That's "classic" in the Apache projects. And yes, most of the
>>> > time, we
>>> > periodically send or ask the dev to check the questions on
>>> other
>>> > channels like stackoverflow.
>>> >
>>> > It makes sense to send a reminder or a list of open questions
>>> on the
>>> > user mailing list (users can help each other too).
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 05/11/2018 18:25, Anton Kedin wrote:
>>> >  > Hi dev@,
>>> >  >
>>> >  > I was looking at stackoverflow questions tagged with
>>> > `apache-beam` [1]
>>> >  > and wanted to ask your opinion. It feels like it's easier
>>> for
>>> > some users
>>> >  > to ask questions on stackoverflow than on user@. Overall
>>> > frequency
>>> >  > between the two channels seems comparable but a lot of
>>> > stackoverflow
>>> >  > questions are not answered while questions on user@ get
>>> some
>>> > attention
>>> >  > most of the time. Would it make sense to increase dev@
>>> > visibility into
>>> >  > stackoverflow, e.g. by sending periodic digest or some
>>> other way?
>>> >  >
>>> >  > [1] https://stackoverflow.com/questions/tagged/apache-beam
>>> >  >
>>> >  > Regards,
>>> >  > Anton
>>> >
>>> > --
>>> > Jean-Baptiste Onofré
>>> > jbono...@apache.org 
>>> > http://blog.nanthrax.net
>>> > Talend - http://www.talend.com
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> >
>>> >
>>> >
>>> > Got feedback? tinyurl.com/swegner-feedback
>>> > 
>>> >
>>>
>>


Re: Flink 1.6 Support

2018-10-30 Thread Thomas Weise
There has not been any decision to move to 1.6.x for the next release yet.

There has been related general discussion about upgrading runners recently
[1]

Overall we need to consider which newer Flink versions users encounter (the
Flink version shipped in distributions and what users typically have in
their deployment stacks). These upgrades are not automatic/cheap/fast, so
there is a balance to strike.

The good news is that with Beam 2.8.0 you should be able to make a build
for 1.6.x with just a version number change [2]  (Other compile differences
have been cleaned up.)

[1]
https://lists.apache.org/thread.html/0588ed783767991aa36b00b8529bbd29b3a8958ee6e82fca83ac2938@%3Cdev.beam.apache.org%3E
[2]
https://github.com/apache/beam/blob/v2.8.0/runners/flink/build.gradle#L49
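
A sketch of the version-change build described above (the exact Gradle
property and module names are assumptions -- check runners/flink/build.gradle
in the tag you build from):

```shell
# Build the Flink runner from the 2.8.0 tag against Flink 1.6.x.
git clone https://github.com/apache/beam.git && cd beam
git checkout v2.8.0
# Edit the Flink version in runners/flink/build.gradle (see link [2] above),
# then rebuild the runner module (module name assumed):
./gradlew :beam-runners-flink_2.11:build
```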


On Tue, Oct 30, 2018 at 10:50 AM Lukasz Cwik  wrote:

> +dev 
>
> On Tue, Oct 30, 2018 at 10:30 AM Jins George 
> wrote:
>
>> Hi Community,
>>
>> Noticed that the Beam 2.8 release comes with a Flink 1.5.x dependency.
>> Are there any plans to upgrade Flink to 1.6.x in the next Beam release?
>> (I am looking for the better k8s support in Flink 1.6.)
>>
>> Thanks,
>>
>> Jins George
>>
>>


Re: Pass-through transform for accessing runner-specific functionality from Beam?

2018-10-17 Thread Thomas Weise
With the portable runner it is possible to create Flink native transforms
to expose features of Flink.

You can find an example for a source here:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py
Another one that uses the Flink Kafka/Kinesis connectors in the Lyft fork:
https://github.com/lyft/beam/blob/release-2.8.0-lyft/sdks/python/custom-source-example.py

Our need was to have a streaming connector that the Python SDK does not
offer. It is a customization that works by adding a transform wrapper to the
SDK you are using (Python, in the examples above) and then adding a
translation to the runner to handle that custom URN. Currently this requires
augmenting the Flink runner, although it should not be hard to make it
pluggable (i.e. drop an annotated translator class into the job server class
path).
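The pluggable-translator idea can be sketched in plain Python. This is purely
illustrative: the registry, the register_urn decorator, the translator class,
and the URN string below are hypothetical names, not the actual Beam or Flink
runner API.

```python
# Illustrative sketch of a pluggable URN -> translator registry. All names
# here (register_urn, KafkaInputTranslator, "example:flinkKafkaInput") are
# hypothetical, not part of the real Beam runner API.

URN_TRANSLATORS = {}

def register_urn(urn):
    """Class decorator that registers a translator for a custom transform URN."""
    def wrapper(cls):
        URN_TRANSLATORS[urn] = cls()
        return cls
    return wrapper

@register_urn("example:flinkKafkaInput")
class KafkaInputTranslator:
    def translate(self, payload):
        # A real runner would emit a native Flink Kafka source here; we
        # just echo the configuration that would be applied.
        return "FlinkKafkaConsumer(topic=%s)" % payload["topic"]

def translate_transform(urn, payload):
    """Look up the translator for a custom URN, as a job server might."""
    translator = URN_TRANSLATORS.get(urn)
    if translator is None:
        raise ValueError("No translator registered for URN: %s" % urn)
    return translator.translate(payload)

print(translate_transform("example:flinkKafkaInput", {"topic": "events"}))
# prints: FlinkKafkaConsumer(topic=events)
```

Making the runner pluggable would then amount to populating such a registry by
scanning the job server class path for annotated translator classes.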

Again, this works only with the portable Flink runner and makes the pipeline
non-portable, which is what you were interested in.

Thanks,
Thomas


On Wed, Oct 17, 2018 at 2:44 PM Lukasz Cwik  wrote:

> +t...@apache.org has been doing something very similar, using it to
> support native Flink IO within Apache Beam at the company he works for.
> Note that the community had a discussion about runner specific extensions
> and is currently leaning[1] towards having support for them for internal
> use cases but not allowing those extensions to be part of Apache Beam
> publicly.
>
> 1:
> https://lists.apache.org/thread.html/38b796c4c49823cf946affdb1a457ddf1d142403803b9c6a32442057@%3Cdev.beam.apache.org%3E
>
> On Wed, Oct 17, 2018 at 1:36 PM Urban, Jaroslav <
> jaroslav.ur...@teradata.com> wrote:
>
>> Dear Co-Beamers,
>>
>>
>>
>> I am curious about the possibility of using Complex Event Processing
>> (CEP) package in Flink from Apache Beam with the Flink Runner.
>>
>>
>>
>> I have an architectural question.
>>
>>
>>
>> I know that Apache Beam and its reference model (
>> https://beam.apache.org/documentation/runners/capability-matrix/) do
>> not have any kind of support for complex event processing (i.e. pattern
>> matching on streams of events and their attributes). There is merely a JIRA
>> ticket suggesting future development in this direction (
>> https://issues.apache.org/jira/browse/BEAM-3767).
>>
>>
>>
>> Does anyone know if there is currently some kind of “pass-through”
>> transform in Apache Beam that would make it possible to access the existing
>> CEP functionality in Flink, or, for that matter, to access
>> runner-specific features in general?
>>
>>
>>
>> What do you think? Any ideas?
>>
>>
>>
>> Best regards
>>
>>
>>
>> *Jaroslav Urban*
>>
>> Consultant (DWH, OpenSource, Cloud)
>> jaroslav.ur...@teradata.com
>>
>>


Re: Modular IO presentation at Apachecon

2018-09-26 Thread Thomas Weise
Thanks for sharing. I'm looking forward to seeing the recording of the talk
(hopefully!).

This will be very helpful for Beam users. IO is still typically the
unexpectedly hard and time-consuming part of authoring pipelines.


On Wed, Sep 26, 2018 at 2:48 PM Alan Myrvold  wrote:

> Thanks for the slides.
> Really enjoyed the talk in person, especially the concept that IO is a
> transformation (a source or sink is not special), and the splittable
> DoFn explanation.
>
> On Wed, Sep 26, 2018 at 2:17 PM Ismaël Mejía  wrote:
>
>> Hello, today Eugene and I gave a talk about modular APIs for IO
>> at ApacheCon. The talk introduces some common patterns that we have
>> found while creating IO connectors and also presents recent ideas like
>> dynamic destinations and sequential writes, among others, using FileIO as a
>> use case.
>>
>> In case you guys want to take a look, here is a copy of the slides; we
>> will probably add this to the IO authoring documentation too.
>>
>> https://s.apache.org/beam-modular-io-talk
>>
>


Re: Unsubscribe

2018-07-24 Thread Thomas Weise
To unsubscribe, please use the -unsubscribe addresses listed on
https://beam.apache.org/community/contact-us/



On Tue, Jul 24, 2018 at 6:34 AM Chandan Biswas 
wrote:

>
>


Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Thomas Weise
Eugene,

I actually had one question regarding the application of SDF to the Kafka
consumer. Reading through a topic partition can be parallelized by splitting a
partition into multiple restrictions (for use cases where order does not
matter). But how would the tail read be managed? I assume there would not
be a new restriction whenever new records arrive (added latency)? The
examples on slide 40 show an end offset for Kafka, but for a continuous
read there wouldn't be an end offset?
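The tail-read question can be illustrated with a toy restriction model. This
is a hand-written sketch under stated assumptions, not the Beam SDF API: the
OffsetRange and OffsetRangeTracker classes below are hypothetical stand-ins
showing how an infinite end offset lets a runner split a continuous read into
a bounded primary and a residual that keeps tailing, instead of creating a new
restriction per arriving record.

```python
# Toy model (not the Beam SDF API) of an unbounded offset-range restriction
# for a continuous (tail) read of a Kafka-style partition.
import math

class OffsetRange:
    def __init__(self, start, stop=math.inf):
        self.start = start
        self.stop = stop  # math.inf models a read with no end offset

    def split_at(self, offset):
        """Split into a bounded primary and a residual that keeps tailing."""
        return OffsetRange(self.start, offset), OffsetRange(offset, self.stop)

class OffsetRangeTracker:
    def __init__(self, restriction):
        self.restriction = restriction
        self.last_claimed = None

    def try_claim(self, offset):
        # Only offsets before the end may be claimed; an infinite end
        # means the tracker is never "done" on its own.
        if offset >= self.restriction.stop:
            return False
        self.last_claimed = offset
        return True

tracker = OffsetRangeTracker(OffsetRange(0))          # unbounded tail read
assert tracker.try_claim(41)                          # a new record arrives
primary, residual = tracker.restriction.split_at(42)  # runner checkpoints
print(primary.stop, residual.stop)                    # prints: 42 inf
```

In this model latency comes from the runner's checkpoint cadence rather than
from minting a new restriction per record, which matches the concern raised
above.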

Thanks,
Thomas


On Thu, Mar 8, 2018 at 2:59 PM, Thomas Weise <t...@apache.org> wrote:

> Great, thanks for sharing!
>
>
> On Thu, Mar 8, 2018 at 12:16 PM, Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Oops that's just the template I used. Thanks for noticing, will
>> regenerate the PDF and reupload when I get to it.
>>
>>
>> On Thu, Mar 8, 2018, 11:59 AM Dan Halperin <dhalp...@apache.org> wrote:
>>
>>> Looks like it was a good talk! Why is it Google Confidential &
>>> Proprietary, though?
>>>
>>> Dan
>>>
>>> On Thu, Mar 8, 2018 at 11:49 AM, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>> The slides for my yesterday's talk at Strata San Jose
>>>> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
>>>> have been posted on the talk page. They may be
>>>> of interest both to users and IO authors.
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>