Re: Beam Emitted Metrics Reference

2020-03-02 Thread Kenneth Knowles
Also, it seems like each IO would benefit from an entry in a Transform Catalog
with a description of any IO-specific metrics it emits. (technically these
may not be what you mean by "framework-emitted metrics")

Kenn

On Mon, Mar 2, 2020 at 9:40 AM Alex Amato  wrote:

> MonitoringInfoSpecs is effectively a list of metrics,
> but its purpose is to simply define how SDKs should populate MonitoringInfo
> protos for a RunnerHarness to interpret.
>
> These metrics are provided by the Java and Python SDKs, and Go will soon
> provide all of them as well.
>
> But there is no requirement for a particular runner to support any of
> these metrics. The DataflowRunner will support these and the metrics are
> accessible via Dataflow APIs. I am not sure of the state of other runners.
>
>
>
>
> On Mon, Mar 2, 2020 at 2:47 AM Etienne Chauchot 
> wrote:
>
>> Hi,
>>
>> There is a doc about metrics here:
>> https://beam.apache.org/documentation/programming-guide/#metrics
>>
>> You can also export the metrics to sinks (REST http endpoint and
>> Graphite), see MetricsOptions class for configuration.
>>
>> Still, there is no doc for export on website, I'll add some
>>
>> Best
>>
>> Etienne
>> On 28/02/2020 01:07, Pablo Estrada wrote:
>>
>> Hi Daniel!
>> I think +Alex Amato  had tried to have an inventory
>> of metrics at some point.
>> Other than that, I don't think we have a document outlining them.
>>
>> Can you talk about what you plan to do with them? Do you plan to export
>> them somehow? Do you plan to add your own?
>> Best
>> -P.
>>
>> On Thu, Feb 27, 2020 at 11:33 AM Daniel Chen  wrote:
>>
>>> Hi all,
>>>
>>> I have some questions about the reference to the framework metrics emitted by
>>> Beam. I would like to leverage these metrics to allow better monitoring of
>>> my Beam jobs but cannot find any references to the description or a
>>> complete set of emitted metrics.
>>>
>>> Do we have this information documented anywhere?
>>>
>>> Thanks,
>>> Daniel
>>>
>>


Re: Error logging from fn_api_runners

2020-03-02 Thread Robert Bradshaw
Yeah, this was an oversight on my part. I don't think we need to log
this at all. https://github.com/apache/beam/pull/11021 for anyone to
look at.

On Mon, Mar 2, 2020 at 2:44 PM Heejong Lee  wrote:
>
> I think it should be either info or debug but not error.
>
> On Mon, Mar 2, 2020 at 2:35 PM Ning Kang  wrote:
>>
>> Hi,
>>
>> I just observed some error level loggings like these:
>> ```
>> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers 
>> {'worker_5': 
>> > at 0x127fdaa58>}
>> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers 
>> {'worker_5': 
>> > at 0x127fdaa58>}
>> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers 
>> {'worker_5': 
>> > at 0x127fdaa58>}
>> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers 
>> {'worker_5': 
>> > at 0x127fdaa58>}
>> ```
>> It's coming from this PR.
>> ```
>>
>> def get_worker_handlers(
>>     self,
>>     environment_id,  # type: Optional[str]
>>     num_workers  # type: int
>> ):
>>   # type: (...) -> List[WorkerHandler]
>>   if environment_id is None:
>>     # Any environment will do, pick one arbitrarily.
>>     environment_id = next(iter(self._environments.keys()))
>>   environment = self._environments[environment_id]
>>
>>   # assume all environments except EMBEDDED_PYTHON use gRPC.
>>   if environment.urn == python_urns.EMBEDDED_PYTHON:
>>     # special case for EmbeddedWorkerHandler: there's no need for a gRPC
>>     # server, but to pass the type check on WorkerHandler.create() we
>>     # make like we have a GrpcServer instance.
>>     self._grpc_server = cast(GrpcServer, None)
>>   elif self._grpc_server is None:
>>     self._grpc_server = GrpcServer(
>>         self._state, self._job_provision_info, self)
>>
>>   worker_handler_list = self._cached_handlers[environment_id]
>>   if len(worker_handler_list) < num_workers:
>>     for _ in range(len(worker_handler_list), num_workers):
>>       worker_handler = WorkerHandler.create(
>>           environment,
>>           self._state,
>>           self._job_provision_info,
>>           self._grpc_server)
>>       _LOGGER.info(
>>           "Created Worker handler %s for environment %s",
>>           worker_handler,
>>           environment)
>>       self._cached_handlers[environment_id].append(worker_handler)
>>       self._workers_by_id[worker_handler.worker_id] = worker_handler
>>       worker_handler.start_worker()
>>   _LOGGER.error("created %s workers %s", num_workers, self._workers_by_id)
>>   return self._cached_handlers[environment_id][:num_workers]
>>
>> ```
>> Is this supposed to be info-level logging?
>>
>> Thanks!
>>
>> Ning.


Re: Python Static Typing: Next Steps

2020-03-02 Thread Robert Bradshaw
It seems people are conflating git pre-commit hooks (which IMHO should
ideally be in the sub-second range, and run when an author does "git
commit") with jenkins pre-commit tests (for which minutes is nothing
compared to what we already do). I am +1 to adding mypy to the latter
for sure, and think we should probably hold off for the former.

On Mon, Mar 2, 2020 at 4:38 PM Udi Meiri  wrote:
>
> Off-topic: Python lint via pre-commit should be much faster. (I wrote my own 
> modified-file-only lint in the past)
>
> On Mon, Mar 2, 2020 at 2:08 PM Kyle Weaver  wrote:
>>
>> > Python lint takes 4-5mins to complete. I think if the mypy analysis is 
>> > really on the order of 10s, the additional time won't matter and could 
>> > always be enabled.
>>
>> +1 of course it would be nice to make mypy as fast as possible, but I don't 
>> think speed needs to be a blocker. The productivity gains we'd get from 
>> reliable type analysis more than offset the cost IMO.
>>
>> On Mon, Mar 2, 2020 at 2:03 PM Luke Cwik  wrote:
>>>
>>> Python lint takes 4-5mins to complete. I think if the mypy analysis is 
>>> really on the order of 10s, the additional time won't matter and could 
>>> always be enabled.
>>>
>>> On Mon, Mar 2, 2020 at 1:21 PM Chad Dombrova  wrote:
>
> I believe that mypy via pre-commit hook will be faster than 10s since it 
> only applies to modified files.


 Correct, with a few caveats:

 pre-commit can be setup to only run if a python file changes.  so 
 modifying a java file won't trigger mypy to run.
 if *any* python file changes mypy has to run on the whole codebase, 
 because a change to one file can affect the others (i.e. a function arg 
 type changes).  it's not really meaningful to run mypy on a single file.
 the mypy daemon tracks which files have changed, and runs incremental 
 updates.  so if we setup the precommit hook to run the daemon, we should 
 see that get appreciably faster.  I'll do some tests and report back.


Re: Java SplittableDoFn Watermark API

2020-03-02 Thread Robert Bradshaw
I don't have a strong preference for using a provider/having a set of
tightly coupled methods in Java, other than that we be consistent (and
we already use the methods style for restrictions).

On Mon, Mar 2, 2020 at 3:32 PM Luke Cwik  wrote:
>
> Jan, there are some parts of Apache Beam that the watermarks package will likely
> rely on (@Experimental annotation, javadoc links) but fundamentally should 
> not rely on core and someone could create a separate package for this.

I think it does make sense for a set of common watermark trackers to
be shipped with core (e.g. manual, monotonic, and eventually a
probabilistic one).

> Ismael, the unification of bounded/unbounded within SplittableDoFn has always 
> been a goal. There are a set of features that BoundedSources are unlikely to 
> use but would still be allowed to use. For example, bounded sources may
> want to have support for checkpointing since I could foresee a BoundedSource
> that can notice that a certain resource becomes unavailable and can only 
> process it later. The choice of which watermark estimator to use is a likely 
> point of difference between bounded and unbounded SDFs since bounded SDFs 
> would be able to use a very simple estimator where the watermark is held at 
> -infinity and only advances to +infinity once there is no more data to 
> process. But even though unbounded SDFs are likely to be the most common 
> users of varied watermark estimators, a bounded SDF may still want to advance 
> the watermark as they read records so that runners that are more "streaming" 
> (for example micro batch) could process the entire pipeline in parallel vs 
> other runners that execute one whole segment of the pipeline at a time.

Put another way, the value of watermark trackers is to allow
processing to continue downstream before the source has completed
reading. This is of course essential for streaming, but if the source
is read to completion before downstream stages start (as is the case
for most batch runners) it is not needed. What this unification does
allow, however, is a source to be written in such a way that can be
efficiently used in both batch and streaming mode.

> Currently the watermark that is reported as part of the PollResult is passed 
> to the ProcessContext.updateWatermark [1, 2] function and instead that call 
> would be redirected to the ManualWatermarkEstimator.setWatermark function [3].
>
> 1: 
> https://github.com/apache/beam/blob/a16725593b84b84b37bc67cd202d1ac8b724c6f4/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L757
> 2: 
> https://github.com/apache/beam/blob/a16725593b84b84b37bc67cd202d1ac8b724c6f4/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L275
> 3: 
> https://github.com/apache/beam/blob/fb42666a4e1aec0413f161c742d8f010ef9fe9f2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ManualWatermarkEstimator.java#L45
>
> On Fri, Feb 28, 2020 at 6:09 AM Ismaël Mejía  wrote:
>>
>> I just realized that the HBaseIO example is not a good one because we can
>> already have Watch-like behavior as we do for Partition discovery in
>> HCatalogIO.
>> Still I am interested in your views on bounded/unbounded unification.
>>
>> A second interesting question: how will these annotations connect with the Watch
>> transform polling patterns?
>> https://github.com/apache/beam/blob/650e6cd9c707472c34055382a1356cf22d14ee5e/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L178
>>
>>
>> On Fri, Feb 28, 2020 at 10:47 AM Ismaël Mejía  wrote:
>>>
>>> Really interesting! Implementing the watermark correctly has been a common
>>> struggle for IO authors, to the point that some IOs still have issues around
>>> that. So +1 for this, in particular if we can get to reuse common patterns.
>>> I was not aware of Boyuan's work around this, really nice.
>>>
>>> One aspect I have always been confused about since I read the SDF proposal
>>> documents is whether we could get to a single API for both Bounded and
>>> Unbounded
>>> IO by somehow treating a BoundedSDF as a special case of an
>>> UnboundedSDF.
>>> Could WatermarkEstimator help in this direction?
>>>
>>> One quick case that I can think of is to make the current HBaseIO SDF work
>>> in an
>>> unbounded manner, for example to 'watch and read new tables'.
>>>
>>>
>>> On Thu, Feb 27, 2020 at 11:43 PM Luke Cwik  wrote:

 See this doc[1] and blog[2] for some context about SplittableDoFns.

 To support watermark reporting within the Java SDK for SplittableDoFns, we 
 need a way for SDF authors to report watermark estimates over the
 element and restriction pair that they are processing.

 For UnboundedSources, it was found to be a pain point to ask each SDF 
 author to write their own watermark estimation which typically prevented 
 re-use. Therefore we would like to have a "library" of watermark 
 estimators that help SDF authors perform this estimation similar to how
 there is a "library" of restrictions and restriction trackers that SDF
 authors can use.

Re: Python Static Typing: Next Steps

2020-03-02 Thread Udi Meiri
Off-topic: Python lint via pre-commit should be much faster. (I wrote my
own modified-file-only lint in the past)

On Mon, Mar 2, 2020 at 2:08 PM Kyle Weaver  wrote:

> > Python lint takes 4-5mins to complete. I think if the mypy analysis is
> really on the order of 10s, the additional time won't matter and could
> always be enabled.
>
> +1 of course it would be nice to make mypy as fast as possible, but I
> don't think speed needs to be a blocker. The productivity gains we'd get
> from reliable type analysis more than offset the cost IMO.
>
> On Mon, Mar 2, 2020 at 2:03 PM Luke Cwik  wrote:
>
>> Python lint takes 4-5mins to complete. I think if the mypy analysis is
>> really on the order of 10s, the additional time won't matter and could
>> always be enabled.
>>
>> On Mon, Mar 2, 2020 at 1:21 PM Chad Dombrova  wrote:
>>
>>> I believe that mypy via pre-commit hook will be faster than 10s since it
 only applies to modified files.

>>>
>>> Correct, with a few caveats:
>>>
>>>- pre-commit can be setup to only run if a python file changes.  so
>>>modifying a java file won't trigger mypy to run.
>>>- if *any* python file changes mypy has to run on the whole
>>>codebase, because a change to one file can affect the others (i.e. a
>>>function arg type changes).  it's not really meaningful to run mypy on a
>>>single file.
>>>- the mypy daemon tracks which files have changed, and runs
>>>incremental updates.  so if we setup the precommit hook to run the 
>>> daemon,
>>>we should see that get appreciably faster.  I'll do some tests and report
>>>back.
>>>
>>>




Re: Java SplittableDoFn Watermark API

2020-03-02 Thread Luke Cwik
Jan, there are some parts of Apache Beam that the watermarks package will likely
rely on (@Experimental annotation, javadoc links) but fundamentally should
not rely on core and someone could create a separate package for this.

Ismael, the unification of bounded/unbounded within SplittableDoFn has
always been a goal. There are a set of features that BoundedSources are
unlikely to use but would still be allowed to use. For example,
bounded sources may want to have support for checkpointing since I could
foresee a BoundedSource that can notice that a certain resource becomes
unavailable and can only process it later. The choice of which watermark
estimator to use is a likely point of difference between bounded and
unbounded SDFs since bounded SDFs would be able to use a very simple
estimator where the watermark is held at -infinity and only advances
to +infinity once there is no more data to process. But even though
unbounded SDFs are likely to be the most common users of varied watermark
estimators, a bounded SDF may still want to advance the watermark as they
read records so that runners that are more "streaming" (for example micro
batch) could process the entire pipeline in parallel vs other runners that
execute one whole segment of the pipeline at a time.

Currently the watermark that is reported as part of the PollResult is
passed to the ProcessContext.updateWatermark [1, 2] function and instead
that call would be redirected to the ManualWatermarkEstimator.setWatermark
function [3].

1:
https://github.com/apache/beam/blob/a16725593b84b84b37bc67cd202d1ac8b724c6f4/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L757
2:
https://github.com/apache/beam/blob/a16725593b84b84b37bc67cd202d1ac8b724c6f4/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L275
3:
https://github.com/apache/beam/blob/fb42666a4e1aec0413f161c742d8f010ef9fe9f2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ManualWatermarkEstimator.java#L45

On Fri, Feb 28, 2020 at 6:09 AM Ismaël Mejía  wrote:

> I just realized that the HBaseIO example is not a good one because we can
> already have Watch-like behavior as we do for Partition discovery in
> HCatalogIO.
> Still I am interested in your views on bounded/unbounded unification.
>
> A second interesting question: how will these annotations connect with the Watch
> transform polling patterns?
>
> https://github.com/apache/beam/blob/650e6cd9c707472c34055382a1356cf22d14ee5e/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L178
>
>
> On Fri, Feb 28, 2020 at 10:47 AM Ismaël Mejía  wrote:
>
>> Really interesting! Implementing the watermark correctly has been a common
>> struggle for IO authors, to the point that some IOs still have issues
>> around
>> that. So +1 for this, in particular if we can get to reuse common
>> patterns.
>> I was not aware of Boyuan's work around this, really nice.
>>
>> One aspect I have always been confused about since I read the SDF
>> proposal
>> documents is whether we could get to a single API for both Bounded and
>> Unbounded
>> IO by somehow treating a BoundedSDF as a special case of an
>> UnboundedSDF.
>> Could WatermarkEstimator help in this direction?
>>
>> One quick case that I can think of is to make the current HBaseIO SDF
>> work in an
>> unbounded manner, for example to 'watch and read new tables'.
>>
>>
>> On Thu, Feb 27, 2020 at 11:43 PM Luke Cwik  wrote:
>>
>>> See this doc[1] and blog[2] for some context about SplittableDoFns.
>>>
>>> To support watermark reporting within the Java SDK for SplittableDoFns,
>>> we need a way for SDF authors to report watermark estimates over the
>>> element and restriction pair that they are processing.
>>>
>>> For UnboundedSources, it was found to be a pain point to ask each SDF
>>> author to write their own watermark estimation which typically prevented
>>> re-use. Therefore we would like to have a "library" of watermark estimators
>>> that help SDF authors perform this estimation similar to how there is a
>>> "library" of restrictions and restriction trackers that SDF authors can
>>> use. For SDF authors where the existing library doesn't work, they can add
>>> additional ones that observe timestamps of elements or choose to directly
>>> report the watermark through a "ManualWatermarkEstimator" parameter that
>>> can be supplied to @ProcessElement methods.
>>>
>>> The public facing portion of the DoFn changes adds three new annotations
>>> for new DoFn style methods:
>>> GetInitialWatermarkEstimatorState: Returns the initial watermark state,
>>> similar to GetInitialRestriction
>>> GetWatermarkEstimatorStateCoder: Returns a coder compatible with
>>> watermark state type, similar to GetRestrictionCoder for restrictions
>>> returned by GetInitialRestriction.
>>> NewWatermarkEstimator: Returns a watermark estimator that either the
>>> framework invokes allowing it to observe the timestamps of output records
>>> or a manual watermark

Re: Error logging from fn_api_runners

2020-03-02 Thread Heejong Lee
I think it should be either info or debug but not error.

On Mon, Mar 2, 2020 at 2:35 PM Ning Kang  wrote:

> Hi,
>
> I just observed some error level loggings like these:
> ```
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ```
> It's coming from this PR.
> ```
>
> def get_worker_handlers(
>     self,
>     environment_id,  # type: Optional[str]
>     num_workers  # type: int
> ):
>   # type: (...) -> List[WorkerHandler]
>   if environment_id is None:
>     # Any environment will do, pick one arbitrarily.
>     environment_id = next(iter(self._environments.keys()))
>   environment = self._environments[environment_id]
>
>   # assume all environments except EMBEDDED_PYTHON use gRPC.
>   if environment.urn == python_urns.EMBEDDED_PYTHON:
>     # special case for EmbeddedWorkerHandler: there's no need for a gRPC
>     # server, but to pass the type check on WorkerHandler.create() we
>     # make like we have a GrpcServer instance.
>     self._grpc_server = cast(GrpcServer, None)
>   elif self._grpc_server is None:
>     self._grpc_server = GrpcServer(
>         self._state, self._job_provision_info, self)
>
>   worker_handler_list = self._cached_handlers[environment_id]
>   if len(worker_handler_list) < num_workers:
>     for _ in range(len(worker_handler_list), num_workers):
>       worker_handler = WorkerHandler.create(
>           environment,
>           self._state,
>           self._job_provision_info,
>           self._grpc_server)
>       _LOGGER.info(
>           "Created Worker handler %s for environment %s",
>           worker_handler,
>           environment)
>       self._cached_handlers[environment_id].append(worker_handler)
>       self._workers_by_id[worker_handler.worker_id] = worker_handler
>       worker_handler.start_worker()
>   _LOGGER.error("created %s workers %s", num_workers, self._workers_by_id)
>   return self._cached_handlers[environment_id][:num_workers]
>
> ```
> Is this supposed to be info-level logging?
>
> Thanks!
>
> Ning.
>


Error logging from fn_api_runners

2020-03-02 Thread Ning Kang
Hi,

I just observed some error level loggings like these:
```
ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
{'worker_5':
}
ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
{'worker_5':
}
ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
{'worker_5':
}
ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
{'worker_5':
}
```
It's coming from this PR.
```

def get_worker_handlers(
    self,
    environment_id,  # type: Optional[str]
    num_workers  # type: int
):
  # type: (...) -> List[WorkerHandler]
  if environment_id is None:
    # Any environment will do, pick one arbitrarily.
    environment_id = next(iter(self._environments.keys()))
  environment = self._environments[environment_id]

  # assume all environments except EMBEDDED_PYTHON use gRPC.
  if environment.urn == python_urns.EMBEDDED_PYTHON:
    # special case for EmbeddedWorkerHandler: there's no need for a gRPC
    # server, but to pass the type check on WorkerHandler.create() we
    # make like we have a GrpcServer instance.
    self._grpc_server = cast(GrpcServer, None)
  elif self._grpc_server is None:
    self._grpc_server = GrpcServer(
        self._state, self._job_provision_info, self)

  worker_handler_list = self._cached_handlers[environment_id]
  if len(worker_handler_list) < num_workers:
    for _ in range(len(worker_handler_list), num_workers):
      worker_handler = WorkerHandler.create(
          environment,
          self._state,
          self._job_provision_info,
          self._grpc_server)
      _LOGGER.info(
          "Created Worker handler %s for environment %s",
          worker_handler,
          environment)
      self._cached_handlers[environment_id].append(worker_handler)
      self._workers_by_id[worker_handler.worker_id] = worker_handler
      worker_handler.start_worker()
  _LOGGER.error("created %s workers %s", num_workers, self._workers_by_id)
  return self._cached_handlers[environment_id][:num_workers]

```
Is this supposed to be info-level logging?

Thanks!

Ning.


Re: Python Static Typing: Next Steps

2020-03-02 Thread Kyle Weaver
> Python lint takes 4-5mins to complete. I think if the mypy analysis is
really on the order of 10s, the additional time won't matter and could
always be enabled.

+1 of course it would be nice to make mypy as fast as possible, but I don't
think speed needs to be a blocker. The productivity gains we'd get from
reliable type analysis more than offset the cost IMO.

On Mon, Mar 2, 2020 at 2:03 PM Luke Cwik  wrote:

> Python lint takes 4-5mins to complete. I think if the mypy analysis is
> really on the order of 10s, the additional time won't matter and could
> always be enabled.
>
> On Mon, Mar 2, 2020 at 1:21 PM Chad Dombrova  wrote:
>
>> I believe that mypy via pre-commit hook will be faster than 10s since it
>>> only applies to modified files.
>>>
>>
>> Correct, with a few caveats:
>>
>>- pre-commit can be setup to only run if a python file changes.  so
>>modifying a java file won't trigger mypy to run.
>>- if *any* python file changes mypy has to run on the whole codebase,
>>because a change to one file can affect the others (i.e. a function arg
>>type changes).  it's not really meaningful to run mypy on a single file.
>>- the mypy daemon tracks which files have changed, and runs
>>incremental updates.  so if we setup the precommit hook to run the daemon,
>>we should see that get appreciably faster.  I'll do some tests and report
>>back.
>>
>>


Re: Python Static Typing: Next Steps

2020-03-02 Thread Luke Cwik
Python lint takes 4-5mins to complete. I think if the mypy analysis is
really on the order of 10s, the additional time won't matter and could
always be enabled.

On Mon, Mar 2, 2020 at 1:21 PM Chad Dombrova  wrote:

> I believe that mypy via pre-commit hook will be faster than 10s since it
>> only applies to modified files.
>>
>
> Correct, with a few caveats:
>
>- pre-commit can be setup to only run if a python file changes.  so
>modifying a java file won't trigger mypy to run.
>- if *any* python file changes mypy has to run on the whole codebase,
>because a change to one file can affect the others (i.e. a function arg
>type changes).  it's not really meaningful to run mypy on a single file.
>- the mypy daemon tracks which files have changed, and runs
>incremental updates.  so if we setup the precommit hook to run the daemon,
>we should see that get appreciably faster.  I'll do some tests and report
>back.
>
>


Re: Python Static Typing: Next Steps

2020-03-02 Thread Chad Dombrova
>
> I believe that mypy via pre-commit hook will be faster than 10s since it
> only applies to modified files.
>

Correct, with a few caveats:

   - pre-commit can be setup to only run if a python file changes.  so
   modifying a java file won't trigger mypy to run.
   - if *any* python file changes mypy has to run on the whole codebase,
   because a change to one file can affect the others (e.g. a function arg
   type changes; see the sketch after this list).  it's not really meaningful
   to run mypy on a single file.
   - the mypy daemon tracks which files have changed, and runs incremental
   updates.  so if we setup the precommit hook to run the daemon, we should
   see that get appreciably faster.  I'll do some tests and report back.
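
A hypothetical two-file illustration (not Beam code; the module and function names are made up) of why a change in one file forces a whole-codebase check: the signature change lives in one module, but mypy reports the error at an untouched caller.

```python
# shard.py (hypothetical): the argument's type was just changed here.
def shard_name(prefix: str, index: str) -> str:  # previously: index: int
  return prefix + "-" + index.zfill(5)


# writer.py (hypothetical, unmodified): the call site mypy now flags.
name = shard_name("out", 7)  # error: Argument 2 to "shard_name" has
                             # incompatible type "int"; expected "str"
```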


Fwd: Google Summer of Code 2020 Mentor Registration

2020-03-02 Thread Pablo Estrada
FYI

-- Forwarded message -
From: Maxim Solodovnik 
Date: Thu, Feb 27, 2020 at 6:08 PM
Subject: Google Summer of Code 2020 Mentor Registration
To: 


Dear PMCs,

I'm happy to announce that the ASF has made it onto the list of
accepted organizations for
Google Summer of Code 2020! [1,2]

It is now time for mentors to sign up, so please pass this email on to
your community and
podlings. If you aren’t already subscribed to
ment...@community.apache.org you should do so now, or else
you might miss important information.

Mentor signup requires two steps: mentor signup in Google's system [3]
and PMC acknowledgement.

If you want to mentor a project in this year's SoC you will have to

1. Be an Apache committer.
2. Request an acknowledgement from the PMC for which you want to
mentor projects. Use the below
template and *do not forget to copy ment...@community.apache.org*. We
will use the email address you
indicate to send the invite to be a mentor for Apache.

PMCs, read carefully please.

We request that each mentor is acknowledged by a PMC member. This is
to ensure the mentor is in good
standing with the community. When you receive a request for
acknowledgement, please ACK it and cc
ment...@community.apache.org

Lastly, it is not yet too late to record your ideas in Jira (see
previous emails for details).
Students will now begin to explore ideas so if you haven’t already
done so, record your ideas
immediately!

Cheers,

The Apache GSoC Team

mentor request email template:

to: private@.apache.org
cc: ment...@community.apache.org
subject: GSoC 2020 mentor request for 

 PMC,

please acknowledge my request to become a mentor for Google Summer of
Code 2020 projects for Apache
.

I would like to receive the mentor invite to 





[1] https://summerofcode.withgoogle.com/organizations/
[2] https://summerofcode.withgoogle.com/organizations/5919474722537472/
[3] https://summerofcode.withgoogle.com/

-
To unsubscribe, e-mail: dev-unsubscr...@community.apache.org
For additional commands, e-mail: dev-h...@community.apache.org


Re: [VOTE][BIP-1] Beam Schema Options

2020-03-02 Thread Alex Van Boxel
Anyone keen to review this PR:
https://github.com/apache/beam/pull/10413

without this foundation I can't continue with the rest.

 _/
_/ Alex Van Boxel


On Fri, Feb 28, 2020 at 11:40 PM Alex Van Boxel  wrote:

> Thank you everyone for voting: Accepted by majority vote +1 (7 votes, 3
> binding), -1 (0 votes)
>
> I added the information to the document:
>
> https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options
>
> Now that the 2.20 is frozen it's the ideal moment to get the PR of the
> interface in, then I can focus on the different implementations:
> https://github.com/apache/beam/pull/10413
>
>  _/
> _/ Alex Van Boxel
>
>
> On Wed, Feb 26, 2020 at 9:25 PM Maximilian Michels  wrote:
>
>> +1 (binding)
>>
>> Looks like a useful feature to have in schemas and fields. Thank you for
>> the good write-up.
>>
>> -Max
>>
>> On 26.02.20 19:35, Alex Van Boxel wrote:
>> > Nico, thanks for the addition to SQL/MED. I could use another PMC vote
>> > to conclude the vote.
>> >
>> >   _/
>> > _/ Alex Van Boxel
>> >
>> >
>> > On Mon, Feb 24, 2020 at 11:25 PM Kenneth Knowles > > > wrote:
>> >
>> > +1 (binding)
>> >
>> > Added a link to the wiki proposal for SQL/MED (Management of
>> > External Data) which treats some of the same ideas.
>> >
>> > Kenn
>> >
>> > On Fri, Feb 21, 2020 at 3:02 PM Brian Hulette > > > wrote:
>> >
>> > +1 (non-binding) thanks for all your work on this Alex :)
>> >
>> > On Fri, Feb 21, 2020 at 6:50 AM Alex Van Boxel  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > I assume I can vote on my own proposal :-)
>> >
>> >   _/
>> > _/ Alex Van Boxel
>> >
>> >
>> > On Fri, Feb 21, 2020 at 6:36 AM Jean-Baptiste Onofre  wrote:
>> >
>> > +1 (binding)
>> >
>> > Very interesting. It remembers me when we start the
>> > discussion about schema support ;)
>> >
>> > Regards
>> > JB
>> >
>> >> On 20 Feb 2020 at 08:36, Alex Van Boxel  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> let's do a vote on the very first Beam Improvement
>> >> Proposal. If you have a -1 or -1 (binding) please add
>> >> your concern to the open issues section to the wiki.
>> >> Thanks.
>> >>
>> >> This is the proposal:
>> >>
>> https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options
>> >>
>> >> Can I have your votes.
>> >>
>> >>  _/
>> >> _/ Alex Van Boxel
>> >
>>
>


Re: Python Static Typing: Next Steps

2020-03-02 Thread Udi Meiri
Let's go forward with this and see. I volunteer to help as well.

I believe that mypy via pre-commit hook will be faster than 10s since it
only applies to modified files.

On Mon, Mar 2, 2020 at 10:53 AM Robert Bradshaw  wrote:

> +1
>
> We should enable this on jenkins, plus trivial instructions (ideally a
> one-liner tox command) to run it locally. Hopefully the errors will be
> easy enough for contributors to figure out (in particular local to and
> commensurate in complexity with the code that they're editing), and I
> agree it's the only way to keep them accurate (which is a net positive
> for tooling and developers).
>
> Running it as part of a pre-commit hook could be discussed once we
> have a bit more experience (but 10s is certainly on the long side).
>
> On Mon, Mar 2, 2020 at 10:01 AM Luke Cwik  wrote:
> >
> > +1
> >
> > The typing information has really helped me several times figuring out
> API contracts and expected types.
> >
> > On Mon, Mar 2, 2020 at 9:54 AM Pablo Estrada  wrote:
> >>
> >> I am in favor of enabling the test, and also am happy to start
> answering questions too.
> >> Thanks so much Chad for leading this.
> >> Best
> >> -P.
> >>
> >> On Mon, Mar 2, 2020 at 9:44 AM Chad Dombrova  wrote:
> >>>
> >>> Good news everyone!
> >>> We nearly have the full beam codebase passing in mypy.
> >>>
> >>> As we are now approaching the zero-error event horizon, I'd like to
> open up a discussion around enabling mypy in the PythonLint job.  Every day
> or so a PR is merged that introduces some new mypy errors, so enabling this
> test is the only way I see to keep the annotations accurate and thus useful.
> >>>
> >>> Developer fatigue is a real concern here, since static typing has a
> steep learning curve, and there are still not a lot of experts to help
> consult on PRs.  Here are some things that I hope will mitigate those
> concerns:
> >>>
> >>> We have a lot of typing coverage, so that means plenty of examples of
> how to solve different types of problems
> >>> Running mypy only takes 10 seconds to complete (if you execute it
> outside of gradle / tox), and that will get better when we get to 0
> errors.  Also, running mypy in daemon mode should speed that up even more
> >>> I have a PR[1] to allow developers to easily (and optionally) setup
> yapf to run in a local git pre-commit hook;  I'd like to do the same for
> mypy.
> >>> I will make myself and members of my team available to help out with
> typing questions in PRs
> >>>
> >>> Is there anyone else on the list who is knowledgable about python
> static typing who would like to volunteer to be flagged on typing questions?
> >>>
> >>> What else can we do to make this transition easier?
> >>>
> >>> [1] https://github.com/apache/beam/pull/10810
> >>>
> >>> -chad
> >>>
>




Re: Python Static Typing: Next Steps

2020-03-02 Thread Robert Bradshaw
+1

We should enable this on jenkins, plus trivial instructions (ideally a
one-liner tox command) to run it locally. Hopefully the errors will be
easy enough for contributors to figure out (in particular local to and
commensurate in complexity with the code that they're editing), and I
agree it's the only way to keep them accurate (which is a net positive
for tooling and developers).

Running it as part of a pre-commit hook could be discussed once we
have a bit more experience (but 10s is certainly on the long side).

On Mon, Mar 2, 2020 at 10:01 AM Luke Cwik  wrote:
>
> +1
>
> The typing information has really helped me several times figuring out
> API contracts and expected types.
>
> On Mon, Mar 2, 2020 at 9:54 AM Pablo Estrada  wrote:
>>
>> I am in favor of enabling the test, and also am happy to start answering 
>> questions too.
>> Thanks so much Chad for leading this.
>> Best
>> -P.
>>
>> On Mon, Mar 2, 2020 at 9:44 AM Chad Dombrova  wrote:
>>>
>>> Good news everyone!
>>> We nearly have the full beam codebase passing in mypy.
>>>
>>> As we are now approaching the zero-error event horizon, I'd like to open up 
>>> a discussion around enabling mypy in the PythonLint job.  Every day or so a 
>>> PR is merged that introduces some new mypy errors, so enabling this test is 
>>> the only way I see to keep the annotations accurate and thus useful.
>>>
>>> Developer fatigue is a real concern here, since static typing has a steep
>>> learning curve, and there are still not a lot of experts to help consult on 
>>> PRs.  Here are some things that I hope will mitigate those concerns:
>>>
>>> We have a lot of typing coverage, so that means plenty of examples of how to
>>> solve different types of problems
>>> Running mypy only takes 10 seconds to complete (if you execute it outside 
>>> of gradle / tox), and that will get better when we get to 0 errors.  Also, 
>>> running mypy in daemon mode should speed that up even more
>>> I have a PR[1] to allow developers to easily (and optionally) setup yapf to 
>>> run in a local git pre-commit hook;  I'd like to do the same for mypy.
>>> I will make myself and members of my team available to help out with typing 
>>> questions in PRs
>>>
>>> Is there anyone else on the list who is knowledgable about python static 
>>> typing who would like to volunteer to be flagged on typing questions?
>>>
>>> What else can we do to make this transition easier?
>>>
>>> [1] https://github.com/apache/beam/pull/10810
>>>
>>> -chad
>>>


Re: Python Static Typing: Next Steps

2020-03-02 Thread Luke Cwik
+1

The typing information has really helped me several times figuring out
API contracts and expected types.

On Mon, Mar 2, 2020 at 9:54 AM Pablo Estrada  wrote:

> I am in favor of enabling the test, and also am happy to start answering
> questions too.
> Thanks so much Chad for leading this.
> Best
> -P.
>
> On Mon, Mar 2, 2020 at 9:44 AM Chad Dombrova  wrote:
>
>> Good news everyone!
>> We nearly have the full beam codebase passing in mypy.
>>
>> As we are now approaching the zero-error event horizon, I'd like to open
>> up a discussion around enabling mypy in the PythonLint job.  Every day or
>> so a PR is merged that introduces some new mypy errors, so enabling this
>> test is the only way I see to keep the annotations accurate and thus useful.
>>
>> Developer fatigue is a real concern here, since static typing has a
>> steep learning curve, and there are still not a lot of experts to help
>> consult on PRs.  Here are some things that I hope will mitigate those
>> concerns:
>>
>>- We have a lot of typing coverage, so that means plenty of examples
>>of how to solve different types of problems
>>- Running mypy only takes 10 seconds to complete (if you execute it
>>outside of gradle / tox), and that will get better when we get to 0
>>errors.  Also, running mypy in daemon mode should speed that up even more
>>- I have a PR[1] to allow developers to easily (and optionally) setup
>>yapf to run in a local git pre-commit hook;  I'd like to do the same for
>>mypy.
>>- I will make myself and members of my team available to help out
>>with typing questions in PRs
>>
>> Is there anyone else on the list who is knowledgable about python static
>> typing who would like to volunteer to be flagged on typing questions?
>>
>> What else can we do to make this transition easier?
>>
>> [1] https://github.com/apache/beam/pull/10810
>>
>> -chad
>>
>>


Re: Python Static Typing: Next Steps

2020-03-02 Thread Pablo Estrada
I am in favor of enabling the test, and also am happy to start answering
questions too.
Thanks so much Chad for leading this.
Best
-P.

On Mon, Mar 2, 2020 at 9:44 AM Chad Dombrova  wrote:

> Good news everyone!
> We nearly have the full beam codebase passing in mypy.
>
> As we are now approaching the zero-error event horizon, I'd like to open
> up a discussion around enabling mypy in the PythonLint job.  Every day or
> so a PR is merged that introduces some new mypy errors, so enabling this
> test is the only way I see to keep the annotations accurate and thus useful.
>
> Developer fatigue is a real concern here, since static typing has a
> steep learning curve, and there are still not a lot of experts to help
> consult on PRs.  Here are some things that I hope will mitigate those
> concerns:
>
>- We have a lot of typing coverage, so that means plenty of examples of
>how to solve different types of problems
>- Running mypy only takes 10 seconds to complete (if you execute it
>outside of gradle / tox), and that will get better when we get to 0
>errors.  Also, running mypy in daemon mode should speed that up even more
>- I have a PR[1] to allow developers to easily (and optionally) setup
>yapf to run in a local git pre-commit hook;  I'd like to do the same for
>mypy.
>- I will make myself and members of my team available to help out with
>typing questions in PRs
>
> Is there anyone else on the list who is knowledgable about python static
> typing who would like to volunteer to be flagged on typing questions?
>
> What else can we do to make this transition easier?
>
> [1] https://github.com/apache/beam/pull/10810
>
> -chad
>
>


Python Static Typing: Next Steps

2020-03-02 Thread Chad Dombrova
Good news everyone!
We nearly have the full beam codebase passing in mypy.

As we are now approaching the zero-error event horizon, I'd like to open up
a discussion around enabling mypy in the PythonLint job.  Every day or so a
PR is merged that introduces some new mypy errors, so enabling this test is
the only way I see to keep the annotations accurate and thus useful.

Developer fatigue is a real concern here, since static typing has a steep
learning curve, and there are still not a lot of experts to help consult on
PRs.  Here are some things that I hope will mitigate those concerns:

   - We have a lot of typing coverage, so that means plenty of examples of
   how to solve different types of problems
   - Running mypy only takes 10 seconds to complete (if you execute it
   outside of gradle / tox), and that will get better when we get to 0
   errors.  Also, running mypy in daemon mode should speed that up even more
   - I have a PR[1] to allow developers to easily (and optionally) setup
   yapf to run in a local git pre-commit hook;  I'd like to do the same for
   mypy.
   - I will make myself and members of my team available to help out with
   typing questions in PRs

Is there anyone else on the list who is knowledgable about python static
typing who would like to volunteer to be flagged on typing questions?

What else can we do to make this transition easier?

[1] https://github.com/apache/beam/pull/10810

-chad


Re: Beam Emitted Metrics Reference

2020-03-02 Thread Alex Amato
MonitoringInfoSpecs is effectively a list of metrics,
but its purpose is to simply define how SDKs should populate MonitoringInfo
protos for a RunnerHarness to interpret.

These metrics are provided by the Java and Python SDKs, and Go will soon
provide all of them as well.

But there is no requirement for a particular runner to support any of these
metrics. The DataflowRunner will support these and the metrics are
accessible via Dataflow APIs. I am not sure of the state of other runners.




On Mon, Mar 2, 2020 at 2:47 AM Etienne Chauchot 
wrote:

> Hi,
>
> There is a doc about metrics here:
> https://beam.apache.org/documentation/programming-guide/#metrics
>
> You can also export the metrics to sinks (REST http endpoint and
> Graphite), see MetricsOptions class for configuration.
>
> Still, there is no doc for export on website, I'll add some
>
> Best
>
> Etienne
> On 28/02/2020 01:07, Pablo Estrada wrote:
>
> Hi Daniel!
> I think +Alex Amato  had tried to have an inventory
> of metrics at some point.
> Other than that, I don't think we have a document outlining them.
>
> Can you talk about what you plan to do with them? Do you plan to export
> them somehow? Do you plan to add your own?
> Best
> -P.
>
> On Thu, Feb 27, 2020 at 11:33 AM Daniel Chen  wrote:
>
>> Hi all,
>>
>> I have some questions about the reference to the framework metrics emitted by
>> Beam. I would like to leverage these metrics to allow better monitoring of
>> my Beam jobs but cannot find any references to the description or a
>> complete set of emitted metrics.
>>
>> Do we have this information documented anywhere?
>>
>> Thanks,
>> Daniel
>>
>


Re: JdbcIO for writing to Dynamic Schemas in Postgres

2020-03-02 Thread Jean-Baptiste Onofre
Hi

You have the setPrepareStatement() method where you define the target tables.
However, it’s in the same database (datasource) per pipeline.

You can define several datasources and use a different datasource in each 
JdbcIO write. Meaning that you can divide the work into sub-pipelines.

Regards
JB

> On 29 Feb 2020 at 17:52, Vasu Gupta  wrote:
> 
> Hey folks,
> 
> Can we use JdbcIO for writing data to multiple schemas (for a Postgres database)
> dynamically using the Apache Beam Java framework? Currently, I can't find any
> property that I could set on the JdbcIO transform for providing a schema, or maybe
> I am missing something.
> 
> Thanks



Re: GroupIntoBatches not Working properly for Direct Runner Java

2020-03-02 Thread Vasu Gupta
Input : a-1, Timestamp : 1582994620366
Input : c-2, Timestamp : 1582994620367
Input : e-3, Timestamp : 1582994620367
Input : d-4, Timestamp : 1582994620367
Input : e-5, Timestamp : 1582994620367
Input : b-6, Timestamp : 1582994620368
Input : a-7, Timestamp : 1582994620368

Output : Timestamp : 1582994620367, Key : e-3,5
Output : Timestamp : 1582994620368, Key : a-1,7

As you can see c-2 and d-4 are missing and I never received these packets.

On 2020/02/28 18:15:03, Kenneth Knowles  wrote: 
> What are the timestamps on the elements?
> 
> On Fri, Feb 28, 2020 at 8:36 AM Vasu Gupta  wrote:
> 
> > Edit: Issue is on Direct Runner(Not Direction Runner - mistyped)
> > Issue Details:
> > Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1, e-4, e-5
> > Batch Size: 5
> > Expected output: a-1,4, b-3, c-5, d-1, e-4,5
> > Getting Packets with irregular size like a-1, b-5, e-4,5 OR a-1,4, c-5 etc
> > But i always got correct number of packets with BATCH_SIZE = 1
> >
> > On 2020/02/27 20:40:16, Kenneth Knowles  wrote:
> > > Can you share some more details? What is the expected output and what
> > > output are you seeing?
> > >
> > > On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta 
> > wrote:
> > >
> > > > Hey folks, I am using Apache beam Framework in Java with Direction
> > Runner
> > > > for local testing purposes. When using GroupIntoBatches with batch
> > size 1
> > > > it works perfectly fine i.e. the output of the transform is consistent
> > and
> > > > as expected. But when using with batch size > 1 the output Pcollection
> > has
> > > > less data than it should be.
> > > >
> > > > Pipeline flow:
> > > > 1. A Transform for reading from pubsub
> > > > 2. Transform for making a KV out of the data
> > > > 3. A Fixed Window transform of 1 second
> > > > 4. Applying GroupIntoBatches transform
> > > > 5. And last, Logging the resulting Iterables.
> > > >
> > > Weird thing is that batch_size > 1 works great when running on
> > > > DataflowRunner but not with DirectRunner. I think the issue might be
> > with
> > > > Timer Expiry since GroupIntoBatches uses BagState internally.
> > > >
> > > > Any help will be much appreciated.
> > > >
> > >
> >
> 


JdbcIO for writing to Dynamic Schemas in Postgres

2020-03-02 Thread Vasu Gupta
Hey folks,

Can we use JdbcIO for writing data to multiple schemas (for a Postgres database)
dynamically using the Apache Beam Java framework? Currently, I can't find any
property that I could set on the JdbcIO transform for providing a schema, or maybe
I am missing something.

Thanks


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-03-02 Thread Elias Djurfeldt
Congrats Kamil!!

On Mon, 2 Mar 2020 at 16:16, Karolina Rosół 
wrote:

> Congratulations Kamil! Well deserved :-)
>
> Karolina Rosół
> Polidea  | Project Manager
>
> M: +48 606 630 236 <+48606630236>
> E: karolina.ro...@polidea.com
>
> Check out our projects!
>
>
> On Mon, Mar 2, 2020 at 10:43 AM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
>
>> Thank you all! I am very happy to be a part of the community :)
>>
>>
>> On Mon, Mar 2, 2020 at 9:45 AM Ryan Skraba  wrote:
>>
>>> Congratulations Kamil!
>>>
>>> On Mon, Mar 2, 2020 at 8:06 AM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Congratulations!

 On Sun, Mar 1, 2020 at 2:55 AM Reza Rokni  wrote:

> Congratulations Kamil
>
> On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:
>
>> Welcome Kamil!
>>
>> On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:
>>
>>> Congrats, Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía 
>>> wrote:
>>>
 Congratulations Kamil!

 On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang 
 wrote:

> Congrats, Kamil!
>
> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congratulations, Kamil!
>>
>> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Kamil Wasilewski
>>>
>>> Kamil has contributed to Beam in many ways, including the
>>> performance testing infrastructure, and a custom BQ source, along 
>>> with
>>> other contributions.
>>>
>>> In consideration of his contributions, the Beam PMC trusts him
>>> with the responsibilities of a Beam committer[1].
>>>
>>> Thanks for your contributions Kamil!
>>>
>>> Pablo, on behalf of the Apache Beam PMC.
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>>

 --

 Michał Walenia
 Polidea  | Software Engineer

 M: +48 791 432 002 <+48791432002>
 E: michal.wale...@polidea.com

 Unique Tech
 Check out our projects! 

>>>


Re: Permission to self-assign JIRAs

2020-03-02 Thread Luke Cwik
Welcome, you have been added.

On Mon, Mar 2, 2020 at 3:57 AM Jozef Vilcek  wrote:

> Can I please get a permission in JIRA for `jvilcek` user to self assign
> JIRAs?
>


Upcoming Apache Beam meetups in Warsaw

2020-03-02 Thread Karolina Rosół
Hi everyone,

I'm Project Manager at Polidea and work closely with three Apache Beam
committers (Katarzyna Kucharczyk, Kamil Wasilewski and Michał Walenia).

Together with folks from Polidea we'd like to announce our plans towards
the upcoming Apache Beam meetups in Warsaw. The next date for the Beam
meetup we're considering is March 26th (Thursday).

Does any of you want to be a speaker? If yes, let me know.

The event has not been announced yet because we're struggling a little bit
with finding the right people to carry out the talks :-)

Thanks,

Karolina Rosół
Polidea  | Project Manager

M: +48 606 630 236 <+48606630236>
E: karolina.ro...@polidea.com

Check out our projects!



Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-03-02 Thread Karolina Rosół
Congratulations Kamil! Well deserved :-)

Karolina Rosół
Polidea  | Project Manager

M: +48 606 630 236 <+48606630236>
E: karolina.ro...@polidea.com

Check out our projects!



On Mon, Mar 2, 2020 at 10:43 AM Kamil Wasilewski <
kamil.wasilew...@polidea.com> wrote:

> Thank you all! I am very happy to be a part of the community :)
>
>
> On Mon, Mar 2, 2020 at 9:45 AM Ryan Skraba  wrote:
>
>> Congratulations Kamil!
>>
>> On Mon, Mar 2, 2020 at 8:06 AM Michał Walenia 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Sun, Mar 1, 2020 at 2:55 AM Reza Rokni  wrote:
>>>
 Congratulations Kamil

 On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:

> Welcome Kamil!
>
> On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:
>
>> Congrats, Kamil!
>>
>> On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía 
>> wrote:
>>
>>> Congratulations Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang 
>>> wrote:
>>>
 Congrats, Kamil!

 On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations, Kamil!
>
> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
> wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Kamil Wasilewski
>>
>> Kamil has contributed to Beam in many ways, including the
>> performance testing infrastructure, and a custom BQ source, along 
>> with
>> other contributions.
>>
>> In consideration of his contributions, the Beam PMC trusts him
>> with the responsibilities of a Beam committer[1].
>>
>> Thanks for your contributions Kamil!
>>
>> Pablo, on behalf of the Apache Beam PMC.
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>


Re: KafkaIO: Configurable timeout for setupInitialOffset()

2020-03-02 Thread Jozef Vilcek
Thanks Ismael!

On Mon, Mar 2, 2020 at 2:15 PM Ismaël Mejía  wrote:

> Done, also assigned the issue you mentioned in the previous email to you.
>
> On Mon, Mar 2, 2020 at 12:56 PM Jozef Vilcek 
> wrote:
>
>> Recently I had a problem with Beam pipeline unable to start due to
>> unhealthy broker in the list of configured bootstrap servers. I have
>> created a JIRA for it and plan to work on the fix.
>>
>> https://issues.apache.org/jira/browse/BEAM-9420
>>
>> Please let me know in case it does not make sense or should be addressed
>> somehow else.
>>
>> Thanks,
>> Jozef
>>
>


Re: KafkaIO: Configurable timeout for setupInitialOffset()

2020-03-02 Thread Ismaël Mejía
Done, also assigned the issue you mentioned in the previous email to you.

On Mon, Mar 2, 2020 at 12:56 PM Jozef Vilcek  wrote:

> Recently I had a problem with Beam pipeline unable to start due to
> unhealthy broker in the list of configured bootstrap servers. I have
> created a JIRA for it and plan to work on the fix.
>
> https://issues.apache.org/jira/browse/BEAM-9420
>
> Please let me know in case it does not make sense or should be addressed
> somehow else.
>
> Thanks,
> Jozef
>


Beam Dependency Check Report (2020-03-02)

2020-03-02 Thread Apache Jenkins Server
ERROR: File 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not exist

Permission to self-assign JIRAs

2020-03-02 Thread Jozef Vilcek
Can I please get a permission in JIRA for `jvilcek` user to self assign
JIRAs?


KafkaIO: Configurable timeout for setupInitialOffset()

2020-03-02 Thread Jozef Vilcek
Recently I had a problem with Beam pipeline unable to start due to
unhealthy broker in the list of configured bootstrap servers. I have
created a JIRA for it and plan to work on the fix.

https://issues.apache.org/jira/browse/BEAM-9420

Please let me know in case it does not make sense or should be addressed
somehow else.

Thanks,
Jozef


Re: Beam Emitted Metrics Reference

2020-03-02 Thread Etienne Chauchot

Hi,

There is a doc about metrics here: 
https://beam.apache.org/documentation/programming-guide/#metrics


You can also export the metrics to sinks (REST http endpoint and 
Graphite), see MetricsOptions class for configuration.


Still, there is no doc for export on website, I'll add some
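
For reference, a minimal sketch of defining a user metric in the Python SDK and reading it back from the pipeline result (the pipeline, DoFn, and metric names here are purely illustrative):

```python
# Illustrative only: a user counter incremented in a DoFn and queried
# from the PipelineResult after the run completes.
import apache_beam as beam
from apache_beam.metrics.metric import Metrics, MetricsFilter


class CountElements(beam.DoFn):
  def __init__(self):
    self.counter = Metrics.counter(self.__class__, 'elements')

  def process(self, element):
    self.counter.inc()
    yield element


with beam.Pipeline() as p:
  _ = p | beam.Create(['a', 'b', 'c']) | beam.ParDo(CountElements())

# The with-block waits for the run to finish, so metrics are queryable here.
counters = p.result.metrics().query(
    MetricsFilter().with_name('elements'))['counters']
for c in counters:
  print(c.key.metric.namespace, c.key.metric.name, c.committed)
```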

Best

Etienne

On 28/02/2020 01:07, Pablo Estrada wrote:

Hi Daniel!
I think +Alex Amato  had tried to have an 
inventory of metrics at some point.

Other than that, I don't think we have a document outlining them.

Can you talk about what you plan to do with them? Do you plan to 
export them somehow? Do you plan to add your own?

Best
-P.

On Thu, Feb 27, 2020 at 11:33 AM Daniel Chen > wrote:


Hi all,

I have some questions about the reference to the framework metrics
emitted by Beam. I would like to leverage these metrics to allow
better monitoring of my Beam jobs but cannot find any references
to the description or a complete set of emitted metrics.

Do we have this information documented anywhere?

Thanks,
Daniel



Re: GroupIntoBatches not Working properly for Direct Runner Java

2020-03-02 Thread Etienne Chauchot

Hi,

+1 to what Kenn asked: your pipeline is in streaming mode and GIB
preserves windowing; the elements are buffered until one of these
conditions is true: batch size reached or end of window. In your case I
think it is the second one.
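
A rough sketch of that behaviour in the Python SDK (the thread's pipeline is Java, but the buffering semantics are the same; the data and timestamps below are illustrative): with 1-second fixed windows and a batch size of 5, a per-key batch is emitted as soon as it holds 5 elements or when its window closes, whichever comes first.

```python
# Sketch only: batches flush at batch size or at window end, per key and window.
import apache_beam as beam
from apache_beam.transforms.util import GroupIntoBatches
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([('a', 1), ('c', 2), ('e', 3), ('e', 5), ('a', 7)])
      # Attach event timestamps so the fixed windows have something to act on.
      | beam.Map(lambda kv: TimestampedValue(kv, 1582994620 + kv[1] / 1000.0))
      | beam.WindowInto(FixedWindows(1))
      # No key reaches 5 elements here, so each batch is emitted at window end.
      | GroupIntoBatches(5)
      | beam.Map(print))
```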


Best

Etienne

On 28/02/2020 19:15, Kenneth Knowles wrote:

What are the timestamps on the elements?

On Fri, Feb 28, 2020 at 8:36 AM Vasu Gupta > wrote:


Edit: Issue is on Direct Runner(Not Direction Runner - mistyped)
Issue Details:
Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1,
e-4, e-5
Batch Size: 5
Expected output: a-1,4, b-3, c-5, d-1, e-4,5
Getting Packets with irregular size like a-1, b-5, e-4,5 OR a-1,4,
c-5 etc
But i always got correct number of packets with BATCH_SIZE = 1

On 2020/02/27 20:40:16, Kenneth Knowles  wrote:
> Can you share some more details? What is the expected output and
what
> output are you seeing?
>
> On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta  wrote:
>
> > Hey folks, I am using Apache beam Framework in Java with
Direction Runner
> > for local testing purposes. When using GroupIntoBatches with
batch size 1
> > it works perfectly fine i.e. the output of the transform is
consistent and
> > as expected. But when using with batch size > 1 the output
Pcollection has
> > less data than it should be.
> >
> > Pipeline flow:
> > 1. A Transform for reading from pubsub
> > 2. Transform for making a KV out of the data
> > 3. A Fixed Window transform of 1 second
> > 4. Applying GroupIntoBatches transform
> > 5. And last, Logging the resulting Iterables.
> >
> > Weird thing is that batch_size > 1 works great when running on
> > DataflowRunner but not with DirectRunner. I think the issue
might be with
> > Timer Expiry since GroupIntoBatches uses BagState internally.
> >
> > Any help will be much appreciated.
> >
>



Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-03-02 Thread Kamil Wasilewski
Thank you all! I am very happy to be a part of the community :)


On Mon, Mar 2, 2020 at 9:45 AM Ryan Skraba  wrote:

> Congratulations Kamil!
>
> On Mon, Mar 2, 2020 at 8:06 AM Michał Walenia 
> wrote:
>
>> Congratulations!
>>
>> On Sun, Mar 1, 2020 at 2:55 AM Reza Rokni  wrote:
>>
>>> Congratulations Kamil
>>>
>>> On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:
>>>
 Welcome Kamil!

 On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:

> Congrats, Kamil!
>
> On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía 
> wrote:
>
>> Congratulations Kamil!
>>
>> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang 
>> wrote:
>>
>>> Congrats, Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Congratulations, Kamil!

 On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
 wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Kamil Wasilewski
>
> Kamil has contributed to Beam in many ways, including the
> performance testing infrastructure, and a custom BQ source, along with
> other contributions.
>
> In consideration of his contributions, the Beam PMC trusts him
> with the responsibilities of a Beam committer[1].
>
> Thanks for your contributions Kamil!
>
> Pablo, on behalf of the Apache Beam PMC.
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>
>>
>> --
>>
>> Michał Walenia
>> Polidea  | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! 
>>
>


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-03-02 Thread Ryan Skraba
Congratulations Kamil!

On Mon, Mar 2, 2020 at 8:06 AM Michał Walenia 
wrote:

> Congratulations!
>
> On Sun, Mar 1, 2020 at 2:55 AM Reza Rokni  wrote:
>
>> Congratulations Kamil
>>
>> On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:
>>
>>> Welcome Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:
>>>
 Congrats, Kamil!

 On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía 
 wrote:

> Congratulations Kamil!
>
> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:
>
>> Congrats, Kamil!
>>
>> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Congratulations, Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
>>> wrote:
>>>
 Hi everyone,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Kamil Wasilewski

 Kamil has contributed to Beam in many ways, including the
 performance testing infrastructure, and a custom BQ source, along with
 other contributions.

 In consideration of his contributions, the Beam PMC trusts him with
 the responsibilities of a Beam committer[1].

 Thanks for your contributions Kamil!

 Pablo, on behalf of the Apache Beam PMC.

 [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>