Re: Jira tracker permission

2019-06-03 Thread Pablo Estrada
I've added you as a contributor - welcome
-P.

On Mon, Jun 3, 2019, 9:16 PM Yichi Zhang  wrote:

> Hi, beam-dev,
>
> This is Yichi Zhang from Google. I just started looking into Beam projects
> and will be actively working on the Beam SDK. Could someone grant me
> permission to the Beam Jira issue tracker? My Jira username is yichi.
>
> Looking forward to working with everyone.
>
> Thanks,
> Yichi
>


Re: Timer support in Flink

2019-06-03 Thread Melissa Pashniak
Yeah, people's eyes likely jump to the big "What is being computed?" header
first and skip the small font "expand details" (that's what my eyes did
anyway!) Even just moving the expand/collapse to be AFTER the header of the
table (or down to the next line) and making the font bigger might help a
lot. And maybe making the text more explicit: "Click to expand for more
details".

I'm traveling right now so can't take an in-depth look, but this might be
doable by changing the order of things in [1] and the font size in [2].
I'll add this info to the JIRA also.

[1]
https://github.com/apache/beam/blame/master/website/src/_includes/capability-matrix.md#L18
[2]
https://github.com/apache/beam/blob/master/website/src/_sass/capability-matrix.scss#L130


On Mon, Jun 3, 2019 at 2:15 AM Maximilian Michels  wrote:

> Good point. I think I discovered the detailed view when I made changes
> to the source code. Classic tunnel-vision problem :)
>
> On 30.05.19 12:57, Reza Rokni wrote:
> > :-)
> >
> > https://issues.apache.org/jira/browse/BEAM-7456
> >
> > On Thu, 30 May 2019 at 18:41, Alex Van Boxel  wrote:
> >
> > Oh... you can expand the matrix. Never saw that, this could indeed
> > be better. So it isn't you.
> >
> >   _/
> > _/ Alex Van Boxel
> >
> >
> > On Thu, May 30, 2019 at 12:24 PM Reza Rokni  wrote:
> >
> > PS, until it was just pointed out to me by Max, I had missed the
> > (expand details) clickable link in the capability matrix.
> >
> > Probably just me, but do others think it's also easy to miss? If
> > yes I will raise a Jira for it
> >
> > On Wed, 29 May 2019 at 19:52, Reza Rokni  wrote:
> >
> > Thanx Max!
> >
> > Reza
> >
> > On Wed, 29 May 2019, 16:38 Maximilian Michels  wrote:
> >
> > Hi Reza,
> >
> > The detailed view of the capability matrix states: "The
> > Flink Runner
> > supports timers in non-merging windows."
> >
> > That is still the case. Other than that, timers should
> > be working fine.
> >
> >  > It makes very heavy use of Event.Time timers and has
> > to do some manual DoFn cache work to get around some
> > O(heavy) issues.
> >
> > If you are running on Flink 1.5, timer deletion suffers
> > from O(n)
> > complexity which has been fixed in newer versions.
> >
> > Cheers,
> > Max
> >
> > On 29.05.19 03:27, Reza Rokni wrote:
> >  > Hi Flink experts,
> >  >
> >  > I am getting ready to push a PR around a utility
> > class for timeseries join
> >  >
> >  > left.timestamp match to closest right.timestamp where
> > right.timestamp <=
> >  > left.timestamp.
> >  >
> >  > It makes very heavy use of Event.Time timers and has
> > to do some manual
> >  > DoFn cache work to get around some O(heavy) issues.
> > Wanted to test
> >  > things against Flink: In the capability matrix we
> > have "~" for Timer
> >  > support in Flink:
> >  >
> >  >
> >
> https://beam.apache.org/documentation/runners/capability-matrix/
> >  >
> >  > Is that page outdated, if not what are the areas that
> > still need to be
> >  > addressed please?
> >  >
> >  > Cheers
> >  >
> >  > Reza
> >  >
> >  >
> >  > --
> >  >
> >  > This email may be confidential and privileged. If you
> > received this
> >  > communication by mistake, please don't forward it to
> > anyone else, please
> >  > erase all copies and attachments, and please let me
> > know that it has
> >  > gone to the wrong person.
> >  >
> >  > The above terms reflect a potential business
> > arrangement, are provided
> >  > solely as a basis for further discussion, and are not
> > intended to be and
> >  > do not constitute a legally binding obligation. No
> > legally binding
> >  > obligations will be created, implied, or inferred
> > until an agreement in
> >  > final form is executed in writing by all parties
> > involved.
> > 
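The matching rule Reza describes ("left.timestamp match to closest right.timestamp where right.timestamp <= left.timestamp") can be pinned down with a short sketch. This is plain illustrative Python, not code from the PR; the function name and data layout are invented:

```python
# Hypothetical sketch: for each left-hand element, find the right-hand
# element with the largest timestamp that is still <= left.timestamp.
# A DoFn doing this per key would hold the sorted right-hand elements in
# cached state, which is where the "manual DoFn cache work" comes in.
import bisect

def closest_at_or_before(left_ts, right_elements):
    """right_elements: list of (timestamp, value) pairs sorted by timestamp."""
    timestamps = [ts for ts, _ in right_elements]
    # bisect_right returns the insertion point; the element just before it
    # has the largest timestamp <= left_ts.
    i = bisect.bisect_right(timestamps, left_ts)
    if i == 0:
        return None  # no right-hand element at or before left_ts
    return right_elements[i - 1]

rights = [(2, 'a'), (5, 'b'), (9, 'c')]
print(closest_at_or_before(6, rights))  # (5, 'b')
print(closest_at_or_before(1, rights))  # None
```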

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Reuven Lax
On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette  wrote:

> > It has to go into the proto somewhere (since that's the only way the
> SDK can get it), but I'm not sure they should be considered integral parts
> of the type.
> Are you just advocating for an approach where any SDK-specific information
> is stored outside of the Schema message itself so that Schema really does
> just represent the type? That seems reasonable to me, and alleviates my
> concerns about how this applies to columnar encodings a bit as well.
>

Yes, that's exactly what I'm advocating.


>
> We could lift all of the LogicalTypeConversion messages out of the Schema
> and the LogicalType like this:
>
> message SchemaCoder {
>   Schema schema = 1;
>   LogicalTypeConversion root_conversion = 2;
>   map<string, LogicalTypeConversion> attribute_conversions = 3; // only
> necessary for user type aliases, portable logical types by definition have
> nothing SDK-specific
> }
>

I'm not sure what the map is for? I think we have status quo without it.


>
> I think a critical question (that has implications for the above proposal)
> is how/if the two different concepts Kenn mentioned are allowed to nest.
> For example, you could argue it's redundant to have a user type alias that
> has a Row representation with a field that is itself a user type alias,
> because instead you could just have a single top-level type alias
> with to/from functions that pack and unpack the entire hierarchy. On the
> other hand, I think it does make sense for a user type alias or a truly
> portable logical type to have a field that is itself a truly portable
> logical type (e.g. a user type alias or portable type with a DateTime).
>
> I've been assuming that user-type aliases could be nested, but should we
> disallow that? Or should we go the other way and require that logical types
> define at most one "level"?
>

No I think it's useful to allow things to be nested (though of course the
nesting must terminate).
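As a toy illustration of the nesting being discussed here (a user type alias whose field is itself an alias, with conversions that terminate at built-in types), consider the following. The names are invented for illustration and are not Beam's API:

```python
# Each alias's to/from functions pack/unpack one level; the outer alias
# delegates to the inner one, so the conversion chain bottoms out at a
# plain representation built only from built-in types.
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])      # inner user type alias
Segment = namedtuple('Segment', ['a', 'b'])  # outer user type alias

# to/from for the inner alias: Point <-> (float, float) row
def point_to_row(p):
    return (p.x, p.y)

def point_from_row(r):
    return Point(*r)

# to/from for the outer alias delegates to the inner conversions
def segment_to_row(s):
    return (point_to_row(s.a), point_to_row(s.b))

def segment_from_row(r):
    return Segment(point_from_row(r[0]), point_from_row(r[1]))

s = Segment(Point(0.0, 1.0), Point(2.0, 3.0))
row = segment_to_row(s)            # ((0.0, 1.0), (2.0, 3.0)) -- pure built-ins
assert segment_from_row(row) == s  # round-trips
```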


>
> Brian
>
> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles  wrote:
>
>>
>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax  wrote:
>>
>>> So I feel a bit leery about making the to/from functions a fundamental
>>> part of the portability representation. In my mind, that is very tied to a
>>> specific SDK/language. A SDK (say the Java SDK) wants to allow users to use
>>> a wide variety of native types with schemas, and under the covers uses the
>>> to/from functions to implement that. However from the portable Beam
>>> perspective, the schema itself should be the real "type" of the
>>> PCollection; the to/from methods are simply a way that a particular SDK
>>> makes schemas easier to use. It has to go into the proto somewhere (since
>>> that's the only way the SDK can get it), but I'm not sure they should be
>>> considered integral parts of the type.
>>>
>>
>> On the doc in a couple places this distinction was made:
>>
>> * For truly portable logical types, no instructions for the SDK are
>> needed. Instead, they require:
>>- URN: a standardized identifier any SDK can recognize
>>- A spec: what is the universe of values in this type?
>>- A representation: how is it represented in built-in types? This is
>> how SDKs who do not know/care about the URN will process it
>>- (optional): SDKs choose preferred SDK-specific types to embed the
>> values in. SDKs have to know about the URN and choose for themselves.
>>
>> *For user-level type aliases, written as convenience by the user in their
>> pipeline, what Java schemas have today:
>>- to/from UDFs: the code is SDK-specific
>>- some representation of the intended type (like java class): also SDK
>> specific
>>- a representation
>>- any "id" is just like other ids in the pipeline, just avoiding
>> duplicating the proto
>>- Luke points out that nesting these can give multiple SDKs a hint
>>
>> In my mind the remaining complexity is whether or not we need to be able
>> to move between the two. Composite PTransforms, for example, do have
>> fluidity between being strictly user-defined versus portable URN+payload.
>> But it requires lots of engineering, namely the current work on expansion
>> service.
>>
>> Kenn
>>
>>
>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette 
>>> wrote:
>>>
 Ah I see, I didn't realize that. Then I suppose we'll need to/from
 functions somewhere in the logical type conversion to preserve the current
 behavior.

 I'm still a little hesitant to make these functions an explicit part of
 LogicalTypeConversion for another reason. Down the road, schemas could give
 us an avenue to use a batched columnar format (presumably arrow, but of
 course others are possible). By making to/from an explicit part of logical
 types we add some element-wise logic to a schema representation that's
 otherwise agnostic to element-wise vs. batched encodings.

 I suppose you could make an argument that to/from are only for
 custom types. There will also be some set of 

Jira tracker permission

2019-06-03 Thread Yichi Zhang
Hi, beam-dev,

This is Yichi Zhang from Google. I just started looking into Beam projects and
will be actively working on the Beam SDK. Could someone grant me permission to
the Beam Jira issue tracker? My Jira username is yichi.

Looking forward to working with everyone.

Thanks,
Yichi

Jira issue tracker permission

2019-06-03 Thread Yichi Zhang
Hi, beam-dev,

This is Yichi Zhang from Google. I just started looking into Beam projects
and will be actively working on the Beam SDK. Could someone grant me
permission to the Beam Jira issue tracker? My Jira username is yichi.

Looking forward to working with everyone.

Thanks,
Yichi


Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Valentyn Tymofieiev
Thanks, Ankur, for driving the release. Do we have a draft of user-friendly
summary of release notes with high-level changes somewhere? If so, please
tag me on a document or a PR, or post the link in this thread. Thank you!

On Mon, Jun 3, 2019 at 5:38 PM Ankur Goenka  wrote:

> +1
> Thanks for validating the release and voting.
> With 0(-1), 6(+1) and 3(+1 binding) votes, I am concluding the voting
> process.
> I am going ahead with the release and will keep the community posted with
> the updates.
>
> On Mon, Jun 3, 2019 at 1:57 PM Andrew Pilloud  wrote:
>
>> +1 Reviewed the Nexmark java and SQL perfkit graphs, no obvious
>> regressions over the previous release.
>>
>> On Mon, Jun 3, 2019 at 1:15 PM Lukasz Cwik  wrote:
>>
>>> Thanks for the clarification.
>>>
>>> On Mon, Jun 3, 2019 at 11:40 AM Ankur Goenka  wrote:
>>>
 Yes, I meant I will close the voting at 5pm and start the release
 process.

 On Mon, Jun 3, 2019, 10:59 AM Lukasz Cwik  wrote:

> Ankur, did you mean to say you're going to close the vote today at 5pm?
> (and then complete the release afterwards)
>
> On Mon, Jun 3, 2019 at 10:54 AM Ankur Goenka 
> wrote:
>
>> Thanks for validating and voting.
>>
>> We have 4 binding votes.
>> I will complete the release today 5PM. Please raise any concerns
>> before that.
>>
>> Thanks,
>> Ankur
>>
>> On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:
>>
>>> Since the gearpump issue has been ongoing since 2.10, I can't
>>> consider it a blocker for this release and am voting +1.
>>>
>>> On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1 (binding)

 Quickly tested on beam-samples.

 Regards
 JB

 On 31/05/2019 04:52, Ankur Goenka wrote:
 > Hi everyone,
 >
 > Please review and vote on the release candidate #2 for the version
 > 2.13.0, as follows:
 >
 > [ ] +1, Approve the release
 > [ ] -1, Do not approve the release (please provide specific
 comments)
 >
 > The complete staging area is available for your review, which
 includes:
 > * JIRA release notes [1],
 > * the official Apache source release to be deployed to
 dist.apache.org
 >  [2], which is signed with the key with
 > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
 > * all artifacts to be deployed to the Maven Central Repository
 [4],
 > * source code tag "v2.13.0-RC2" [5],
 > * website pull request listing the release [6] and publishing the
 API
 > reference manual [7].
 > * Python artifacts are deployed along with the source release to
 the
 > dist.apache.org  [2].
 > * Validation sheet with a tab for 2.13.0 release to help with
 validation
 > [8].
 >
 > The vote will be open for at least 72 hours. It is adopted by
 majority
 > approval, with at least 3 PMC affirmative votes.
 >
 > Thanks,
 > Ankur
 >
 > [1]
 >
 https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
 > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
 > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 > [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1070/
 > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
 > [6] https://github.com/apache/beam/pull/8645
 > [7] https://github.com/apache/beam-site/pull/589
 > [8]
 >
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

>>>


Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks

2019-06-03 Thread Ahmet Altay
Thank you for the feedback so far. It seems like this will be generally
helpful :)

I guess next step would be, would anyone be interested in working in this
area? We can potentially break this down into starter tasks.

On Sat, Jun 1, 2019 at 7:00 PM Ankur Goenka  wrote:

> +1 for the proposal.
> Compatibility Matrix can be
> a good place to showcase parity between different runners.
>

+1


> Do you think we should write 2-way examples [Spark, Flink, ..] <=> Beam?
>

Both ways would be most useful, I believe.


>
>
>
> On Sat, Jun 1, 2019 at 4:31 PM Reza Rokni  wrote:
>
>> For layer 1, what about working through this link as a starting point :
>> https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
>> ?
>>
>
+1


>
>> On Sat, 1 Jun 2019 at 09:21, Ahmet Altay  wrote:
>>
>>> Thank you Reza. That separation makes sense to me.
>>>
>>> On Wed, May 29, 2019 at 6:26 PM Reza Rokni  wrote:
>>>
 +1

 I think there will be at least two layers of this;

 Layer 1 - Using primitives : I do join, GBK, Aggregation... with system
 x this way, what is the canonical equivalent in Beam.
 Layer 2 - Patterns : I read and join Unbounded and Bounded Data in
 system x this way, what is the canonical equivalent in Beam.

 I suspect as a first pass Layer 1 is reasonably well bounded work,
 there would need to be agreement on "canonical" version of how to do
 something in Beam as this could be seen to be opinionated. As there are
 often a multitude of ways of doing x

>>>
>>> Once we identify a set of layer 1 items, we could crowdsource the
>>> canonical implementations. I believe we can use our usual code review
>>> process to settle on a version that is agreeable. (Examples have the same
>>> issue, they are probably opinionated today based on the author but it works
>>> out.)
>>>
>>>


 On Thu, 30 May 2019 at 08:56, Ahmet Altay  wrote:

> Hi all,
>
> Inspired by the user asking about a Spark feature in Beam [1] in the
> release thread, I searched the user@ list and noticed a few instances
> of people asking questions like "I can do X in Spark, how can I do that
> in Beam?" Would it make sense to add documentation explaining how certain
> tasks can be accomplished in Beam, with side-by-side examples of doing
> the same task in Beam/Spark etc.? It could help with on-boarding because it
> will be easier for people to leverage their existing knowledge. It could
> also help other frameworks as well, because it will serve as a Rosetta
> stone with two translations.
>
> Questions I have are:
> - Would such a thing be helpful?
> - Is it feasible? Would a few pages' worth of examples cover enough
> use cases?
>
> Thank you!
> Ahmet
>
> [1]
> https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
>



>>>
>>
>>
>
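As a concrete sketch of what a "Layer 1" entry could look like (a hypothetical layout, not an agreed format): show the familiar Spark and Beam spellings side by side, and pin down the shared semantics in plain Python so the entry is unambiguous:

```python
# Hypothetical cookbook entry for "group values by key":
#
#   Spark:  rdd.groupByKey()           # RDD[(K, V)] -> RDD[(K, Iterable[V])]
#   Beam:   pcoll | beam.GroupByKey()  # PCollection of KVs -> grouped KVs
#
# Shared semantics, spelled out in plain Python (illustration only):
from collections import defaultdict

def group_by_key(pairs):
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)

print(group_by_key([('a', 1), ('b', 2), ('a', 3)]))
# {'a': [1, 3], 'b': [2]}
```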


Re: BQ IT tests fail on TestDataflowRunner - Python SDK

2019-06-03 Thread Chamikara Jayalath
Sounds like your input job was somehow incompatible with the Dataflow
worker. Running using a clean virtual env should help verify as Ahmet
mentioned.

On Mon, Jun 3, 2019 at 5:44 PM Ahmet Altay  wrote:

> Do you have any other changes? Are you trying from head with a clean
> virtual environment?
>
> If you can share a link to dataflow job (in the apache-beam-testing GCP
> project), we can try to look at additional logs as well.
>
> On Mon, Jun 3, 2019 at 1:42 PM Tanay Tummalapalli 
> wrote:
>
>> Hi everyone,
>>
>> I ran the Integration Tests -
>> BigQueryStreamingInsertTransformIntegrationTests[1] and
>> BigQueryFileLoadsIT[2] on the master branch locally, with the following
>> command:
>> ./scripts/run_integration_test.sh --test_opts
>> --tests=apache_beam.io.gcp.bigquery_test:BigQueryStreamingInsertTransformIntegrationTests
>> The Dataflow jobs for the tests failed with the following error:
>> root: INFO: 2019-06-03T18:36:53.021Z: JOB_MESSAGE_ERROR: Traceback (most
>> recent call last):
>> File
>> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py",
>> line 649, in do_work
>> work_executor.execute()
>> File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
>> line 150, in execute
>> test_shuffle_sink=self._test_shuffle_sink)
>> File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
>> line 116, in create_operation
>> is_streaming=False)
>> File "apache_beam/runners/worker/operations.py", line 962, in
>> apache_beam.runners.worker.operations.create_operation
>> op = BatchGroupAlsoByWindowsOperation(
>> File "dataflow_worker/shuffle_operations.py", line 219, in
>> dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.
>> __init__
>> self.windowing = deserialize_windowing_strategy(self.spec.window_fn)
>> File "dataflow_worker/shuffle_operations.py", line 207, in
>> dataflow_worker.shuffle_operations.deserialize_windowing_strategy
>> return pickler.loads(serialized_data)
>> File
>> "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py",
>> line 248, in loads
>> c = base64.b64decode(encoded)
>> File "/usr/lib/python2.7/base64.py", line 78, in b64decode
>> raise TypeError(msg)
>> TypeError: Incorrect padding
>>
>>
>> I tested the same tests on the 2.13.0-RC#2 branch as well and they
>> passed. These tests also don't fail in the most recent Python post-commit
>> tests[3-5].
>>
>> Keeping in mind the recent b64 changes in BQ, none of the tests in the
>> test classes mentioned above makes use of a "BYTES" type field.
>> Would love to get pointers to possible reasons.
>>
>> Thank You
>> - TT
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_test.py#L479-L630
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads_test.py#L358-L528
>> [3]
>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/
>> [4]
>> https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/
>> [5]
>> https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/
>>
>
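For reference, the "Incorrect padding" failure in the traceback above is what base64 raises when a serialized payload loses its trailing '=' padding or is otherwise mangled. A minimal reproduction, assuming nothing about the Dataflow worker itself (Python 3 shown; Python 2's b64decode raised TypeError, as in the worker log):

```python
# Reproduce "Incorrect padding" by stripping the '=' padding from a valid
# base64 string, then recover it by re-padding to a multiple of 4 chars.
import base64
import binascii

encoded = base64.b64encode(b'windowfn').decode('ascii')  # 'd2luZG93Zm4='
broken = encoded.rstrip('=')  # a truncated/mangled payload might look like this

try:
    base64.b64decode(broken)
except binascii.Error as e:   # Python 2's b64decode raised TypeError instead
    print(e)                  # Incorrect padding

# Re-padding to a multiple of 4 characters recovers the payload:
repadded = broken + '=' * (-len(broken) % 4)
assert base64.b64decode(repadded) == b'windowfn'
```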


Re: BQ IT tests fail on TestDataflowRunner - Python SDK

2019-06-03 Thread Ahmet Altay
Do you have any other changes? Are you trying from head with a clean
virtual environment?

If you can share a link to dataflow job (in the apache-beam-testing GCP
project), we can try to look at additional logs as well.

On Mon, Jun 3, 2019 at 1:42 PM Tanay Tummalapalli 
wrote:

> Hi everyone,
>
> I ran the Integration Tests -
> BigQueryStreamingInsertTransformIntegrationTests[1] and
> BigQueryFileLoadsIT[2] on the master branch locally, with the following
> command:
> ./scripts/run_integration_test.sh --test_opts
> --tests=apache_beam.io.gcp.bigquery_test:BigQueryStreamingInsertTransformIntegrationTests
> The Dataflow jobs for the tests failed with the following error:
> root: INFO: 2019-06-03T18:36:53.021Z: JOB_MESSAGE_ERROR: Traceback (most
> recent call last):
> File
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py",
> line 649, in do_work
> work_executor.execute()
> File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
> line 150, in execute
> test_shuffle_sink=self._test_shuffle_sink)
> File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
> line 116, in create_operation
> is_streaming=False)
> File "apache_beam/runners/worker/operations.py", line 962, in
> apache_beam.runners.worker.operations.create_operation
> op = BatchGroupAlsoByWindowsOperation(
> File "dataflow_worker/shuffle_operations.py", line 219, in
> dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.
> __init__
> self.windowing = deserialize_windowing_strategy(self.spec.window_fn)
> File "dataflow_worker/shuffle_operations.py", line 207, in
> dataflow_worker.shuffle_operations.deserialize_windowing_strategy
> return pickler.loads(serialized_data)
> File
> "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py",
> line 248, in loads
> c = base64.b64decode(encoded)
> File "/usr/lib/python2.7/base64.py", line 78, in b64decode
> raise TypeError(msg)
> TypeError: Incorrect padding
>
>
> I tested the same tests on the 2.13.0-RC#2 branch as well and they passed.
> These tests also don't fail in the most recent Python post-commit
> tests[3-5].
>
> Keeping in mind the recent b64 changes in BQ, none of the tests in the
> test classes mentioned above makes use of a "BYTES" type field.
> Would love to get pointers to possible reasons.
>
> Thank You
> - TT
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_test.py#L479-L630
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads_test.py#L358-L528
> [3]
> https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/
> [4]
> https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/
> [5]
> https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/
>


Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Ankur Goenka
+1
Thanks for validating the release and voting.
With 0(-1), 6(+1) and 3(+1 binding) votes, I am concluding the voting
process.
I am going ahead with the release and will keep the community posted with
the updates.

On Mon, Jun 3, 2019 at 1:57 PM Andrew Pilloud  wrote:

> +1 Reviewed the Nexmark java and SQL perfkit graphs, no obvious
> regressions over the previous release.
>
> On Mon, Jun 3, 2019 at 1:15 PM Lukasz Cwik  wrote:
>
>> Thanks for the clarification.
>>
>> On Mon, Jun 3, 2019 at 11:40 AM Ankur Goenka  wrote:
>>
>>> Yes, I meant I will close the voting at 5pm and start the release
>>> process.
>>>
>>> On Mon, Jun 3, 2019, 10:59 AM Lukasz Cwik  wrote:
>>>
 Ankur, did you mean to say you're going to close the vote today at 5pm?
 (and then complete the release afterwards)

 On Mon, Jun 3, 2019 at 10:54 AM Ankur Goenka  wrote:

> Thanks for validating and voting.
>
> We have 4 binding votes.
> I will complete the release today 5PM. Please raise any concerns
> before that.
>
> Thanks,
> Ankur
>
> On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:
>
>> Since the gearpump issue has been ongoing since 2.10, I can't
>> consider it a blocker for this release and am voting +1.
>>
>> On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Quickly tested on beam-samples.
>>>
>>> Regards
>>> JB
>>>
>>> On 31/05/2019 04:52, Ankur Goenka wrote:
>>> > Hi everyone,
>>> >
>>> > Please review and vote on the release candidate #2 for the version
>>> > 2.13.0, as follows:
>>> >
>>> > [ ] +1, Approve the release
>>> > [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>> >
>>> > The complete staging area is available for your review, which
>>> includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to
>>> dist.apache.org
>>> >  [2], which is signed with the key with
>>> > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.13.0-RC2" [5],
>>> > * website pull request listing the release [6] and publishing the
>>> API
>>> > reference manual [7].
>>> > * Python artifacts are deployed along with the source release to
>>> the
>>> > dist.apache.org  [2].
>>> > * Validation sheet with a tab for 2.13.0 release to help with
>>> validation
>>> > [8].
>>> >
>>> > The vote will be open for at least 72 hours. It is adopted by
>>> majority
>>> > approval, with at least 3 PMC affirmative votes.
>>> >
>>> > Thanks,
>>> > Ankur
>>> >
>>> > [1]
>>> >
>>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1070/
>>> > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
>>> > [6] https://github.com/apache/beam/pull/8645
>>> > [7] https://github.com/apache/beam-site/pull/589
>>> > [8]
>>> >
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: [PROPOSAL] Standardize Gradle structure in Python SDK

2019-06-03 Thread Valentyn Tymofieiev
Hey Mark & others,

We've been following the structure proposed in this thread to extend test
coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see [1].

This structure allowed us to add 3.x suites without slowing down the
pre/postcommit execution time. We can actually see a drop in precommit
latency [2] around March 23, when we first made some Python 3.x suites run in
parallel, and we have added more suites since then without slowing down
pre/postcommits. Therefore I am in favor of this proposal, especially since
AFAIK we don't have a better one. Thanks a lot!

I do have some feedback on this proposal:

1. There is a duplication of gradle code between test suites for different
python minor versions, for example see the identical definition of
DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5].

A possible solution to reduce the duplication is to move common code that
defines a task into a separate groovy file shared across multiple gradle
files. We have an example of this, where enablePythonPerformanceTest() is
defined in BeamModulePlugin.groovy, and used in several build.gradle files
to create a gradle task required for performance tests, see: [6]. I
followed the same example in a Python 3 test suite for Portable Flink
Runner I am working on [3], however I am not sure if BeamModulePlugin is
the best place to define common gradle tasks needed for Python CI.
Perhaps we can make a separate groovy file for this purpose in
sdk/python/test-suites?

2. Python 3 test suites currently live in sdks/python/test-suites, while
most Python 2 suites are still defined in sdks/python/build.gradle.

This may cause confusion for folks working on adding new Python suites. If
there is an overall agreement on the proposed structure, I suggest starting to
move Python 2 CI tasks out of sdks/python/build.gradle into
sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy
file. If there are better alternatives we can continue discussing them here.

Thanks,
Valentyn


[1] https://github.com/apache/beam/tree/master/sdks/python/test-suites
[2]
http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1=1546507894013=1554189164736
[3] https://github.com/apache/beam/pull/8745
[4]
https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27
[5]
https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27
[6]
https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest


On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri  wrote:

> I don't use gradle commands for Python development either, because they
> are slow (no incremental testing).
>
>
>
> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey 
> wrote:
>
>>
>>
>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey 
>>> wrote:
>>> >
>>> > Really like the idea of improving here.
>>> >
>>> > Unfortunately, I haven't worked with python on that scale yet, so bear
>>> with my naive understandings in this regard. If I understand correctly, the
>>> suggestion will result in a couple of projects consisting only of a
>>> build.gradle file to kind of work around Gradle's decision not to
>>> parallelize within projects, right? In consequence, this also kind of
>>> decouples projects from their content - the stuff which constitutes the
>>> project - and forces the build file to somehow reach out to the content of
>>> other (only python root?) projects, i.e. it couples projects. This somehow
>>> feels unnatural to me. But, of course, it might be the path to go. As I
>>> said before, never worked on python on that scale.
>>>
>>> It feels a bit odd to me as well. Is it possible to have multiple
>>> projects per directory (e.g. a suite of testing ones) rather than
>>> having to break things up like this, especially if the goal is
>>> primarily to get parallel running of tests? Especially if we could
>>> automatically create the cross-product rather than manually? There
>>> also seems to be some redundancy with what tox is doing here.
>>>
>>
>> Not sure, whether I understand correctly. But I do not think that's
>> possible. If we are going to do some cross-product, we are probably better
>> off doing that on tasks, e.g. by leveraging task rules or programmatically
>> adding tasks (which is already done in parts). Of course, this will not
>> help with parallelisation (but might enable that, see below).
>>
>>
>>>
>>> > But I seem to remember Robert talking about using in-project
>>> parallelisation for his development. Is this something which could also
>>> work on CI? Of course, that will not help with different python versions,
>>> but maybe that could be solved also by gradles variants which are
>>> introduced in 5.3 - definitely need some time to investigate the
>>> possibilities here. On first sight it feels like lots of duplication to

Re: [Discuss] Ideas for Apache Beam presence in social media

2019-06-03 Thread Aizhamal Nurmamat kyzy
Hello folks,

I have created a spreadsheet where people can suggest tweets [1]. It
contains a couple of tweets that have been tweeted as examples. Also, there
are a couple others that I will ask PMC members to review in the next few
days.

I have also created a blog post[2] to invite community members to
participate by proposing tweets / retweets.

Does this look OK to everyone? I’d love to try it out and see if it drives
engagement in the community. If not we can always change the processes.

Thanks,
aizhamal

[1] s.apache.org/beam-tweets
[2] https://github.com/apache/beam/pull/8747

On Fri, May 24, 2019 at 4:26 PM Kenneth Knowles  wrote:

> Thanks for taking on this work!
>
> Kenn
>
> On Fri, May 24, 2019 at 2:52 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
>
>> Hi everyone,
>>
>> I'd like to pilot this if that's okay by everyone. I'll set up a
>> spreadsheet, write a blog post publicizing it, and perhaps send out a
>> tweet. We can improve the process later with tools if necessary.
>>
>> Thanks all and have a great weekend!
>> Aizhamal
>>
>> On Tue, May 21, 2019 at 8:37 PM Kenneth Knowles  wrote:
>>
>>> Great idea.
>>>
>>> Austin - point well taken about whether the PMC really has to
>>> micro-manage here. The stakes are potentially very high, but so are the
>>> stakes for code and website changes.
>>>
>>> I know that comdev votes authoring privileges to people who are not
>>> committers, but they are not speaking on behalf of comdev but under their
>>> own name.
>>>
>>> Let's definitely find a way to be effective on social media.
>>>
>>> Kenn
>>>
>>> On Tue, May 21, 2019 at 4:14 AM Maximilian Michels 
>>> wrote:
>>>
 Hi Aizhamal,

 This is a great idea. I think it would help Beam to be more prominent
 on
 social media.

 We need to discuss this also on the private@ mailing list but I don't
 see anything standing in the way if the PMC always gets to approve the
 proposed social media postings.

 I could even imagine that the PMC gives rights to a Beam community
 member to post in their name.

 Thanks,
 Max

 On 21.05.19 03:09, Austin Bennett wrote:
 > Is PMC definitely in charge of this (approving, communication
 channel,
 > etc)?
 >
 > There could even be a more concrete pull-request-like function even
 for
 > things like tweets (to minimize cut/paste operations)?
 >
 > I remember a bit of a mechanism having been proposed some time ago
 (in
 > another circumstance), though doesn't look like it made it terribly
 far:
 >
 http://www.redhenlab.org/home/the-cognitive-core-research-topics-in-red-hen/the-barnyard/-slick-tweeting
 > (I haven't otherwise seen such functionality).
 >
 >
 >
 > On Mon, May 20, 2019 at 4:54 PM Robert Burke >>> > > wrote:
 >
 > +1
 > As a twitter user, I like this idea.
 >
 > On Mon, 20 May 2019 at 15:18, Aizhamal Nurmamat kyzy
 > mailto:aizha...@google.com>> wrote:
 >
 > Hello everyone,
 >
 >
 > What does the community think of making Apache Beam’s social
 > media presence more active and more community driven?
 >
 >
 > The Slack and StackOverflow for Apache Beam offer pretty nice
 > support, but we still could utilize Twitter & LinkedIn better
 to
 > share more interesting Beam news. For example, we could tweet
 to
 > welcome new committers, announce new features consistently,
 > share and recognize contributions, promote events and meetups,
 > share other news that are relevant to Beam, big data, etc.
 >
 >
 > I understand that PMC members may not have time to do
 curation,
 > moderation and creation of content; so I was wondering if we
 > could create a spreadsheet where community members could
 propose
 > posts with publishing dates, and let somebody to filter,
 > moderate, and manage it; then send to a PMC member for
 publication.
 >
 >
 > I would love to help where I can in this regard. I’ve had some
 > experience doing social media elsewhere in the past.
 >
 >
 > Best
 >
 > Aizhamal
 >
 >

>>>


Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Andrew Pilloud
+1 Reviewed the Nexmark java and SQL perfkit graphs, no obvious regressions
over the previous release.

On Mon, Jun 3, 2019 at 1:15 PM Lukasz Cwik  wrote:

> Thanks for the clarification.
>
> On Mon, Jun 3, 2019 at 11:40 AM Ankur Goenka  wrote:
>
>> Yes, I meant I will close the voting at 5pm and start the release process.
>>
>> On Mon, Jun 3, 2019, 10:59 AM Lukasz Cwik  wrote:
>>
>>> Ankur, did you mean to say you're going to close the vote today at 5pm?
>>> (and then complete the release afterwards)
>>>
>>> On Mon, Jun 3, 2019 at 10:54 AM Ankur Goenka  wrote:
>>>
 Thanks for validating and voting.

 We have 4 binding votes.
 I will complete the release today 5PM. Please raise any concerns before
 that.

 Thanks,
 Ankur

 On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:

> Since the gearpump issue has been ongoing since 2.10, I can't consider
> it a blocker for this release and am voting +1.
>
> On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (binding)
>>
>> Quickly tested on beam-samples.
>>
>> Regards
>> JB
>>
>> On 31/05/2019 04:52, Ankur Goenka wrote:
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #2 for the version
>> > 2.13.0, as follows:
>> >
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific
>> comments)
>> >
>> > The complete staging area is available for your review, which
>> includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to
>> dist.apache.org
>> >  [2], which is signed with the key with
>> > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.13.0-RC2" [5],
>> > * website pull request listing the release [6] and publishing the
>> API
>> > reference manual [7].
>> > * Python artifacts are deployed along with the source release to the
>> > dist.apache.org  [2].
>> > * Validation sheet with a tab for 2.13.0 release to help with
>> validation
>> > [8].
>> >
>> > The vote will be open for at least 72 hours. It is adopted by
>> majority
>> > approval, with at least 3 PMC affirmative votes.
>> >
>> > Thanks,
>> > Ankur
>> >
>> > [1]
>> >
>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166
>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1070/
>> > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
>> > [6] https://github.com/apache/beam/pull/8645
>> > [7] https://github.com/apache/beam-site/pull/589
>> > [8]
>> >
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


BQ IT tests fail on TestDataflowRunner - Python SDK

2019-06-03 Thread Tanay Tummalapalli
Hi everyone,

I ran the Integration Tests -
BigQueryStreamingInsertTransformIntegrationTests[1] and BigQueryFileLoadsIT[2]
on the master branch locally, with the following command:
./scripts/run_integration_test.sh --test_opts
--tests=apache_beam.io.gcp.bigquery_test:BigQueryStreamingInsertTransformIntegrationTests
The Dataflow jobs for the tests failed with the following error:
root: INFO: 2019-06-03T18:36:53.021Z: JOB_MESSAGE_ERROR: Traceback (most
recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py",
line 649, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
line 150, in execute
test_shuffle_sink=self._test_shuffle_sink)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py",
line 116, in create_operation
is_streaming=False)
File "apache_beam/runners/worker/operations.py", line 962, in
apache_beam.runners.worker.operations.create_operation
op = BatchGroupAlsoByWindowsOperation(
File "dataflow_worker/shuffle_operations.py", line 219, in
dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.__init__
self.windowing = deserialize_windowing_strategy(self.spec.window_fn)
File "dataflow_worker/shuffle_operations.py", line 207, in
dataflow_worker.shuffle_operations.deserialize_windowing_strategy
return pickler.loads(serialized_data)
File
"/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py",
line 248, in loads
c = base64.b64decode(encoded)
File "/usr/lib/python2.7/base64.py", line 78, in b64decode
raise TypeError(msg)
TypeError: Incorrect padding


I tested the same tests on the 2.13.0-RC#2 branch as well and they passed.
These tests also don't fail in the most recent Python post-commit
tests[3-5].

Keeping in mind the recent b64 changes in BQ, none of the tests in the test
classes mentioned above makes use of a "BYTES" type field.
Would love to get pointers to possible reasons.

Thank You
- TT

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_test.py#L479-L630
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads_test.py#L358-L528
[3]
https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/
[4]
https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/
[5]
https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/
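For reference, a minimal sketch reproducing the "Incorrect padding" failure outside Beam. On Python 3 the error surfaces as `binascii.Error`; the Python 2 worker in the traceback above raises it as `TypeError`:

```python
import base64
import binascii

# base64 input must be a multiple of 4 characters; stripping the trailing
# '=' padding from a valid encoding triggers "Incorrect padding".
truncated = "aGVsbG8gd29ybGQ"  # b64 of b"hello world" minus its '=' padding

try:
    base64.b64decode(truncated)
    raised = False
except binascii.Error:  # Python 2's base64 module raises TypeError here instead
    raised = True

# Re-padding the string to a multiple of 4 makes the same input decodable again.
padded = truncated + "=" * (-len(truncated) % 4)
decoded = base64.b64decode(padded)
```

An error like this inside `pickler.loads` usually means the serialized payload was truncated or mangled somewhere upstream, rather than that the pipeline data itself contained bad bytes.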


Re: 1 Million Lines of Code (1 MLOC)

2019-06-03 Thread Brian Hulette
You can run loc and tokei with a --files arg to get a breakdown by file.
They're just classifying one file as autoconf:
https://github.com/apache/beam/blob/master/sdks/python/MANIFEST.in
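As a rough illustration of what a per-file breakdown computes (this is not how `loc`/`tokei` work internally; the tiny demo tree below is made up):

```python
import os
import tempfile

def loc_by_file(root, exts=(".py",)):
    """Rudimentary per-file raw line count over a source tree."""
    counts = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(tuple(exts)):
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    counts[path] = sum(1 for _ in f)
    return counts

# Tiny demo tree with two files of known sizes.
root = tempfile.mkdtemp()
with open(os.path.join(root, "a.py"), "w") as f:
    f.write("x = 1\ny = 2\n")
with open(os.path.join(root, "b.py"), "w") as f:
    f.write("print('hi')\n")

counts = loc_by_file(root)
total = sum(counts.values())  # 3 lines across both files
```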

On Mon, Jun 3, 2019 at 1:02 PM Kenneth Knowles  wrote:

> Where's the autoconf?
>
> On Mon, Jun 3, 2019 at 10:21 AM Kyle Weaver  wrote:
>
>> > time to delete the entire project and start over again
>>
>> Agreed, but this time using Rust. (Just think of all the good press we'll
>> get on Hacker News! )
>>
>> @ruoyun looks like the c++ is a basic `echo` program for an example
>> pipeline?
>> https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/subprocess
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>> | +1650203
>>
>>
>> On Mon, Jun 3, 2019 at 10:11 AM Ruoyun Huang  wrote:
>>
>>> interesting stats.
>>>
>>> I am very curious in what we can benefit from merely *32* lines of c++
>>> code in a MLOC repository.
>>>
>>> On Mon, Jun 3, 2019 at 2:10 AM Maximilian Michels 
>>> wrote:
>>>
 Interesting stats :) This metric does not take into account Beam's
 dependencies, e.g. libraries and execution backends. That would
 increase
 the LOCs to millions.

 On 01.06.19 01:46, Alex Amato wrote:
 > Interesting, so if we play with https://github.com/cgag/loc we could
 > break it down further? I.e. test files vs code files? Which folders,
 > etc. That could be interesting as well.
 >
 > On Fri, May 31, 2019 at 4:20 PM Brian Hulette >>> > > wrote:
 >
 > Dennis Nedry needed 2 million lines of code to control Jurassic
 > Park, and he only had to manage eight computers! I think we may
 > actually need to pick up the pace.
 >
 > On Fri, May 31, 2019 at 4:11 PM Anton Kedin >>> > > wrote:
 >
 > And to reduce the effort of future rewrites we should start
 > doing it on a schedule. I propose we start over once a week :)
 >
 > On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik >>> > > wrote:
 >
 > 1 million lines is too much, time to delete the entire
 > project and start over again, :-)
 >
 > On Fri, May 31, 2019 at 3:12 PM Ankur Goenka
 > mailto:goe...@google.com>> wrote:
 >
 > Thanks for sharing.
 > This is really interesting metrics.
 > One use I can see is to track LOC vs Comments to make
 > sure that we keep up with the practice of writing
 > maintainable code.
 >
 > On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía
 > mailto:ieme...@gmail.com>> wrote:
 >
 > I was checking some metrics in our codebase and
 > found by chance that
 > we have passed the 1 million lines of code (MLOC).
 > Of course lines of
 > code may not matter much but anyway it is
 > interesting to see the size
 > of our project at this moment.
 >
 > This is the detailed information returned by loc
 [1]:
 >
 >
  
 
 >   Language FilesLines
 > Blank  Comment Code
 >
  
 
 >   Java  3681   673007
 > 78265   140753   453989
 >   Python 497   131082
 > 225601337895144
 >   Go 333   105775
 > 136811107381021
 >   Markdown   20531989
 >   6526025463
 >   Plain Text  1121979
 >   6359015620
 >   Sass92 9867
 >   1434 1900 6533
 >   JavaScript  19 5157
 >   1197  467 3493
 >   YAML14 4601
 > 454 1104 3043
 >   Bourne Shell30 3874
 > 470 1028 2376
 >   Protobuf17 4258
 > 677 1373   

Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Lukasz Cwik
Thanks for the clarification.

On Mon, Jun 3, 2019 at 11:40 AM Ankur Goenka  wrote:

> Yes, I meant I will close the voting at 5pm and start the release process.
>
> On Mon, Jun 3, 2019, 10:59 AM Lukasz Cwik  wrote:
>
>> Ankur, did you mean to say you're going to close the vote today at 5pm?
>> (and then complete the release afterwards)
>>
>> On Mon, Jun 3, 2019 at 10:54 AM Ankur Goenka  wrote:
>>
>>> Thanks for validating and voting.
>>>
>>> We have 4 binding votes.
>>> I will complete the release today 5PM. Please raise any concerns before
>>> that.
>>>
>>> Thanks,
>>> Ankur
>>>
>>> On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:
>>>
 Since the gearpump issue has been ongoing since 2.10, I can't consider
 it a blocker for this release and am voting +1.

 On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
 wrote:

> +1 (binding)
>
> Quickly tested on beam-samples.
>
> Regards
> JB
>
> On 31/05/2019 04:52, Ankur Goenka wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> > 2.13.0, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which
> includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to
> dist.apache.org
> >  [2], which is signed with the key with
> > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.13.0-RC2" [5],
> > * website pull request listing the release [6] and publishing the API
> > reference manual [7].
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org  [2].
> > * Validation sheet with a tab for 2.13.0 release to help with
> validation
> > [8].
> >
> > The vote will be open for at least 72 hours. It is adopted by
> majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Ankur
> >
> > [1]
> >
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1070/
> > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
> > [6] https://github.com/apache/beam/pull/8645
> > [7] https://github.com/apache/beam-site/pull/589
> > [8]
> >
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



Re: 1 Million Lines of Code (1 MLOC)

2019-06-03 Thread Kenneth Knowles
Where's the autoconf?

On Mon, Jun 3, 2019 at 10:21 AM Kyle Weaver  wrote:

> > time to delete the entire project and start over again
>
> Agreed, but this time using Rust. (Just think of all the good press we'll
> get on Hacker News! )
>
> @ruoyun looks like the c++ is a basic `echo` program for an example
> pipeline?
> https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/subprocess
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> | +1650203
>
>
> On Mon, Jun 3, 2019 at 10:11 AM Ruoyun Huang  wrote:
>
>> interesting stats.
>>
>> I am very curious in what we can benefit from merely *32* lines of c++
>> code in a MLOC repository.
>>
>> On Mon, Jun 3, 2019 at 2:10 AM Maximilian Michels  wrote:
>>
>>> Interesting stats :) This metric does not take into account Beam's
>>> dependencies, e.g. libraries and execution backends. That would increase
>>> the LOCs to millions.
>>>
>>> On 01.06.19 01:46, Alex Amato wrote:
>>> > Interesting, so if we play with https://github.com/cgag/loc we could
>>> > break it down further? I.e. test files vs code files? Which folders,
>>> > etc. That could be interesting as well.
>>> >
>>> > On Fri, May 31, 2019 at 4:20 PM Brian Hulette >> > > wrote:
>>> >
>>> > Dennis Nedry needed 2 million lines of code to control Jurassic
>>> > Park, and he only had to manage eight computers! I think we may
>>> > actually need to pick up the pace.
>>> >
>>> > On Fri, May 31, 2019 at 4:11 PM Anton Kedin >> > > wrote:
>>> >
>>> > And to reduce the effort of future rewrites we should start
>>> > doing it on a schedule. I propose we start over once a week :)
>>> >
>>> > On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik >> > > wrote:
>>> >
>>> > 1 million lines is too much, time to delete the entire
>>> > project and start over again, :-)
>>> >
>>> > On Fri, May 31, 2019 at 3:12 PM Ankur Goenka
>>> > mailto:goe...@google.com>> wrote:
>>> >
>>> > Thanks for sharing.
>>> > This is really interesting metrics.
>>> > One use I can see is to track LOC vs Comments to make
>>> > sure that we keep up with the practice of writing
>>> > maintainable code.
>>> >
>>> > On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía
>>> > mailto:ieme...@gmail.com>> wrote:
>>> >
>>> > I was checking some metrics in our codebase and
>>> > found by chance that
>>> > we have passed the 1 million lines of code (MLOC).
>>> > Of course lines of
>>> > code may not matter much but anyway it is
>>> > interesting to see the size
>>> > of our project at this moment.
>>> >
>>> > This is the detailed information returned by loc
>>> [1]:
>>> >
>>> >
>>> >  ------------------------------------------------------------------
>>> >   Language        Files     Lines     Blank   Comment      Code
>>> >  ------------------------------------------------------------------
>>> >   Java             3681    673007     78265    140753    453989
>>> >   Python            497    131082     22560     13378     95144
>>> >   Go                333    105775     13681     11073     81021
>>> >   Markdown          205     31989      6526         0     25463
>>> >   Plain Text        112     21979      6359         0     15620
>>> >   Sass               92      9867      1434      1900      6533
>>> >   JavaScript         19      5157      1197       467      3493
>>> >   YAML               14      4601       454      1104      3043
>>> >   Bourne Shell       30      3874       470      1028      2376
>>> >   Protobuf           17      4258       677      1373      2208
>>> >   XML                17      2789       296       559      1934
>>> >   Kotlin             19      3501       347      1370      1784
>>> >   HTML               60      2447       148

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Brian Hulette
> It has to go into the proto somewhere (since that's the only way the SDK
can get it), but I'm not sure they should be considered integral parts of
the type.
Are you just advocating for an approach where any SDK-specific information
is stored outside of the Schema message itself so that Schema really does
just represent the type? That seems reasonable to me, and alleviates my
concerns about how this applies to columnar encodings a bit as well.

We could lift all of the LogicalTypeConversion messages out of the Schema
and the LogicalType like this:

message SchemaCoder {
  Schema schema = 1;
  LogicalTypeConversion root_conversion = 2;
  map<string, LogicalTypeConversion> attribute_conversions = 3; // only
necessary for user type aliases; portable logical types by definition have
nothing SDK-specific
}

I think a critical question (that has implications for the above proposal)
is how/if the two different concepts Kenn mentioned are allowed to nest.
For example, you could argue it's redundant to have a user type alias that
has a Row representation with a field that is itself a user type alias,
because instead you could just have a single top-level type alias
with to/from functions that pack and unpack the entire hierarchy. On the
other hand, I think it does make sense for a user type alias or a truly
portable logical type to have a field that is itself a truly portable
logical type (e.g. a user type alias or portable type with a DateTime).

I've been assuming that user-type aliases could be nested, but should we
disallow that? Or should we go the other way and require that logical types
define at most one "level"?
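A hypothetical sketch (names invented here, not Beam API) of why nesting can look redundant: two nested user-type aliases collapse into a single top-level to/from pair by simple composition.

```python
from decimal import Decimal

# Inner alias: Decimal <-> str (its built-in representation).
dec_to = lambda d: str(d)
dec_from = lambda s: Decimal(s)

# Outer alias: a (currency, amount) pair <-> a row whose 'amount' field
# itself uses the inner alias.
money_to = lambda m: {"currency": m[0], "amount": dec_to(m[1])}
money_from = lambda row: (row["currency"], dec_from(row["amount"]))

# The composed pair maps the user type straight to built-in types, which is
# the argument for a single top-level alias instead of nested ones.
m = ("USD", Decimal("3.50"))
row = money_to(m)
roundtrip = money_from(row)
```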

Brian

On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles  wrote:

>
> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax  wrote:
>
>> So I feel a bit leery about making the to/from functions a fundamental
>> part of the portability representation. In my mind, that is very tied to a
>> specific SDK/language. A SDK (say the Java SDK) wants to allow users to use
>> a wide variety of native types with schemas, and under the covers uses the
>> to/from functions to implement that. However from the portable Beam
>> perspective, the schema itself should be the real "type" of the
>> PCollection; the to/from methods are simply a way that a particular SDK
>> makes schemas easier to use. It has to go into the proto somewhere (since
>> that's the only way the SDK can get it), but I'm not sure they should be
>> considered integral parts of the type.
>>
>
> On the doc in a couple places this distinction was made:
>
> * For truly portable logical types, no instructions for the SDK are
> needed. Instead, they require:
>- URN: a standardized identifier any SDK can recognize
>- A spec: what is the universe of values in this type?
>- A representation: how is it represented in built-in types? This is
> how SDKs who do not know/care about the URN will process it
>- (optional): SDKs choose preferred SDK-specific types to embed the
> values in. SDKs have to know about the URN and choose for themselves.
>
> * For user-level type aliases, written as a convenience by the user in their
> pipeline, what Java schemas have today:
>- to/from UDFs: the code is SDK-specific
>- some representation of the intended type (like java class): also SDK
> specific
>- a representation
>- any "id" is just like other ids in the pipeline, just avoiding
> duplicating the proto
>- Luke points out that nesting these can give multiple SDKs a hint
>
> In my mind the remaining complexity is whether or not we need to be able
> to move between the two. Composite PTransforms, for example, do have
> fluidity between being strictly user-defined versus portable URN+payload.
> But it requires lots of engineering, namely the current work on expansion
> service.
>
> Kenn
>
>
>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette 
>> wrote:
>>
>>> Ah I see, I didn't realize that. Then I suppose we'll need to/from
>>> functions somewhere in the logical type conversion to preserve the current
>>> behavior.
>>>
>>> I'm still a little hesitant to make these functions an explicit part of
>>> LogicalTypeConversion for another reason. Down the road, schemas could give
>>> us an avenue to use a batched columnar format (presumably arrow, but of
>>> course others are possible). By making to/from an explicit part of logical
>>> types we add some element-wise logic to a schema representation that's
>>> otherwise ambivalent to element-wise vs. batched encodings.
>>>
>>> I suppose you could make an argument that to/from are only for
>>> custom types. There will also be some set of well-known types identified
>>> only by URN and some parameters, which could easily be translated to a
>>> columnar format. We could just not support custom types fully if we add a
>>> columnar encoding, or maybe add optional toBatch/fromBatch functions
>>> when/if we get there.
>>>
>>> What about something like this that makes the two different types of
>>> logical types explicit?
>>>
>>> // Describes a 

Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Ankur Goenka
Yes, I meant I will close the voting at 5pm and start the release process.

On Mon, Jun 3, 2019, 10:59 AM Lukasz Cwik  wrote:

> Ankur, did you mean to say you're going to close the vote today at 5pm? (and
> then complete the release afterwards)
>
> On Mon, Jun 3, 2019 at 10:54 AM Ankur Goenka  wrote:
>
>> Thanks for validating and voting.
>>
>> We have 4 binding votes.
>> I will complete the release today 5PM. Please raise any concerns before
>> that.
>>
>> Thanks,
>> Ankur
>>
>> On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:
>>
>>> Since the gearpump issue has been ongoing since 2.10, I can't consider
>>> it a blocker for this release and am voting +1.
>>>
>>> On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1 (binding)

 Quickly tested on beam-samples.

 Regards
 JB

 On 31/05/2019 04:52, Ankur Goenka wrote:
 > Hi everyone,
 >
 > Please review and vote on the release candidate #2 for the version
 > 2.13.0, as follows:
 >
 > [ ] +1, Approve the release
 > [ ] -1, Do not approve the release (please provide specific comments)
 >
 > The complete staging area is available for your review, which
 includes:
 > * JIRA release notes [1],
 > * the official Apache source release to be deployed to
 dist.apache.org
 >  [2], which is signed with the key with
 > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
 > * all artifacts to be deployed to the Maven Central Repository [4],
 > * source code tag "v2.13.0-RC2" [5],
 > * website pull request listing the release [6] and publishing the API
 > reference manual [7].
 > * Python artifacts are deployed along with the source release to the
 > dist.apache.org  [2].
 > * Validation sheet with a tab for 2.13.0 release to help with
 validation
 > [8].
 >
 > The vote will be open for at least 72 hours. It is adopted by majority
 > approval, with at least 3 PMC affirmative votes.
 >
 > Thanks,
 > Ankur
 >
 > [1]
 >
 https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166
 > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
 > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 > [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1070/
 > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
 > [6] https://github.com/apache/beam/pull/8645
 > [7] https://github.com/apache/beam-site/pull/589
 > [8]
 >
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

>>>


Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Kenneth Knowles
On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax  wrote:

> So I feel a bit leery about making the to/from functions a fundamental
> part of the portability representation. In my mind, that is very tied to a
> specific SDK/language. A SDK (say the Java SDK) wants to allow users to use
> a wide variety of native types with schemas, and under the covers uses the
> to/from functions to implement that. However from the portable Beam
> perspective, the schema itself should be the real "type" of the
> PCollection; the to/from methods are simply a way that a particular SDK
> makes schemas easier to use. It has to go into the proto somewhere (since
> that's the only way the SDK can get it), but I'm not sure they should be
> considered integral parts of the type.
>

On the doc in a couple places this distinction was made:

* For truly portable logical types, no instructions for the SDK are needed.
Instead, they require:
   - URN: a standardized identifier any SDK can recognize
   - A spec: what is the universe of values in this type?
   - A representation: how is it represented in built-in types? This is how
SDKs who do not know/care about the URN will process it
   - (optional): SDKs choose preferred SDK-specific types to embed the
values in. SDKs have to know about the URN and choose for themselves.

* For user-level type aliases, written as a convenience by the user in their
pipeline, what Java schemas have today:
   - to/from UDFs: the code is SDK-specific
   - some representation of the intended type (like java class): also SDK
specific
   - a representation
   - any "id" is just like other ids in the pipeline, just avoiding
duplicating the proto
   - Luke points out that nesting these can give multiple SDKs a hint
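The two flavors above can be sketched as plain data (hypothetical shapes and names, not the actual proto):

```python
import datetime
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class PortableLogicalType:
    """Identified by URN + representation only; no SDK-specific code."""
    urn: str              # e.g. "beam:logical_type:micros_instant:v1" (illustrative)
    representation: str   # a built-in type, e.g. "INT64"

@dataclass(frozen=True)
class UserTypeAlias:
    """Carries SDK-specific to/from UDFs plus the intended language type."""
    representation: str
    language_type: type
    to_representation: Callable[[Any], Any]
    from_representation: Callable[[Any], Any]

# Example alias: datetime <-> microseconds since epoch (INT64 representation).
EPOCH = datetime.datetime(1970, 1, 1)
micros_alias = UserTypeAlias(
    representation="INT64",
    language_type=datetime.datetime,
    to_representation=lambda dt: int((dt - EPOCH).total_seconds() * 1_000_000),
    from_representation=lambda us: EPOCH + datetime.timedelta(microseconds=us),
)

dt = datetime.datetime(2019, 6, 3, 12, 0, 0)
roundtrip = micros_alias.from_representation(micros_alias.to_representation(dt))
```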

In my mind the remaining complexity is whether or not we need to be able to
move between the two. Composite PTransforms, for example, do have fluidity
between being strictly user-defined versus portable URN+payload. But it
requires lots of engineering, namely the current work on expansion service.

Kenn


> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette  wrote:
>
>> Ah I see, I didn't realize that. Then I suppose we'll need to/from
>> functions somewhere in the logical type conversion to preserve the current
>> behavior.
>>
>> I'm still a little hesitant to make these functions an explicit part of
>> LogicalTypeConversion for another reason. Down the road, schemas could give
>> us an avenue to use a batched columnar format (presumably arrow, but of
>> course others are possible). By making to/from an explicit part of logical
>> types we add some element-wise logic to a schema representation that's
>> otherwise ambivalent to element-wise vs. batched encodings.
>>
>> I suppose you could make an argument that to/from are only for
>> custom types. There will also be some set of well-known types identified
>> only by URN and some parameters, which could easily be translated to a
>> columnar format. We could just not support custom types fully if we add a
>> columnar encoding, or maybe add optional toBatch/fromBatch functions
>> when/if we get there.
>>
>> What about something like this that makes the two different types of
>> logical types explicit?
>>
>> // Describes a logical type and how to convert between it and its
>> representation (e.g. Row).
>> message LogicalTypeConversion {
>>   oneof conversion {
>> message Standard standard = 1;
>> message Custom custom = 2;
>>   }
>>
>>   message Standard {
>> String urn = 1;
>> repeated string args = 2; // could also be a map
>>   }
>>
>>   message Custom {
>> FunctionSpec(?) toRepresentation = 1;
>> FunctionSpec(?) fromRepresentation = 2;
>> bytes type = 3; // e.g. serialized class for Java
>>   }
>> }
>>
>> And LogicalType and Schema become:
>>
>> message LogicalType {
>>   FieldType representation = 1;
>>   LogicalTypeConversion conversion = 2;
>> }
>>
>> message Schema {
>>   ...
>>   repeated Field fields = 1;
>>   LogicalTypeConversion conversion = 2; // implied that representation is
>> Row
>> }
>>
>> Brian
>>
>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax  wrote:
>>
>>> Keep in mind that right now the SchemaRegistry is only assumed to exist
>>> at graph-construction time, not at execution time; all information in the
>>> schema registry is embedded in the SchemaCoder, which is the only thing we
>>> keep around when the pipeline is actually running. We could look into
>>> changing this, but it would potentially be a very big change, and I do
>>> think we should start getting users actively using schemas soon.
>>>
>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette 
>>> wrote:
>>>
 > Can you propose what the protos would look like in this case? Right
 now LogicalType does not contain the to/from conversion functions in the
 proto. Do you think we'll need to add these in?

 Maybe. Right now the proposed LogicalType message is pretty
 simple/generic:
 message LogicalType {
   FieldType representation = 1;

Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Lukasz Cwik
Ankur, did you mean to say you're going to close the vote today at 5pm? (and
then complete the release afterwards)

On Mon, Jun 3, 2019 at 10:54 AM Ankur Goenka  wrote:

> Thanks for validating and voting.
>
> We have 4 binding votes.
> I will complete the release today 5PM. Please raise any concerns before
> that.
>
> Thanks,
> Ankur
>
> On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:
>
>> Since the gearpump issue has been ongoing since 2.10, I can't consider it
>> a blocker for this release and am voting +1.
>>
>> On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Quickly tested on beam-samples.
>>>
>>> Regards
>>> JB
>>>
>>> On 31/05/2019 04:52, Ankur Goenka wrote:
>>> > Hi everyone,
>>> >
>>> > Please review and vote on the release candidate #2 for the version
>>> > 2.13.0, as follows:
>>> >
>>> > [ ] +1, Approve the release
>>> > [ ] -1, Do not approve the release (please provide specific comments)
>>> >
>>> > The complete staging area is available for your review, which includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to dist.apache.org
>>> >  [2], which is signed with the key with
>>> > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.13.0-RC2" [5],
>>> > * website pull request listing the release [6] and publishing the API
>>> > reference manual [7].
>>> > * Python artifacts are deployed along with the source release to the
>>> > dist.apache.org  [2].
>>> > * Validation sheet with a tab for 2.13.0 release to help with
>>> validation
>>> > [8].
>>> >
>>> > The vote will be open for at least 72 hours. It is adopted by majority
>>> > approval, with at least 3 PMC affirmative votes.
>>> >
>>> > Thanks,
>>> > Ankur
>>> >
>>> > [1]
>>> >
>>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1070/
>>> > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
>>> > [6] https://github.com/apache/beam/pull/8645
>>> > [7] https://github.com/apache/beam-site/pull/589
>>> > [8]
>>> >
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Ankur Goenka
Thanks for validating and voting.

We have 4 binding votes.
I will complete the release today 5PM. Please raise any concerns before
that.

Thanks,
Ankur

On Mon, Jun 3, 2019 at 8:36 AM Lukasz Cwik  wrote:

> Since the gearpump issue has been ongoing since 2.10, I can't consider it
> a blocker for this release and am voting +1.
>
> On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (binding)
>>
>> Quickly tested on beam-samples.
>>
>> Regards
>> JB
>>
>> On 31/05/2019 04:52, Ankur Goenka wrote:
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #2 for the version
>> > 2.13.0, as follows:
>> >
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific comments)
>> >
>> > The complete staging area is available for your review, which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to dist.apache.org
>> >  [2], which is signed with the key with
>> > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.13.0-RC2" [5],
>> > * website pull request listing the release [6] and publishing the API
>> > reference manual [7].
>> > * Python artifacts are deployed along with the source release to the
>> > dist.apache.org  [2].
>> > * Validation sheet with a tab for 2.13.0 release to help with validation
>> > [8].
>> >
>> > The vote will be open for at least 72 hours. It is adopted by majority
>> > approval, with at least 3 PMC affirmative votes.
>> >
>> > Thanks,
>> > Ankur
>> >
>> > [1]
>> >
>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1070/
>> > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
>> > [6] https://github.com/apache/beam/pull/8645
>> > [7] https://github.com/apache/beam-site/pull/589
>> > [8]
>> >
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Reuven Lax
So I feel a bit leery about making the to/from functions a fundamental part
of the portability representation. In my mind, that is very tied to a
specific SDK/language. An SDK (say the Java SDK) wants to allow users to use
a wide variety of native types with schemas, and under the covers uses the
to/from functions to implement that. However from the portable Beam
perspective, the schema itself should be the real "type" of the
PCollection; the to/from methods are simply a way that a particular SDK
makes schemas easier to use. It has to go into the proto somewhere (since
that's the only way the SDK can get it), but I'm not sure they should be
considered integral parts of the type.
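
To make the distinction concrete, here is a minimal sketch (Python, with hypothetical names — not the actual Beam SDK API) of the layering being described: the portable "type" is the schema plus the row encoding, while the to/from functions live only in an SDK-side coder that adapts a native user type onto rows:

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]

class RowCoder:
    """Encodes rows for a fixed schema (a list of (name, type) pairs).
    Toy encoding only; a real coder would use the portable row format."""
    def __init__(self, schema: List[Tuple[str, type]]):
        self.schema = schema

    def encode(self, row: Row) -> bytes:
        return repr([row[name] for name, _ in self.schema]).encode()

    def decode(self, data: bytes) -> Row:
        values = eval(data.decode())  # toy decoding, for illustration only
        return {name: v for (name, _), v in zip(self.schema, values)}

class SchemaCoder:
    """SDK-side convenience: composes a RowCoder with to/from functions
    so users work with a native type T, while the wire format (and the
    portable 'type' of the PCollection) remains the schema/Row."""
    def __init__(self, row_coder: RowCoder,
                 to_row: Callable[[Any], Row],
                 from_row: Callable[[Row], Any]):
        self.row_coder = row_coder
        self.to_row = to_row
        self.from_row = from_row

    def encode(self, value: Any) -> bytes:
        return self.row_coder.encode(self.to_row(value))

    def decode(self, data: bytes) -> Any:
        return self.from_row(self.row_coder.decode(data))

# Example: a user type mapped onto a two-field schema.
class User:
    def __init__(self, name: str, age: int):
        self.name, self.age = name, age

coder = SchemaCoder(
    RowCoder([("name", str), ("age", int)]),
    to_row=lambda u: {"name": u.name, "age": u.age},
    from_row=lambda r: User(r["name"], r["age"]),
)

roundtripped = coder.decode(coder.encode(User("ada", 36)))
print(roundtripped.name, roundtripped.age)  # -> ada 36
```

Nothing downstream of `encode` ever sees `User`; a runner (or another SDK) only needs the schema and the row bytes, which is the point being argued here.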

On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette  wrote:

> Ah I see, I didn't realize that. Then I suppose we'll need to/from
> functions somewhere in the logical type conversion to preserve the current
> behavior.
>
> I'm still a little hesitant to make these functions an explicit part of
> LogicalTypeConversion for another reason. Down the road, schemas could give
> us an avenue to use a batched columnar format (presumably arrow, but of
> course others are possible). By making to/from an explicit part of logical
> types we add some element-wise logic to a schema representation that's
> otherwise ambivalent to element-wise vs. batched encodings.
>
> I suppose you could make an argument that to/from are only for
> custom types. There will also be some set of well-known types identified
> only by URN and some parameters, which could easily be translated to a
> columnar format. We could just not support custom types fully if we add a
> columnar encoding, or maybe add optional toBatch/fromBatch functions
> when/if we get there.
>
> What about something like this that makes the two different types of
> logical types explicit?
>
> // Describes a logical type and how to convert between it and its
> representation (e.g. Row).
> message LogicalTypeConversion {
>   oneof conversion {
> Standard standard = 1;
> Custom custom = 2;
>   }
>
>   message Standard {
> string urn = 1;
> repeated string args = 2; // could also be a map
>   }
>
>   message Custom {
> FunctionSpec(?) toRepresentation = 1;
> FunctionSpec(?) fromRepresentation = 2;
> bytes type = 3; // e.g. serialized class for Java
>   }
> }
>
> And LogicalType and Schema become:
>
> message LogicalType {
>   FieldType representation = 1;
>   LogicalTypeConversion conversion = 2;
> }
>
> message Schema {
>   ...
>   repeated Field fields = 1;
>   LogicalTypeConversion conversion = 2; // implied that representation is
> Row
> }
>
> Brian
>
> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax  wrote:
>
>> Keep in mind that right now the SchemaRegistry is only assumed to exist
>> at graph-construction time, not at execution time; all information in the
>> schema registry is embedded in the SchemaCoder, which is the only thing we
>> keep around when the pipeline is actually running. We could look into
>> changing this, but it would potentially be a very big change, and I do
>> think we should start getting users actively using schemas soon.
>>
>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette 
>> wrote:
>>
>>> > Can you propose what the protos would look like in this case? Right
>>> now LogicalType does not contain the to/from conversion functions in the
>>> proto. Do you think we'll need to add these in?
>>>
>>> Maybe. Right now the proposed LogicalType message is pretty
>>> simple/generic:
>>> message LogicalType {
>>>   FieldType representation = 1;
>>>   string logical_urn = 2;
>>>   bytes logical_payload = 3;
>>> }
>>>
>>> If we keep just logical_urn and logical_payload, the logical_payload
>>> could itself be a protobuf with attributes of 1) a serialized class and
>>> 2/3) to/from functions. Or, alternatively, we could have a generalization
>>> of the SchemaRegistry for logical types. Implementations for standard types
>>> and user-defined types would be registered by URN, and the SDK could look
>>> them up given just a URN. I put a brief section about this alternative in
>>> the doc last week [1]. What I suggested there included removing the
>>> logical_payload field, which is probably overkill. The critical piece is
>>> just relying on a registry in the SDK to look up types and to/from
>>> functions rather than storing them in the portable schema itself.
>>>
>>> I kind of like keeping the LogicalType message generic for now, since it
>>> gives us a way to try out these various approaches, but maybe that's just a
>>> cop out.
>>>
>>> [1]
>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>>
>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax  wrote:
>>>


 On Tue, May 28, 2019 at 10:11 AM Brian Hulette 
 wrote:

>
>
> On Sun, May 26, 2019 at 1:25 PM Reuven Lax  wrote:
>
>>
>>
>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette 
>> 

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Brian Hulette
Ah I see, I didn't realize that. Then I suppose we'll need to/from
functions somewhere in the logical type conversion to preserve the current
behavior.

I'm still a little hesitant to make these functions an explicit part of
LogicalTypeConversion for another reason. Down the road, schemas could give
us an avenue to use a batched columnar format (presumably arrow, but of
course others are possible). By making to/from an explicit part of logical
types we add some element-wise logic to a schema representation that's
otherwise ambivalent to element-wise vs. batched encodings.

I suppose you could make an argument that to/from are only for
custom types. There will also be some set of well-known types identified
only by URN and some parameters, which could easily be translated to a
columnar format. We could just not support custom types fully if we add a
columnar encoding, or maybe add optional toBatch/fromBatch functions
when/if we get there.

What about something like this that makes the two different types of
logical types explicit?

// Describes a logical type and how to convert between it and its
representation (e.g. Row).
message LogicalTypeConversion {
  oneof conversion {
Standard standard = 1;
Custom custom = 2;
  }

  message Standard {
string urn = 1;
repeated string args = 2; // could also be a map
  }

  message Custom {
FunctionSpec(?) toRepresentation = 1;
FunctionSpec(?) fromRepresentation = 2;
bytes type = 3; // e.g. serialized class for Java
  }
}

And LogicalType and Schema become:

message LogicalType {
  FieldType representation = 1;
  LogicalTypeConversion conversion = 2;
}

message Schema {
  ...
  repeated Field fields = 1;
  LogicalTypeConversion conversion = 2; // implied that representation is
Row
}
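
As a rough sketch of the registry alternative mentioned earlier in this thread (Python, with hypothetical names and an illustrative URN — not actual Beam code): the SDK keeps a map from URN to conversion functions, so the portable schema only needs to carry the URN and the representation type, not the functions themselves:

```python
import uuid
from typing import Any, Callable, Dict, Tuple

# Registry mapping a logical-type URN to (to_representation,
# from_representation) functions. Standard types and user types would
# register here; the portable schema carries only the URN.
_LOGICAL_TYPES: Dict[str, Tuple[Callable, Callable]] = {}

def register_logical_type(urn: str, to_rep: Callable, from_rep: Callable) -> None:
    _LOGICAL_TYPES[urn] = (to_rep, from_rep)

def to_representation(urn: str, value: Any) -> Any:
    return _LOGICAL_TYPES[urn][0](value)

def from_representation(urn: str, rep: Any) -> Any:
    return _LOGICAL_TYPES[urn][1](rep)

# Example: a UUID logical type whose representation is 16 raw bytes.
# The URN is made up for illustration.
register_logical_type(
    "example:logical_type:uuid:v1",
    to_rep=lambda u: u.bytes,
    from_rep=lambda b: uuid.UUID(bytes=b),
)

u = uuid.UUID("12345678-1234-5678-1234-567812345678")
rep = to_representation("example:logical_type:uuid:v1", u)
assert from_representation("example:logical_type:uuid:v1", rep) == u
```

The trade-off this illustrates: with a registry, a "Standard" type is fully described by its URN (plus args), but every SDK that wants to materialize the native type must have the URN registered — a "Custom" type with serialized functions avoids that requirement at the cost of being SDK-specific.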

Brian

On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax  wrote:

> Keep in mind that right now the SchemaRegistry is only assumed to exist at
> graph-construction time, not at execution time; all information in the
> schema registry is embedded in the SchemaCoder, which is the only thing we
> keep around when the pipeline is actually running. We could look into
> changing this, but it would potentially be a very big change, and I do
> think we should start getting users actively using schemas soon.
>
> On Fri, May 31, 2019 at 3:40 PM Brian Hulette  wrote:
>
>> > Can you propose what the protos would look like in this case? Right now
>> LogicalType does not contain the to/from conversion functions in the proto.
>> Do you think we'll need to add these in?
>>
>> Maybe. Right now the proposed LogicalType message is pretty
>> simple/generic:
>> message LogicalType {
>>   FieldType representation = 1;
>>   string logical_urn = 2;
>>   bytes logical_payload = 3;
>> }
>>
>> If we keep just logical_urn and logical_payload, the logical_payload
>> could itself be a protobuf with attributes of 1) a serialized class and
>> 2/3) to/from functions. Or, alternatively, we could have a generalization
>> of the SchemaRegistry for logical types. Implementations for standard types
>> and user-defined types would be registered by URN, and the SDK could look
>> them up given just a URN. I put a brief section about this alternative in
>> the doc last week [1]. What I suggested there included removing the
>> logical_payload field, which is probably overkill. The critical piece is
>> just relying on a registry in the SDK to look up types and to/from
>> functions rather than storing them in the portable schema itself.
>>
>> I kind of like keeping the LogicalType message generic for now, since it
>> gives us a way to try out these various approaches, but maybe that's just a
>> cop out.
>>
>> [1]
>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>
>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette 
>>> wrote:
>>>


 On Sun, May 26, 2019 at 1:25 PM Reuven Lax  wrote:

>
>
> On Fri, May 24, 2019 at 11:42 AM Brian Hulette 
> wrote:
>
>> *tl;dr:* SchemaCoder represents a logical type with a base type of
>> Row and we should think about that.
>>
>> I'm a little concerned that the current proposals for a portable
>> representation don't actually fully represent Schemas. It seems to me 
>> that
>> the current java-only Schemas are made up three concepts that are
>> intertwined:
>> (a) The Java SDK specific code for schema inference, type coercion,
>> and "schema-aware" transforms.
>> (b) A RowCoder[1] that encodes Rows[2] which have a particular
>> Schema[3].
>> (c) A SchemaCoder[4] that has a RowCoder for a particular schema, and
>> functions for converting Rows with that schema to/from a Java type T. 
>> Those
>> functions and the RowCoder are then composed to provide a Coder for the
>> type T.
>>
>
> RowCoder is currently just an 

Re: 1 Million Lines of Code (1 MLOC)

2019-06-03 Thread Kyle Weaver
> time to delete the entire project and start over again

Agreed, but this time using Rust. (Just think of all the good press we'll
get on Hacker News!)

@ruoyun looks like the c++ is a basic `echo` program for an example
pipeline?
https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/subprocess

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com |
+1650203


On Mon, Jun 3, 2019 at 10:11 AM Ruoyun Huang  wrote:

> interesting stats.
>
> I am very curious in what we can benefit from merely *32* lines of c++
> code in a MLOC repository.
>
> On Mon, Jun 3, 2019 at 2:10 AM Maximilian Michels  wrote:
>
>> Interesting stats :) This metric does not take into account Beam's
>> dependencies, e.g. libraries and execution backends. That would increase
>> the LOCs to millions.
>>
>> On 01.06.19 01:46, Alex Amato wrote:
>> > Interesting, so if we play with https://github.com/cgag/loc we could
>> > break it down further? I.e. test files vs code files? Which folders,
>> > etc. That could be interesting as well.
>> >
>> > On Fri, May 31, 2019 at 4:20 PM Brian Hulette > > > wrote:
>> >
>> > Dennis Nedry needed 2 million lines of code to control Jurassic
>> > Park, and he only had to manage eight computers! I think we may
>> > actually need to pick up the pace.
>> >
>> > On Fri, May 31, 2019 at 4:11 PM Anton Kedin > > > wrote:
>> >
>> > And to reduce the effort of future rewrites we should start
>> > doing it on a schedule. I propose we start over once a week :)
>> >
>> > On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik > > > wrote:
>> >
>> > 1 million lines is too much, time to delete the entire
>> > project and start over again, :-)
>> >
>> > On Fri, May 31, 2019 at 3:12 PM Ankur Goenka
>> > mailto:goe...@google.com>> wrote:
>> >
>> > Thanks for sharing.
>> > This is really interesting metrics.
>> > One use I can see is to track LOC vs Comments to make
>> > sure that we keep up with the practice of writing
>> > maintainable code.
>> >
>> > On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía
>> > mailto:ieme...@gmail.com>> wrote:
>> >
>> > I was checking some metrics in our codebase and
>> > found by chance that
>> > we have passed the 1 million lines of code (MLOC).
>> > Of course lines of
>> > code may not matter much but anyway it is
>> > interesting to see the size
>> > of our project at this moment.
>> >
>> > This is the detailed information returned by loc [1]:
>> >
>> >   Language             Files        Lines        Blank      Comment         Code
>> >
>> >   Java                  3681       673007        78265       140753       453989
>> >   Python                 497       131082        22560        13378        95144
>> >   Go                     333       105775        13681        11073        81021
>> >   Markdown               205        31989         6526            0        25463
>> >   Plain Text              11        21979         6359            0        15620
>> >   Sass                    92         9867         1434         1900         6533
>> >   JavaScript              19         5157         1197          467         3493
>> >   YAML                    14         4601          454         1104         3043
>> >   Bourne Shell            30         3874          470         1028         2376
>> >   Protobuf                17         4258          677         1373         2208
>> >   XML                     17         2789          296          559         1934
>> >   Kotlin                  19         3501          347         1370         1784
>> >   HTML                    60         2447          148          914         1385
>> >   Batch                    3          249           57            0          192
>> >   INI

Re: 1 Million Lines of Code (1 MLOC)

2019-06-03 Thread Ruoyun Huang
interesting stats.

I am very curious in what we can benefit from merely *32* lines of c++ code
in a MLOC repository.

On Mon, Jun 3, 2019 at 2:10 AM Maximilian Michels  wrote:

> Interesting stats :) This metric does not take into account Beam's
> dependencies, e.g. libraries and execution backends. That would increase
> the LOCs to millions.
>
> On 01.06.19 01:46, Alex Amato wrote:
> > Interesting, so if we play with https://github.com/cgag/loc we could
> > break it down further? I.e. test files vs code files? Which folders,
> > etc. That could be interesting as well.
> >
> > On Fri, May 31, 2019 at 4:20 PM Brian Hulette  > > wrote:
> >
> > Dennis Nedry needed 2 million lines of code to control Jurassic
> > Park, and he only had to manage eight computers! I think we may
> > actually need to pick up the pace.
> >
> > On Fri, May 31, 2019 at 4:11 PM Anton Kedin  > > wrote:
> >
> > And to reduce the effort of future rewrites we should start
> > doing it on a schedule. I propose we start over once a week :)
> >
> > On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik  > > wrote:
> >
> > 1 million lines is too much, time to delete the entire
> > project and start over again, :-)
> >
> > On Fri, May 31, 2019 at 3:12 PM Ankur Goenka
> > mailto:goe...@google.com>> wrote:
> >
> > Thanks for sharing.
> > This is really interesting metrics.
> > One use I can see is to track LOC vs Comments to make
> > sure that we keep up with the practice of writing
> > maintainable code.
> >
> > On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía
> > mailto:ieme...@gmail.com>> wrote:
> >
> > I was checking some metrics in our codebase and
> > found by chance that
> > we have passed the 1 million lines of code (MLOC).
> > Of course lines of
> > code may not matter much but anyway it is
> > interesting to see the size
> > of our project at this moment.
> >
> > This is the detailed information returned by loc [1]:
> >
> >
> >   Language             Files        Lines        Blank      Comment         Code
> >
> >   Java                  3681       673007        78265       140753       453989
> >   Python                 497       131082        22560        13378        95144
> >   Go                     333       105775        13681        11073        81021
> >   Markdown               205        31989         6526            0        25463
> >   Plain Text              11        21979         6359            0        15620
> >   Sass                    92         9867         1434         1900         6533
> >   JavaScript              19         5157         1197          467         3493
> >   YAML                    14         4601          454         1104         3043
> >   Bourne Shell            30         3874          470         1028         2376
> >   Protobuf                17         4258          677         1373         2208
> >   XML                     17         2789          296          559         1934
> >   Kotlin                  19         3501          347         1370         1784
> >   HTML                    60         2447          148          914         1385
> >   Batch                    3          249           57            0          192
> >   INI                      1          206           21           16          169
> >   C++                      2           72            4           36           32
> >   Autoconf                 1           21            1           16            4
> >
> >   Total                 5002      1000874       132497       173987       694390
> >

Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Lukasz Cwik
Since the gearpump issue has been ongoing since 2.10, I can't consider it a
blocker for this release and am voting +1.

On Mon, Jun 3, 2019 at 7:13 AM Jean-Baptiste Onofré  wrote:

> +1 (binding)
>
> Quickly tested on beam-samples.
>
> Regards
> JB
>
> On 31/05/2019 04:52, Ankur Goenka wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> > 2.13.0, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> >  [2], which is signed with the key with
> > fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.13.0-RC2" [5],
> > * website pull request listing the release [6] and publishing the API
> > reference manual [7].
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org  [2].
> > * Validation sheet with a tab for 2.13.0 release to help with validation
> > [8].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Ankur
> >
> > [1]
> >
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1070/
> > [5] https://github.com/apache/beam/tree/v2.13.0-RC2
> > [6] https://github.com/apache/beam/pull/8645
> > [7] https://github.com/apache/beam-site/pull/589
> > [8]
> >
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Jean-Baptiste Onofré
+1 (binding)

Quickly tested on beam-samples.

Regards
JB

On 31/05/2019 04:52, Ankur Goenka wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #2 for the version
> 2.13.0, as follows:
> 
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
>  [2], which is signed with the key with
> fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.13.0-RC2" [5],
> * website pull request listing the release [6] and publishing the API
> reference manual [7].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org  [2].
> * Validation sheet with a tab for 2.13.0 release to help with validation
> [8].
> 
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> 
> Thanks,
> Ankur
> 
> [1]
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1070/
> [5] https://github.com/apache/beam/tree/v2.13.0-RC2
> [6] https://github.com/apache/beam/pull/8645
> [7] https://github.com/apache/beam-site/pull/589
> [8]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Beam Dependency Check Report (2019-06-03)

2019-06-03 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name        Current Version  Latest Version  Current Release Date  Latest Release Date  JIRA Issue
  google-cloud-bigquery  1.6.1            1.13.0          2019-01-21            2019-06-03           BEAM-5537
  google-cloud-core      0.29.1           1.0.1           2019-02-04            2019-06-03           BEAM-5538
  oauth2client           3.0.0            4.1.3           2018-12-10            2018-12-10           BEAM-6089

High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name                                                            Current Version   Latest Version    Current Release Date  Latest Release Date  JIRA Issue
  com.google.auto.service:auto-service                                       1.0-rc2           1.0-rc5           2014-10-25            2019-03-25           BEAM-5541
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin  0.17.0            0.21.0            2019-02-11            2019-03-04           BEAM-6645
  org.conscrypt:conscrypt-openjdk                                            1.1.3             2.1.0             2018-06-04            2019-04-03           BEAM-5748
  javax.servlet:javax.servlet-api                                            3.1.0             4.0.1             2013-04-25            2018-04-20           BEAM-5750
  org.eclipse.jetty:jetty-server                                             9.2.10.v20150310  9.4.18.v20190429  2015-03-10            2019-04-29           BEAM-5752
  org.eclipse.jetty:jetty-servlet                                            9.2.10.v20150310  9.4.18.v20190429  2015-03-10            2019-04-29           BEAM-5753
  junit:junit                                                                4.13-beta-1       4.13-beta-3       2018-11-25            2019-05-05           BEAM-6127
  com.github.spotbugs:spotbugs-annotations                                   3.1.11            4.0.0-beta2       2019-01-21            2019-05-22           BEAM-6951

A dependency update is high priority if it satisfies one of the following criteria:

  - It has a major version update available, e.g. org.assertj:assertj-core 2.5.0 -> 3.10.0;
  - It is over 3 minor versions behind the latest version, e.g. org.tukaani:xz 1.5 -> 1.8;
  - The current version has been behind the latest version for over 180 days, e.g. com.google.auto.service:auto-service 2014-10-24 -> 2017-12-11.

In Beam, we make a best-effort attempt at keeping all dependencies up-to-date.
In the future, issues will be filed and tracked for these automatically,
but in the meantime you can search for existing issues or open a new one.

For more information: Beam Dependency Guide
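
The high-priority criteria used by this report can be sketched as a small check (a hedged illustration, not the actual report generator; it assumes simple numeric "major.minor[.patch]" version strings, which not all listed dependencies use):

```python
from datetime import date

def is_high_priority(current: str, latest: str,
                     current_release: date, latest_release: date) -> bool:
    """True if a dependency update meets any high-priority criterion:
      1. a major version update is available,
      2. the current version is more than 3 minor versions behind,
      3. the current release is more than 180 days older than the latest.
    Assumes plain numeric 'major.minor[.patch]' version strings."""
    cur = [int(p) for p in current.split(".")[:2]]
    new = [int(p) for p in latest.split(".")[:2]]
    if new[0] > cur[0]:
        return True  # major version bump available
    if new[0] == cur[0] and new[1] - cur[1] > 3:
        return True  # more than 3 minor versions behind
    if (latest_release - current_release).days > 180:
        return True  # current release is stale
    return False

# oauth2client 3.0.0 -> 4.1.3 qualifies via the major-version rule.
print(is_high_priority("3.0.0", "4.1.3",
                       date(2018, 12, 10), date(2018, 12, 10)))  # -> True
```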

Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Maximilian Michels

+1 (binding)

Tested Flink Runner local/cluster execution with the included examples 
and all supported Flink versions.


There is an issue with the staging for remote execution but it is not a 
blocker since an alternative way exists: 
https://jira.apache.org/jira/browse/BEAM-7478



Reminder: The voting closes on 2nd June so please validate and vote by then.


We generally do not include weekends in the minimum voting period of
72 hours. I'd propose to leave the vote open at least until Wednesday 
04:52 CEST which would be 72 hours excluding the weekend. Btw, thank you 
for all your work on preparing the RC!


-Max

On 03.06.19 09:33, Robert Bradshaw wrote:

+1

I validated the artifacts and Python 3.

On Sat, Jun 1, 2019 at 7:45 PM Ankur Goenka  wrote:


Thanks Ahmet and Luke for validation.

If no one has objections then I am planning to move ahead without Gearpump 
validation as it seems to be broken from past multiple releases.

Reminder: The voting closes on 2nd June so please validate and vote by then.

On Fri, May 31, 2019 at 10:43 AM Ahmet Altay  wrote:


+1

I validated python 2 quickstarts.

On Fri, May 31, 2019 at 10:22 AM Lukasz Cwik  wrote:


I did the Java local quickstart for all the runners in the release validation 
sheet and gearpump failed for me due to a missing dependency. Even after I 
fixed up the dependency, the pipeline then got stuck. I filed BEAM-7467 with 
all the details.

Note that I tried the quickstart for 2.8.0 through 2.12.0
2.8.0 and 2.9.0 failed due to a timeout (maybe I was using the wrong command 
but this test[1] suggests that I was using a correct one)
2.10.0 and higher fail due to the missing gs-collections dependency.

Manu, could you help figure out what is going on?

1: 
https://github.com/apache/beam/blob/2d3bcdc542536037c3e657a8b00ebc222487476b/release/src/main/groovy/quickstart-java-gearpump.groovy#L33

On Thu, May 30, 2019 at 7:53 PM Ankur Goenka  wrote:


Hi everyone,

Please review and vote on the release candidate #2 for the version 2.13.0, as 
follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2], 
which is signed with the key with fingerprint 
6356C1A9F089B0FA3DE8753688934A6699985948 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.13.0-RC2" [5],
* website pull request listing the release [6] and publishing the API reference 
manual [7].
* Python artifacts are deployed along with the source release to the 
dist.apache.org [2].
* Validation sheet with a tab for 2.13.0 release to help with validation [8].

The vote will be open for at least 72 hours. It is adopted by majority 
approval, with at least 3 PMC affirmative votes.

Thanks,
Ankur

[1] 
https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
[2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1070/
[5] https://github.com/apache/beam/tree/v2.13.0-RC2
[6] https://github.com/apache/beam/pull/8645
[7] https://github.com/apache/beam-site/pull/589
[8] 
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952


Re: 1 Million Lines of Code (1 MLOC)

2019-06-03 Thread Maximilian Michels
Interesting stats :) This metric does not take into account Beam's 
dependencies, e.g. libraries and execution backends. That would increase 
the LOCs to millions.


On 01.06.19 01:46, Alex Amato wrote:
Interesting, so if we play with https://github.com/cgag/loc we could 
break it down further? I.e. test files vs code files? Which folders, 
etc. That could be interesting as well.


On Fri, May 31, 2019 at 4:20 PM Brian Hulette > wrote:


Dennis Nedry needed 2 million lines of code to control Jurassic
Park, and he only had to manage eight computers! I think we may
actually need to pick up the pace.

On Fri, May 31, 2019 at 4:11 PM Anton Kedin mailto:ke...@google.com>> wrote:

And to reduce the effort of future rewrites we should start
doing it on a schedule. I propose we start over once a week :)

On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik mailto:lc...@google.com>> wrote:

1 million lines is too much, time to delete the entire
project and start over again, :-)

On Fri, May 31, 2019 at 3:12 PM Ankur Goenka
mailto:goe...@google.com>> wrote:

Thanks for sharing.
This is really interesting metrics.
One use I can see is to track LOC vs Comments to make
sure that we keep up with the practice of writing
maintainable code.

On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía
mailto:ieme...@gmail.com>> wrote:

I was checking some metrics in our codebase and
found by chance that
we have passed the 1 million lines of code (MLOC).
Of course lines of
code may not matter much but anyway it is
interesting to see the size
of our project at this moment.

This is the detailed information returned by loc [1]:



  Language             Files        Lines        Blank      Comment         Code
 --------------------------------------------------------------------------------
  Java                  3681       673007        78265       140753       453989
  Python                 497       131082        22560        13378        95144
  Go                     333       105775        13681        11073        81021
  Markdown               205        31989         6526            0        25463
  Plain Text              11        21979         6359            0        15620
  Sass                    92         9867         1434         1900         6533
  JavaScript              19         5157         1197          467         3493
  YAML                    14         4601          454         1104         3043
  Bourne Shell            30         3874          470         1028         2376
  Protobuf                17         4258          677         1373         2208
  XML                     17         2789          296          559         1934
  Kotlin                  19         3501          347         1370         1784
  HTML                    60         2447          148          914         1385
  Batch                    3          249           57            0          192
  INI                      1          206           21           16          169
  C++                      2           72            4           36           32
  Autoconf                 1           21            1           16            4
 --------------------------------------------------------------------------------
  Total                 5002      1000874       132497       173987       694390




[1] https://github.com/cgag/loc



Re: Timer support in Flink

2019-06-03 Thread Maximilian Michels
Good point. I think I discovered the detailed view when I made changes 
to the source code. Classic tunnel-vision problem :)


On 30.05.19 12:57, Reza Rokni wrote:

:-)

https://issues.apache.org/jira/browse/BEAM-7456

On Thu, 30 May 2019 at 18:41, Alex Van Boxel wrote:


Oh... you can expand the matrix. Never saw that, this could indeed
be better. So it isn't you.

  _/
_/ Alex Van Boxel


On Thu, May 30, 2019 at 12:24 PM Reza Rokni <r...@google.com> wrote:

PS, until it was just pointed out to me by Max, I had missed the
(expand details) clickable link in the capability matrix.

Probably just me, but do others think it's also easy to miss? If
yes I will raise a Jira for it

On Wed, 29 May 2019 at 19:52, Reza Rokni <r...@google.com> wrote:

Thanx Max!

Reza

On Wed, 29 May 2019, 16:38 Maximilian Michels <m...@apache.org> wrote:

Hi Reza,

The detailed view of the capability matrix states: "The
Flink Runner
supports timers in non-merging windows."

That is still the case. Other than that, timers should
be working fine.

 > It makes very heavy use of Event.Time timers and has
to do some manual DoFn cache work to get around some
O(heavy) issues.

If you are running on Flink 1.5, timer deletion suffers
from O(n)
complexity which has been fixed in newer versions.

Cheers,
Max

On 29.05.19 03:27, Reza Rokni wrote:
 > Hi Flink experts,
 >
 > I am getting ready to push a PR around a utility
class for timeseries join
 >
 > left.timestamp match to closest right.timestamp where
right.timestamp <=
 > left.timestamp.
 >
 > It makes very heavy use of Event.Time timers and has
to do some manual
 > DoFn cache work to get around some O(heavy) issues.
Wanted to test
 > things against Flink: In the capability matrix we
have "~" for Timer
 > support in Flink:
 >
 >
https://beam.apache.org/documentation/runners/capability-matrix/
 >
 > Is that page outdated, if not what are the areas that
still need to be
 > addressed please?
 >
 > Cheers
 >
 > Reza
 >
 >
 > --
 >
 > This email may be confidential and privileged. If you
received this
 > communication by mistake, please don't forward it to
anyone else, please
 > erase all copies and attachments, and please let me
know that it has
 > gone to the wrong person.
 >
 > The above terms reflect a potential business
arrangement, are provided
 > solely as a basis for further discussion, and are not
intended to be and
 > do not constitute a legally binding obligation. No
legally binding
 > obligations will be created, implied, or inferred
until an agreement in
 > final form is executed in writing by all parties
involved.
 >






Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Robert Bradshaw
+1

I validated the artifacts and Python 3.

On Sat, Jun 1, 2019 at 7:45 PM Ankur Goenka  wrote:
>
> Thanks Ahmet and Luke for validation.
>
> If no one has objections then I am planning to move ahead without Gearpump 
> validation, as it seems to have been broken for the past several releases.
>
> Reminder: The voting closes on 2nd June so please validate and vote by then.
>
> On Fri, May 31, 2019 at 10:43 AM Ahmet Altay  wrote:
>>
>> +1
>>
>> I validated python 2 quickstarts.
>>
>> On Fri, May 31, 2019 at 10:22 AM Lukasz Cwik  wrote:
>>>
>>> I did the Java local quickstart for all the runners in the release 
>>> validation sheet and gearpump failed for me due to a missing dependency. 
>>> Even after I fixed up the dependency, the pipeline then got stuck. I filed 
>>> BEAM-7467 with all the details.
>>>
>>> Note that I tried the quickstart for 2.8.0 through 2.12.0
>>> 2.8.0 and 2.9.0 failed due to a timeout (maybe I was using the wrong 
>>> command but this test[1] suggests that I was using a correct one)
>>> 2.10.0 and higher fail due to the missing gs-collections dependency.
>>>
>>> Manu, could you help figure out what is going on?
>>>
>>> 1: 
>>> https://github.com/apache/beam/blob/2d3bcdc542536037c3e657a8b00ebc222487476b/release/src/main/groovy/quickstart-java-gearpump.groovy#L33
>>>
>>> On Thu, May 30, 2019 at 7:53 PM Ankur Goenka  wrote:

 Hi everyone,

 Please review and vote on the release candidate #2 for the version 2.13.0, 
 as follows:

 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 The complete staging area is available for your review, which includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to dist.apache.org 
 [2], which is signed with the key with fingerprint 
 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.13.0-RC2" [5],
 * website pull request listing the release [6] and publishing the API 
 reference manual [7].
 * Python artifacts are deployed along with the source release to the 
 dist.apache.org [2].
 * Validation sheet with a tab for 2.13.0 release to help with validation 
 [8].

 The vote will be open for at least 72 hours. It is adopted by majority 
 approval, with at least 3 PMC affirmative votes.

 Thanks,
 Ankur

 [1] https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
 [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4] https://repository.apache.org/content/repositories/orgapachebeam-1070/
 [5] https://github.com/apache/beam/tree/v2.13.0-RC2
 [6] https://github.com/apache/beam/pull/8645
 [7] https://github.com/apache/beam-site/pull/589
 [8] 
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952