Re: Links to Java API docs in Beam Website documentation (Was: Version Beam Website Documentation)

2019-12-11 Thread jincheng sun
+1 for using {{site.release_latest}}, which makes more sense to me.

Best,
Jincheng

On Wed, Dec 11, 2019 at 1:12 PM Kenneth Knowles  wrote:

> +1 to site.release_latest
>
> We do have a dead link checker in the website tests. Does it not catch
> moved classes, etc?
>
> On Tue, Dec 10, 2019 at 1:49 PM Pablo Estrada  wrote:
>
>> +1 to rely on expanding {{site.release_latest}}.
>>
>> On Tue, Dec 10, 2019 at 12:05 PM Brian Hulette 
>> wrote:
>>
>>> I was thinking about this recently as well. I requested we add a link to
>>> the java API docs in a website change [1]. I searched around a bit to look
>>> for precedent on how to do this, but I found three different methods:
>>> - Links to a specific version (e.g.
>>> https://beam.apache.org/releases/javadoc/2.0.0/...)
>>> - Links to "current" (e.g.
>>> https://beam.apache.org/releases/javadoc/current/...)
>>> - Links that rely on expanding site.release_latest (i.e.
>>> https://beam.apache.org/releases/javadoc/{{ site.release_latest}}/...)
>>>
>>> The first seems clearly bad if we want to always use the most recent
>>> version, but it does have the benefit that we don't have to worry about the
>>> links breaking after code changes.
>>>
>>> The latter two are effectively the same, but site.release_latest has the
>>> benefit that it's parameterized so we _could_ generate documentation for
>>> other versions if we want to. It also seems to be the most prevalent. So I
>>> think that's the way to go. Are there any objections to updating all of our
>>> links to use site.release_latest?
>>>
>>> I think the only possible concern is we might break a link if we
>>> move/rename a class. It would be nice if there were some way to validate
>>> them.
>>>
>>> Brian
>>>
>>> [1] https://github.com/apache/beam/pull/10273#discussion_r354533080
>>>
>>>
>>> On Fri, Dec 6, 2019 at 7:20 AM Maximilian Michels 
>>> wrote:
>>>
 @Kenn This is not only about breaking changes. We can also add new
 features or settings which will then be advertised in the documentation
 but not be available in older versions.

 Having a single source of truth is easier to maintain and better
 discoverable via search engines. However, it forces us to use wording
 like "Works like this in Beam version <= X.Y, otherwise use ..". The
 pragmatic approach there is to just ignore old Beam versions. That's not
 super user friendly, but it works.

 IMHO the amount of version-specific content in the Beam documentation
 probably does not yet justify forking the documentation for every
 release.

 Cheers,
 Max

 On 06.12.19 08:13, Alex Van Boxel wrote:
 > It also seems to be too complex for the Google crawler. A lot of
 > times I arrive at documentation for an older version of a product when
 > I search (aka Google) for something.
 >
 >   _/
 > _/ Alex Van Boxel
 >
 >
 > On Fri, Dec 6, 2019 at 6:20 AM Kenneth Knowles wrote:
 >
 > Since we are not making breaking changes (we hope) and we try to be
 > careful about performance regressions, I think it is OK to simply
 > encourage users to upgrade to the latest if they expect the
 > narrative documentation to match their version. The versioned API
 > docs are probably enough. We might consider putting more info into
 > the javadocs / pydocs to bridge the gap, if you have seen any issues
 > with users hitting trouble.
 >
 > I am saying this for two reasons:
 >
 >   - versioning the site is more work, and someone would need to do
 > that work
 >   - but more than that, versioned site is more complex for users
 >
 > Kenn
 >
 > On Wed, Dec 4, 2019 at 1:48 PM Ankur Goenka wrote:
 >
 > I agree, having a single website that showcases the latest Beam
 > version and encourages users to use the latest Beam version is
 > very useful.
 > Calling out version limitations definitely makes users' lives
 > easier.
 >
 > The use case I have in mind is more along the lines of best
 > practices and the recommended way of doing things.
 > One such example is the way we recommend new users try
 > portable Flink. We are overhauling and simplifying the user
 > onboarding experience. Though the old way of doing things is
 > still supported, the easier new recommendation for onboarding
 > will only apply from Beam 2.18 onward.
 > We can of course create sections in the documentation for this
 > use case, but it seems like a poor man's way of versioning :)
 >
 > You also highlighted a great use case about the LTS release. Should
 > we simply separate out the documentation for the LTS release and
 > 
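
For reference, a sketch of what a link parameterized with
{{ site.release_latest }} might look like in the website sources
(Jekyll/Liquid; the class path below is just an illustration, not taken from
the thread):

  [GroupByKey](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/org/apache/beam/sdk/transforms/GroupByKey.html)

Jekyll expands {{ site.release_latest }} at build time to whatever release is
configured for the site, so the rendered link always points at the latest
published Javadoc.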

December board report

2019-12-11 Thread Kenneth Knowles
Hi all,

Late notice on this, but the December Board report is due "now".

I've started a draft here:
https://docs.google.com/document/d/1AJT5j-qRLJPeN5x6nbHD5KqadXLM0zT0Ugmiy_vQ7C8/edit?usp=sharing


I included a number of big discussions or features that I noticed in the
email lists or GitHub history. Please help me by describing them more
fully. You can see the style of past board reports at
https://whimsy.apache.org/board/minutes/Beam.html

I will submit by the end of the week.

Kenn


Re: [PROPOSAL] Revised streaming extensions for Beam SQL

2019-12-11 Thread jincheng sun
Thanks for bringing up this discussion, Kenn!

Definitely +1 for the proposal.

I have left some questions in the documentation :)

Best,
Jincheng

On Wed, Dec 11, 2019 at 5:23 AM Rui Wang  wrote:

> Since I am not seeing more people commenting on this proposal, can we
> consider it already accepted by the Beam community?
>
> If it is accepted, I want to start a discussion on deprecating the old GROUP
> BY windowing style and keeping only table-valued function windowing.
>
>
> -Rui
>
> On Thu, Jul 25, 2019 at 11:32 AM Kenneth Knowles  wrote:
>
>> We hope it does enter the SQL standard. It is one reason for coming
>> together to write this paper.
>>
>> OVER clause is mentioned often.
>>
>>  - TUMBLE can actually just be a function so you don't need OVER or any
>> of the fancy stuff we propose; it is just done to make them all look similar
>>  - HOP still doesn't work since OVER clause has one value per input row,
>> it is still 1 to 1 input/output ratio
>>  - SESSION GAP 5 MINUTES (PARTITION BY key) is actually a natural syntax
>> that could work well
>>
>> None of them require ORDER, by design.
>>
>> On the other hand, implementing the general OVER clause and the rank,
>> running sum, etc, could be done with GBK + sort values. That is not related
>> to windowing. And since in SQL users of windowing will think of OVER as
>> related to ordering, I personally don't want to also use it for something
>> that has nothing to do with ordering.
>>
>> But if you would write something up, that could be interesting to discuss
>> further.
>>
>> Kenn
>>
>> On Wed, Jul 24, 2019 at 2:24 PM Mingmin Xu  wrote:
>>
>>> +1 to remove those magic words in Calcite streaming SQL, just because
>>> they're not SQL standard. The idea of replacing HOP/TUMBLE with
>>> table-valued functions makes it concise; my only question is, is it (or will
>>> it be) part of the SQL standard? I'm a big fan of aligning with standards :lol
>>>
>>> PS: although the concept of `window` used here is different from window
>>> functions in SQL, the syntax gives some insight. Take the example of
>>> `ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) AS row_number`:
>>> `ROW_NUMBER()` assigns a sequence value to records in the subgroup with
>>> key 'COL1'. We could introduce another function, like TUMBLE(), which
>>> would assign a window instance (more instances for HOP()) to the record.
>>>
>>> Mingmin
>>>
>>>
>>> On Sun, Jul 21, 2019 at 9:42 PM Manu Zhang 
>>> wrote:
>>>
 Thanks Kenn,
 great paper. I left some newbie questions on the proposal.

 Manu

 On Fri, Jul 19, 2019 at 1:51 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> I recently had the great privilege to work with others from Beam plus
> Calcite and Flink SQL contributors to build a new and minimal proposal for
> adding streaming extensions to standard SQL: event time, watermarks,
> windowing, triggers, stream materialization.
>
> We hope this will influence the standard body and also Calcite and
> Flink and other projects working on the streaming SQL.
>
> I would like to start implementing these extensions in Beam, moving
> from our current streaming extensions to the new proposal.
>
>The whole paper is https://arxiv.org/abs/1905.12133
>
>My small proposal to start in Beam:
> https://s.apache.org/streaming-beam-sql
>
> TL;DR: replace `GROUP BY Tumble/Hop/Session` with table functions that
> do Tumble, Hop, Session. The details of why to make this change are
> explained in the appendix to my proposal. For the big picture of how it
> fits in, the full paper is best.
>
> Kenn
>

>>>
>>> --
>>> 
>>> Mingmin
>>>
>>
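
To make the proposed change concrete, a rough sketch of the two styles side
by side (the table and column names are made up for illustration, and the
exact function-invocation syntax and window-bound column names are defined in
the proposal doc, so treat this only as a sketch):

  -- current Calcite-style streaming extension: windowing via GROUP BY
  SELECT productId, COUNT(*)
  FROM Orders
  GROUP BY productId, TUMBLE(orderTime, INTERVAL '1' HOUR);

  -- proposed style: TUMBLE as a table-valued function applied in FROM,
  -- with the window bounds exposed as ordinary columns to group by
  SELECT productId, COUNT(*)
  FROM TABLE(TUMBLE(TABLE Orders, DESCRIPTOR(orderTime), INTERVAL '1' HOUR))
  GROUP BY productId, window_start, window_end;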


Re: Executing the runner validation tests for the Twister2 runner

2019-12-11 Thread Kenneth Knowles
I dug into Twister2 a little bit to understand the question better,
checking how the various resource managers / launchers are plumbed.

How would a user set up automated monitoring for a job? If that is scraping
the logs, then it seems unfortunate for users, but I think the Beam runner
would naturally use whatever a user might use.

Kenn

On Wed, Dec 11, 2019 at 10:45 AM Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:

> Hi Devs,
>
> I have been making some progress on the Twister2 runner for Beam that
> I mentioned before on the mailing list. The runner is able to run the
> wordcount example and produce correct results. So I am currently trying to
> run the runner validation tests.
>
> From what I understood looking at a couple of examples, tests are
> validated based on the exceptions that are thrown (or not) during test
> runtime. However, in Twister2 the job submission client currently does not
> get failure information such as exceptions back once the job is submitted.
> These are however recorded in the worker log files.
>
> So in order to validate the tests for Twister2 I would have to parse the
> worker logfile and check what exceptions are in the logs. Would that be an
> acceptable solution for the validation tests?
>
> Best Regards,
> Pulasthi
>
>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>
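
A minimal sketch of what such log-based failure detection could look like on
the runner side (the directory layout, ".log" extension, and matching on the
word "Exception" are assumptions made for illustration, not Twister2
specifics):

  import java.io.IOException;
  import java.io.UncheckedIOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.util.stream.Stream;

  public class WorkerLogChecker {
    /** Returns true if any worker log under logDir mentions an exception. */
    static boolean anyWorkerFailed(Path logDir) throws IOException {
      try (Stream<Path> files = Files.walk(logDir)) {
        return files
            .filter(f -> f.toString().endsWith(".log"))
            .anyMatch(f -> {
              try (Stream<String> lines = Files.lines(f)) {
                return lines.anyMatch(l -> l.contains("Exception"));
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
            });
      }
    }

    public static void main(String[] args) throws IOException {
      System.out.println(anyWorkerFailed(Paths.get(args[0])));
    }
  }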


Re: Contributor Permission for BEAM-8953

2019-12-11 Thread Pablo Estrada
Hi Ryan!
Welcome! I've added you as contributor, and assigned BEAM-8953 to you.
Best
-P.

On Wed, Dec 11, 2019 at 3:31 PM Ryan Berti  wrote:

> Hello,
>
> My name is Ryan Berti, working at Quibi. We're using Beam via SCIO, and
> I've run into a situation where I'd like to contribute some minor
> improvements to the java sdk, specifically to ParquetIO. I've created
> BEAM-8953 and would like some feedback on the improvement. My JIRA user
> name is 'Ryan Berti'.
>
> Thanks!
> Ryan
>


Contributor Permission for BEAM-8953

2019-12-11 Thread Ryan Berti
Hello,

My name is Ryan Berti, working at Quibi. We're using Beam via SCIO, and
I've run into a situation where I'd like to contribute some minor
improvements to the java sdk, specifically to ParquetIO. I've created
BEAM-8953 and would like some feedback on the improvement. My JIRA user
name is 'Ryan Berti'.

Thanks!
Ryan


Re: Cython unit test suites running without Cythonized sources

2019-12-11 Thread Chad Dombrova
>
> IIUC, isolated_build=True and the removal of setup.py invocation in the
> current virtualenv should eliminate any Cython output files in the repo,
> and no need for run_tox_cleanup.sh?
>

Correct, that script is deleted in this commit:
https://github.com/apache/beam/pull/10038/commits/c6dab09abf9f4091f0dbf7eac2964c5be0665763


Re: Cython unit test suites running without Cythonized sources

2019-12-11 Thread Udi Meiri
The `changedir = {envsitepackagesdir}` setting is definitely something I
haven't thought of.
It solves the shadowing issue without needing to split tests and packages
from one another. (though I still think it's unnecessary to include tests
in the published package)

IIUC, isolated_build=True and the removal of setup.py invocation in the
current virtualenv should eliminate any Cython output files in the repo,
and no need for run_tox_cleanup.sh?


On Wed, Dec 11, 2019 at 9:38 AM Chad Dombrova  wrote:

> Hi Udi,
>
>> Sorry I didn't realize you already had a solution for the shadowing issue
>> and BEAM-8572.
>>
>
> No worries at all.  I haven't had much time to invest into that PR lately
> (most of it I did at home on my own time), but I did get past most of the
> major issues.  You've been working on so many of the same problems I was
> trying to solve there, and so far you've been coming to the same
> conclusions independently (e.g. removing pytest-runner and
> setup_requires).   It's great to have that validation, and it's helped
> reduce the scope of my PR.  Moving forward, I would love to team up on
> this.  Happy to answer any questions you have about the approach I took.
>
> -chad
>
>
>
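
For readers following along, a rough sketch of how those pieces could fit
together in a tox.ini (the environment name, dependency list, and pytest
invocation are illustrative, not Beam's actual configuration):

  [tox]
  # Build the package in an isolated environment (PEP 517/518) instead of
  # invoking setup.py in the current virtualenv, so no Cython output files
  # are left behind in the repo.
  isolated_build = True

  [testenv:py37-cython]
  deps =
      cython
      pytest
  # Run the tests from the installed package rather than the source checkout,
  # so the Cythonized modules are the ones actually imported (no shadowing).
  changedir = {envsitepackagesdir}
  commands = pytest apache_beam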




Executing the runner validation tests for the Twister2 runner

2019-12-11 Thread Pulasthi Supun Wickramasinghe
Hi Devs,

I have been making some progress on the Twister2 runner for Beam that I
mentioned before on the mailing list. The runner is able to run the
wordcount example and produce correct results. So I am currently trying to
run the runner validation tests.

From what I understood looking at a couple of examples, tests are
validated based on the exceptions that are thrown (or not) during test
runtime. However, in Twister2 the job submission client currently does not
get failure information such as exceptions back once the job is submitted.
These are however recorded in the worker log files.

So in order to validate the tests for Twister2 I would have to parse the
worker logfile and check what exceptions are in the logs. Would that be an
acceptable solution for the validation tests?

Best Regards,
Pulasthi




-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035


Re: RFC: python static typing PR

2019-12-11 Thread Chad Dombrova
Hi all,
Robert has diligently reviewed the first batch of changes for this PR, and
all review notes are addressed and tests are passing:
https://github.com/apache/beam/pull/9915

Due to the number of files touched, there's a short window of about one or
two days before a merge conflict arrives on master, and after resolving
that it usually takes another 1-2 days of pasting "Run Python PreCommit"
until they pass again, so it would be great to get this merged while the
window is open!  Despite the number of files touched, the changes are
almost entirely type comments, so the PR is designed to be quite safe.

-chad


On Tue, Nov 5, 2019 at 2:50 PM Chad Dombrova  wrote:

> Glad to hear we have such a forward-thinking community!
>
>
> On Tue, Nov 5, 2019 at 2:43 PM Robert Bradshaw 
> wrote:
>
>> Sounds like we have consensus. Let's move forward. I'll follow up with
>> the discussions on the PRs themselves.
>>
>> On Wed, Oct 30, 2019 at 2:38 PM Robert Bradshaw 
>> wrote:
>> >
>> > On Wed, Oct 30, 2019 at 1:26 PM Chad Dombrova 
>> wrote:
>> > >
>> > >> Do you believe that a future mypy plugin could replace pipeline type
>> checks in Beam, or are there limits to what it can do?
>> > >
>> > > mypy will get us quite far on its own once we completely annotate the
>> beam code.  That said, my PR does not include my efforts to turn
>> PTransforms into Generics, which will be required to properly analyze
>> pipelines, so there's still a lot more work to do.  I've experimented with
>> a mypy plugin to smooth over some of the rough spots in that workflow and I
>> will just say that the mypy API has a very steep learning curve.
>> > >
>> > > Another thing to note: mypy is very explicit about function
>> annotations.  It does not do the "implicit" inference that Beam does, such
>> as automatically detecting function return types.  I think it should be
>> possible to do a lot of that as a mypy plugin, and in fact, since it has
>> little to do with Beam it could grow into its own project with outside
>> contributors.
>> >
>> > Yeah, I don't think, as is, it can replace what we do, but with
>> > plugins I think it could possibly come closer. Certainly there is
>> > information that is only available at runtime (e.g. reading from a
>> > database or avro/parquet file could provide the schema which can be
>> > used for downstream checking) which may limit the ability to do
>> > everything statically (even Beam Java is moving this direction). Mypy
>> > clearly has an implementation of the "is compatible with" operator
>> > that I would love to borrow, but unfortunately it's not (easily?)
>> > exposed.
>> >
>> > That being said, we should leverage what we can for pipeline
>> > authoring, and it'll be a great development too regardless.
>>
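
For anyone who hasn't seen comment-style annotations, a tiny sketch of the
form the PR adds (the function itself is made up, not taken from the PR):

  from typing import Iterable, List


  def tokenize(lines):
      # type: (Iterable[str]) -> List[str]
      """Split each input line into words."""
      return [word for line in lines for word in line.split()]


  # tokenize(42) would be flagged by mypy as passing an "int" where an
  # Iterable[str] is expected, with no change to runtime behavior.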
>


Re: Cython unit test suites running without Cythonized sources

2019-12-11 Thread Chad Dombrova
Hi Udi,

> Sorry I didn't realize you already had a solution for the shadowing issue
> and BEAM-8572.
>

No worries at all.  I haven't had much time to invest into that PR lately
(most of it I did at home on my own time), but I did get past most of the
major issues.  You've been working on so many of the same problems I was
trying to solve there, and so far you've been coming to the same
conclusions independently (e.g. removing pytest-runner and
setup_requires).   It's great to have that validation, and it's helped
reduce the scope of my PR.  Moving forward, I would love to team up on
this.  Happy to answer any questions you have about the approach I took.

-chad


Re: Is org.apache.beam.sdk.transforms.FlattenTest.testFlattenMultipleCoders supposed to be supported ?

2019-12-11 Thread Etienne Chauchot

Ok,

Thanks Kenn.

The Flatten javadoc says that by default the coder of the output should
be the coder of the first input. But in the test, it sets the output
coder to something different. Pending a consensus on this model point
and a common implementation in the runners, I'll just exclude this test
as other runners do.


Etienne

On 11/12/2019 04:46, Kenneth Knowles wrote:
It is a good point. Nullable(VarLong) and VarLong are two different
types, whose least upper bound is Nullable(VarLong). BigEndianLong
and VarLong are two different types, with no least upper bound in the
"coders" type system. Yet we understand that the values they encode
are equal. I do not think it is clearly formalized anywhere what the
rules are (corollary: not thought carefully about).


I think both possibilities are reasonable:

1. Make the rule that Flatten only accepts inputs with identical
coders. This will sometimes be annoying, requiring vacuous "re-encode"
noop ParDos (they will be fused away on maybe all runners).
2. Define types as the domain of values, and Flatten accepts sets of 
PCollections with the same domain of values. Runners must "do whatever 
it takes" to respect the coders on the collection.
2a. For very simple cases, Flatten takes the least upper bound of the 
input types. The output coder of Flatten has to be this least upper 
bound. For example, a non-nullable output coder would be an error.


Very interesting and nuanced problem. Flatten just became quite an 
interesting transform, for me :-)


Kenn

On Tue, Dec 10, 2019 at 12:37 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi all,

I have a question about the testFlattenMultipleCoders test:

This test uses 2 collections:

1. long and null data encoded using NullableCoder(BigEndianLongCoder)

2. long data encoded using VarLongCoder

It then flattens the 2 collections and sets the coder of the resulting
collection to NullableCoder(VarLongCoder)

Most runners translate Flatten as a simple union of the 2 PCollections
without any re-encoding. As a result, all the runners exclude this test
from the test set because of coder issues. For example, Flink raises an
exception in its Flatten translation if the type of elements in
PCollection1 is different from the type in PCollection2. Other examples
are the direct runner and the Spark (RDD-based) runner, which do not
exclude this test, simply because they don't need to serialize elements
and so never even call the coders.

That means that an output PCollection of the Flatten with heterogeneous
coders is not really tested, so it is not really supported.

Should we drop this test case (which is executed by no runner), or should
we force each runner to re-encode?

Best

Etienne
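
For readers who want to see the shape of the scenario, a self-contained
sketch of roughly what the test sets up (a paraphrase against the public Beam
Java API, not the actual FlattenTest code):

  import java.util.Arrays;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.coders.BigEndianLongCoder;
  import org.apache.beam.sdk.coders.NullableCoder;
  import org.apache.beam.sdk.coders.VarLongCoder;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.Flatten;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PCollectionList;

  public class FlattenCodersSketch {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      // Input 1: longs plus nulls, encoded with NullableCoder(BigEndianLongCoder).
      PCollection<Long> nullableLongs =
          p.apply("CreateNullable",
              Create.of(Arrays.asList(1L, null, 2L))
                  .withCoder(NullableCoder.of(BigEndianLongCoder.of())));

      // Input 2: plain longs, encoded with VarLongCoder.
      PCollection<Long> varLongs =
          p.apply("CreateVarLong",
              Create.of(Arrays.asList(3L, 4L)).withCoder(VarLongCoder.of()));

      // Flatten the two and give the result a coder that matches neither input.
      PCollection<Long> flattened =
          PCollectionList.of(nullableLongs).and(varLongs)
              .apply(Flatten.pCollections());
      flattened.setCoder(NullableCoder.of(VarLongCoder.of()));

      // A runner that unions the inputs without re-encoding never exercises
      // the output coder, which is why behavior differs across runners.
      p.run().waitUntilFinish();
    }
  }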