Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-16 Thread Kenneth Knowles
+1 Ran the verification scripts. Caveats: - I input a GCS bucket that did not exist, expecting it to be created, so the Dataflow tests failed. - I also skipped the Python tests that asked to write to GitHub. - You also have not built, staged, & signed the Python wheels. It is a bit hidden in

Re: Python SDK timestamp precision

2019-04-16 Thread Kenneth Knowles
I am not so sure this is a good idea. Here are some systems and their precision: Arrow - microseconds BigQuery - microseconds New Java instant - nanoseconds Firestore - microseconds Protobuf - nanoseconds Dataflow backend - microseconds Postgresql - microseconds Pubsub publish time - nanoseconds M

Python SDK timestamp precision

2019-04-16 Thread Thomas Weise
The Python SDK currently uses timestamps in microsecond resolution while the Java SDK, as most would probably expect, uses milliseconds. This causes a few difficulties with portability (Python coders need to convert to millis for WindowedValue and Timers, which is related to a bug I'm looking into: h
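For illustration only, the micros-to-millis conversion mentioned above could be sketched with plain integers as below. The function names are hypothetical and are not the Beam SDK's API; rounding toward negative infinity is an assumption, since the message does not say how sub-millisecond precision should be dropped.

```python
# Hypothetical helpers; not part of the Beam SDK.
MICROS_PER_MILLI = 1000


def micros_to_millis(micros: int) -> int:
    """Drop sub-millisecond precision (assumption: floor, so negative
    timestamps round toward negative infinity)."""
    return micros // MICROS_PER_MILLI


def millis_to_micros(millis: int) -> int:
    """Widen a millisecond timestamp to microseconds (lossless)."""
    return millis * MICROS_PER_MILLI


original_micros = 1_555_400_000_123_456
round_tripped = millis_to_micros(micros_to_millis(original_micros))
# The round trip is lossy: the last three digits of precision are gone.
lost_micros = original_micros - round_tripped
```

The point of the sketch is that the conversion is lossy in one direction only, which is why a round trip through a millisecond field cannot restore the original value.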

Re: Insufficient CPU quota in apache-beam-testing causes test flakes

2019-04-16 Thread Valentyn Tymofieiev
Thanks, Yifan. 1. It appears that there are 32 Jenkins-related instances, 16 cores each, which consume over 2/3 of the available CPU quota. 2. Among the old VMs there are six 1-core VMs that look like "gke-io-datastores-*" and "gke-metrics-*". They don't consume much quota, but I am curious why we have

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-16 Thread Thomas Weise
I opened https://github.com/apache/beam/pull/8319 to eliminate the duplicate yaml file (and cover timestamp coder for the Python SDK). Would appreciate if someone could take a look. (PR doesn't affect the StrUtf8Coder subject, but it is required to fix a timer bug.) Thanks, Thomas On Fri, Apr 12

Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-16 Thread Pablo Estrada
Hm I am not very familiar with POI, but if its transforms are able to take in a file descriptor, you should be able to use FileIO.match()[0] to find your files (local, or in GCS/S3/HDFS); and FileIO.readMatches()[1] to get file descriptors for these files. If the POI libraries require the files to

Re: Insufficient CPU quota in apache-beam-testing causes test flakes

2019-04-16 Thread Yifan Zou
We recently created 16 compute instances for Jenkins. Each of them has 16 CPUs, which means they consume 256 CPUs in total. I guess that is why the CPU usage in us-central1 remains high. We're working on migrating the rest of the old Jenkins agents, and the old instances will be removed once finis

Insufficient CPU quota in apache-beam-testing causes test flakes

2019-04-16 Thread Valentyn Tymofieiev
FYI, I have recently observed a large number of test failures in Beam test suites where Dataflow jobs failed due to a lack of CPU quota in the apache-beam-testing project. We have been adding new suites for Python 3.x versions, which may have contributed to this problem. I have not investigated yet

Re: pickler.py issue with nested classes

2019-04-16 Thread Udi Meiri
Not sure: my case is using a nested class and the error is a stack overflow (or infinite recursion detection is triggered). It is odd though that they have the same workaround.

Re: pickler.py issue with nested classes

2019-04-16 Thread Valentyn Tymofieiev
This looks very similar to https://github.com/uqfoundation/dill/issues/300, however we observed that bug on Python 3, and not on Python 2.7. On Tue, Apr 16, 2019 at 10:58 AM Udi Meiri wrote: > I was looking at migrating unit tests to pytest and found this test which > doesn't pass: > https://gis

Re: Removing :beam-website:testWebsite from gradle build target

2019-04-16 Thread Kyle Weaver
> it would be good to have a sort of weekly report on dead links Seeing as checking for broken external links returns a lot of false positives, I'd rather not spam everyone with them. However, I don't know if making it a postcommit will give it sufficient visibility. Not sure what the best way to

pickler.py issue with nested classes

2019-04-16 Thread Udi Meiri
I was looking at migrating unit tests to pytest and found this test which doesn't pass: https://gist.github.com/udim/a71fcb278b56a9a5b7962f4588e14efb (stack overflow) (requires installing python3.7 and "python3.7 -m pip install pytest".) The same command passes with python2.7 and python3.5. I trie

Re: [DISCUSS] Adding GroupByKeyAndSort

2019-04-16 Thread Kenneth Knowles
On Tue, Apr 16, 2019 at 9:18 AM Reuven Lax wrote: > A common request (especially in streaming) is to support sorting values by > timestamp, not by the full value. > On this point, I think an explicit secondary key probably addresses the need. Naively implemented, the "sort by values" use case wo

Re: [DISCUSS] Adding GroupByKeyAndSort

2019-04-16 Thread Reuven Lax
This is a good conversation. Some things to consider: Since Beam is cross language, the "shufflers" can usually only sort by binary value. This is different than other systems where custom comparators can be used for sorting. We might need to introduce OrderPreservingCoder, and mark the coders tha
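To illustrate the point about sorting by binary value, here is a small sketch using only the standard library; it does not use any Beam coder, and the two encoder functions are hypothetical stand-ins. It shows that a fixed-width big-endian encoding of non-negative integers sorts the same way bytewise as the values do, while a little-endian encoding of the same values does not.

```python
import struct


def encode_big_endian(n: int) -> bytes:
    """Fixed-width unsigned 64-bit, most significant byte first."""
    return struct.pack(">Q", n)


def encode_little_endian(n: int) -> bytes:
    """Fixed-width unsigned 64-bit, least significant byte first."""
    return struct.pack("<Q", n)


values = [1, 256, 2, 65536]

by_value = sorted(values)
# Sorting by the encoded bytes mimics a shuffler that only compares bytes.
by_big_endian = sorted(values, key=encode_big_endian)
by_little_endian = sorted(values, key=encode_little_endian)

print(by_value)
print(by_big_endian)
print(by_little_endian)
```

Under this reading, an "order preserving" coder would be one whose encoded bytes compare the same way as the decoded values, as the big-endian encoder does here for non-negative integers.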

Re: [DISCUSS] Adding GroupByKeyAndSort

2019-04-16 Thread Kenneth Knowles
1. This is clearly useful, and extensively used. Agree with all that. I think it can work for batch and streaming equally well if sorting is required only per "pane", though I might be overlooking something. 2. A transform need not be primitive to be well-defined and executed in a special way by m

[DISCUSS] Adding GroupByKeyAndSort

2019-04-16 Thread Gleb Kanterov
At the moment, portability has a GroupByKey transform. In most data processing frameworks, such as Hadoop MR and Apache Spark, there is a concept of secondary sorting during the shuffle phase. Dataflow worker code has it under the name BatchViewOverrides.GroupByKeyAndSortValuesOnly [1], it's PTransfor
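As a rough illustration of what secondary sorting means semantically, the sketch below groups records by a primary key and orders each group's values by a secondary key. It is an in-memory stand-in, not the Dataflow or Beam implementation: the function name and the (key, (secondary_key, value)) input shape are assumptions, and a real runner would perform the sort during the shuffle rather than after collecting a group.

```python
from collections import defaultdict


def group_by_key_and_sort_values(records):
    """Group (key, (secondary_key, value)) records by key and sort each
    group's values by secondary_key. Hypothetical helper for illustration."""
    grouped = defaultdict(list)
    for key, (secondary_key, value) in records:
        grouped[key].append((secondary_key, value))
    return {
        key: sorted(values, key=lambda pair: pair[0])
        for key, values in grouped.items()
    }


records = [
    ("user1", (3, "c")),
    ("user2", (1, "x")),
    ("user1", (1, "a")),
    ("user1", (2, "b")),
]
result = group_by_key_and_sort_values(records)
```

Sorting only on the secondary key, rather than on the full value, matches the "explicit secondary key" framing used elsewhere in this thread.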

Re: Metrics support on Flink?

2019-04-16 Thread Łukasz Gajowy
Thanks, Ryan, for a great introduction to the topic - it helped a lot! Let me try to fuse all the discussions we had in this one thread. You mentioned[1] that you thought of something similar and asked what problems I faced, so let me explain it here as clearly as I can: The main trouble I had is

Re: [ANNOUNCE] New committer announcement: Boyuan Zhang

2019-04-16 Thread Gleb Kanterov
Congratulations! On Sat, Apr 13, 2019 at 12:53 AM Thomas Weise wrote: > Congrats! > > > On Thu, Apr 11, 2019 at 6:03 PM Reuven Lax wrote: > >> Congratulations Boyuan! >> >> On Thu, Apr 11, 2019 at 4:53 PM Ankur Goenka wrote: >> >>> Congrats Boyuan! >>> >>> On Thu, Apr 11, 2019 at 4:52 PM Mark

Re: Removing :beam-website:testWebsite from gradle build target

2019-04-16 Thread Ismaël Mejía
+1 to removing link validation for website changes. However it would be good to have a sort of weekly report on dead links or another alternative to be aware of them. On Tue, Apr 16, 2019 at 2:43 AM Kyle Weaver wrote: > I agree with Andrew that the external links checks are ultra-flaky and > sel