Re: Cassandra and hadoop test broken on master and in previous releases

2019-05-10 Thread Jean-Baptiste Onofré
Hi, let me try to reproduce on my box. Regards JB On 11/05/2019 01:34, Ankur Goenka wrote: > Hi, > > Cassandra and Hadoop tests for targets :beam-sdks-java-io-cassandra:test > :beam-sdks-java-io-hadoop-format:test are failing at master and in > 2.12.0 release with jvm crash.  > > Gradle Scan:

Do we maintain offline artifact version in javadocs sdks/java/javadoc/build.gradle

2019-05-10 Thread Ankur Goenka
Hi, I see that sdks/java/javadoc/build.gradle is not in sync with org/apache/beam/gradle/BeamModulePlugin.groovy. I wanted to check whether we are still maintaining it; based on that, we can either remove or update sdks/java/javadoc/build.gradle. Thanks, Ankur

Re: Problem with gzip

2019-05-10 Thread Michael Luckey
Maybe the solution implemented on JdbcIO [1], [2] could be helpful in these cases. [1] https://issues.apache.org/jira/browse/BEAM-2803 [2] https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1088-L1118 On Fri, May 10, 2019 at 11:36

Cassandra and hadoop test broken on master and in previous releases

2019-05-10 Thread Ankur Goenka
Hi, Cassandra and Hadoop tests for targets :beam-sdks-java-io-cassandra:test :beam-sdks-java-io-hadoop-format:test are failing at master and in the 2.12.0 release with a JVM crash. Gradle Scan: https://gradle.com/s/rhseoqeouup6e Any help debugging the failure would be useful. Thanks, Ankur

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
There is no such flag to turn off fusion. Writing 100s of GiBs of uncompressed data to reshuffle will take time when it is limited to a small number of workers. If you can split up your input into a lot of smaller files that are compressed then you shouldn't need to use the reshuffle but still

Re: Problem with gzip

2019-05-10 Thread Allie Chen
Re Lukasz: Thanks! I am not able to control the compression format but I will see whether splitting the gzip files will work. Is there a simple flag in Dataflow that could turn off the fusion? Re Reuven: No, I checked the run time on Dataflow UI, the GroupByKey and FlatMap in Reshuffle are very

Re: Problem with gzip

2019-05-10 Thread Reuven Lax
It's unlikely that Reshuffle itself takes hours. It's more likely that simply reading and decompressing all that data was very slow when there was no parallelism. *From: *Allie Chen *Date: *Fri, May 10, 2019 at 1:17 PM *To: * *Cc: * Yes, I do see the data after reshuffle are processed in

Re: Fwd: Your application for Season of Docs 2019 was unsuccessful

2019-05-10 Thread Aizhamal Nurmamat kyzy
I think it is still a good idea to make those Jira issues easily findable on the Beam website. Maybe in https://beam.apache.org/contribute/ next to 'improve the documentation' add the link to the documentation component or label, or something similar. To solicit openly for docs contributors is

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
The best solution would be to find a compression format that is splittable and add support for that to Apache Beam and use it. The issue with compressed files is that you can't read from an arbitrary offset. This stack overflow post[1] has some suggestions on seekable compression libraries. A
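The point about arbitrary offsets can be illustrated with a minimal sketch that uses only Python's standard `gzip` module, not Beam itself. The helper name `can_decompress_from` and the sample payload are made up for illustration. It suggests why a single gzip file cannot be divided among workers: the stream decodes from its start, but not from a position in the middle.

```python
import gzip
import zlib


def can_decompress_from(blob: bytes, offset: int) -> bool:
    """Return True if blob[offset:] decodes as a complete gzip stream."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or corrupt deflate data.
        return False


# Hypothetical sample data standing in for a large text file.
payload = b"".join(b"line %d\n" % i for i in range(10000))
blob = gzip.compress(payload)

print(can_decompress_from(blob, 0))               # start of the stream
print(can_decompress_from(blob, len(blob) // 2))  # arbitrary offset in the middle
```

A splittable format, by contrast, would let a reader start decoding at block boundaries inside the file, so several workers could each take a range.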

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
Hi Lukasz, I've created JIRA issue [1] and PR [2]. Jan [1] https://issues.apache.org/jira/browse/BEAM-7269 [2] https://github.com/apache/beam/pull/8555 On 5/10/19 7:39 PM, Lukasz Cwik wrote: That seems like the correct fix as well. We could open up a PR and see what the tests catch as a

Re: Python SDK timestamp precision

2019-05-10 Thread Robert Bradshaw
On Thu, May 9, 2019 at 9:32 AM PM Kenneth Knowles wrote: > From: Robert Bradshaw > Date: Wed, May 8, 2019 at 3:00 PM > To: dev > >> From: Kenneth Knowles >> Date: Wed, May 8, 2019 at 6:50 PM >> To: dev >> >> >> The end-of-window, for firing, can be approximate, but it seems it >> >> should be

Re: Beam's Conda package

2019-05-10 Thread Ahmet Altay
https://github.com/sodre seems to be the person behind it. Does anybody know who that person is? *From: *Charles Chen *Date: *Fri, May 10, 2019 at 1:13 PM *To: *dev Looks like this is where it's living: >

Re: Problem with gzip

2019-05-10 Thread Allie Chen
Yes, that is correct. *From: *Allie Chen *Date: *Fri, May 10, 2019 at 4:21 PM *To: * *Cc: * Yes. > > *From: *Lukasz Cwik > *Date: *Fri, May 10, 2019 at 4:19 PM > *To: *dev > *Cc: * > > When you had X gzip files and were not using Reshuffle, did you see X >> workers read and process the

Re: Beam's Conda package

2019-05-10 Thread Charles Chen
Looks like this is where it's living: https://github.com/conda-forge/apache-beam-feedstock/tree/c96274713fcc5970c967c20e84859e73d0efa0d0 *From: *Lukasz Cwik *Date: *Fri, May 10, 2019 at 1:02 PM *To: *dev I'm not aware of who set up conda as well. There seem to have been ~4500 > downloads of the

Re: Problem with gzip

2019-05-10 Thread Allie Chen
Yes. *From: *Lukasz Cwik *Date: *Fri, May 10, 2019 at 4:19 PM *To: *dev *Cc: * When you had X gzip files and were not using Reshuffle, did you see X > workers read and process the files? > > On Fri, May 10, 2019 at 1:17 PM Allie Chen wrote: > >> Yes, I do see the data after reshuffle are

Re: Problem with gzip

2019-05-10 Thread Allie Chen
Yes, I do see the data after reshuffle are processed in parallel. But the Reshuffle transform itself takes hours or even days to run, according to one test (24 gzip files, 17 million lines in total) I did. Our users' files are mostly in gzip format, since uncompressed files would be too

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
When you had X gzip files and were not using Reshuffle, did you see X workers read and process the files? On Fri, May 10, 2019 at 1:17 PM Allie Chen wrote: > Yes, I do see the data after reshuffle are processed in parallel. But > Reshuffle transform itself takes hours or even days to run,

Re: Beam's Conda package

2019-05-10 Thread Lukasz Cwik
I'm not aware of who set up conda as well. There seem to have been ~4500 downloads of the package so that is a good amount of users. On Fri, May 10, 2019 at 11:45 AM Ahmet Altay wrote: > Hi all, > > There is a conda package for apache-beam [1]. As far as I know, we do not > release this package.

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
+u...@beam.apache.org Reshuffle on Google Cloud Dataflow for a bounded pipeline waits till all the data has been read before the next transforms can run. After the reshuffle, the data should have been processed in parallel across the workers. Did you see this? Are you able to change the input

Problem with gzip

2019-05-10 Thread Allie Chen
Hi, I am trying to load a gzip file to BigQuery using Dataflow. Since the compressed file is not splittable, one worker is allocated to read the file. The same worker will do all the other transforms since Dataflow fused all transforms together. There is a large amount of data in the file, and

Beam's Conda package

2019-05-10 Thread Ahmet Altay
Hi all, There is a conda package for apache-beam [1]. As far as I know, we do not release this package. Does anyone know who owns this? It was last updated to use 2.9.0; at the least, it would be good to add a newer version there. We also don't test in that environment so I am not sure how well it works

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Lukasz Cwik
That seems like the correct fix as well. We could open up a PR and see what the tests catch as a first pass for understanding the implications. On Fri, May 10, 2019 at 9:31 AM Jan Lukavský wrote: > Hm, yes, the fix might be also in fixing hashCode and equals of > SimpleStateTag, so that it

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
Hm, yes, the fix might also be in fixing hashCode and equals of SimpleStateTag, so that it doesn't hash and compare the StateSpec, but only the StructureId. That looks like the best option to me. But I'm not sure about other implications this might have. Jan On 5/10/19 5:43 PM, Reuven Lax wrote:
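As a rough Python analogy of the fix proposed above (the real code is Java; the class names `Spec`, `SpecHashedTag` and `IdHashedTag` are hypothetical and are not Beam classes), the sketch below contrasts a tag that hashes and compares its spec with one that uses only its id. It assumes the spec has identity-based equality, which is one way the lookup problem discussed in this thread could arise.

```python
from dataclasses import dataclass, field


class Spec:
    """Stand-in for a state spec with identity-based equality (assumption)."""

    def __init__(self, coder_name: str) -> None:
        self.coder_name = coder_name


@dataclass(frozen=True)
class SpecHashedTag:
    """Hashes and compares both the id and the spec."""
    state_id: str
    spec: Spec


@dataclass(frozen=True)
class IdHashedTag:
    """Hashes and compares only the id; the spec is carried but ignored."""
    state_id: str
    spec: Spec = field(compare=False)


# Two logically identical tags built from separate Spec instances.
by_spec = {SpecHashedTag("count", Spec("varint")): 42}
spec_lookup_hit = SpecHashedTag("count", Spec("varint")) in by_spec

by_id = {IdHashedTag("count", Spec("varint")): 42}
id_lookup_hit = IdHashedTag("count", Spec("varint")) in by_id

print(spec_lookup_hit, id_lookup_hit)
```

Keying on the id alone makes the map lookup independent of how the spec implements equality, at the cost of treating two tags with the same id but different specs as equal.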

Re: [DISCUSS] Portability representation of schemas

2019-05-10 Thread Brian Hulette
Ah thanks! I added some language there. *From: *Kenneth Knowles *Date: *Thu, May 9, 2019 at 5:31 PM *To: *dev > *From: *Brian Hulette > *Date: *Thu, May 9, 2019 at 2:02 PM > *To: * > > We briefly discussed using arrow schemas in place of beam schemas entirely >> in an arrow thread [1]. The

Re: Coder Evolution

2019-05-10 Thread Lukasz Cwik
Yes, having evolution actually work is quite difficult. For example, take the case of a map based side input where you try to lookup some value by a key. The runner will have stored a bunch of this data using the old format, would you ask that lookups are done using the old format or the new
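The side-input lookup problem can be sketched in a few lines of Python. Both encodings here (`encode_v1`, `encode_v2`) are invented for illustration and do not correspond to any actual Beam coder. The sketch only shows that a key written in one byte format is not found when the lookup encodes the same key in another.

```python
def encode_v1(key: str) -> bytes:
    """Hypothetical old format: raw UTF-8 bytes."""
    return key.encode("utf-8")


def encode_v2(key: str) -> bytes:
    """Hypothetical new format: 2-byte big-endian length prefix plus UTF-8 bytes."""
    raw = key.encode("utf-8")
    return len(raw).to_bytes(2, "big") + raw


# Data the runner stored before the update, keyed by old-format bytes.
stored = {encode_v1("user-1"): "value written before the update"}

found_with_old = stored.get(encode_v1("user-1"))
found_with_new = stored.get(encode_v2("user-1"))

print(found_with_old)  # the stored value
print(found_with_new)  # None: same logical key, different bytes
```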

Help reviewing DynamicMessage protobuf support PR

2019-05-10 Thread Alex Van Boxel
Hi, can someone help review my PR? As I needed to make some design decisions it would be great to have some feedback. https://github.com/apache/beam/pull/8496 I'm currently working on the new schema support for protobuf; as this also needs to support DynamicMessages, feedback would be helpful.

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Reuven Lax
Ok so this sounds like a bug in the DirectRunner then? *From: *Lukasz Cwik *Date: *Fri, May 10, 2019 at 8:38 AM *To: *dev StateSpec should not be used as a key within any maps. We should use the > logical name of the StateSpec relative to the DoFn as its id and should > only be using that id

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Lukasz Cwik
StateSpec should not be used as a key within any maps. We should use the logical name of the StateSpec relative to the DoFn as its id and should only be using that id for comparisons/lookups. On Fri, May 10, 2019 at 1:07 AM Jan Lukavský wrote: > I'm not sure. Generally it affects any runner

Re: Coder Evolution

2019-05-10 Thread Maximilian Michels
Thanks for the references Luke! I thought that there may have been prior discussions, so this thread could be a good place to consolidate. Dataflow also has an update feature, but it's limited by the fact that Beam does not have a good concept of Coder evolution. As a result we try very hard

Re: request for beam minor release

2019-05-10 Thread Maximilian Michels
Assuming 2.13 will include or otherwise be supported by flink-runner-1.7 then this should not be an issue. Yes, we will keep supporting Flink 1.7 for Beam 2.13. -Max On 08.05.19 19:54, Kenneth Knowles wrote: For the benefit of the thread, I will also call out our incubating LTS (Long-Term

Re: Streaming pipelines in all SDKs!

2019-05-10 Thread Maximilian Michels
So, FlinkRunner has some sort of special support for executing UnboundedSource via the runner in the portable world ? I see a transform override for bounded sources in PortableRunner [1] but nothing for unbounded sources. It's in the translation code:

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
I'm not sure. Generally it affects any runner that uses HashMap to store StateSpec. Jan On 5/9/19 6:32 PM, Reuven Lax wrote: Is this specific to the DirectRunner, or does it affect other runners? On Thu, May 9, 2019 at 8:13 AM Jan Lukavský > wrote: Because of

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
Hi Anton, yes, if the keyCoder doesn't have proper hashCode and equals, then it would manifest exactly as described. Jan On 5/9/19 6:28 PM, Anton Kedin wrote: Does it look similar to https://issues.apache.org/jira/browse/BEAM-6813 ? I also stumbled on a problem with a state in DirectRunner