[jira] [Commented] (BEAM-9629) JdbcIO seems to run out of connections in the connection pool and freezes pipeline
[ https://issues.apache.org/jira/browse/BEAM-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070806#comment-17070806 ] Tim Robertson commented on BEAM-9629: - I'd recommend setting the default to something low (1-3) but documenting that clearly and recommending people increase it to a sensible limit for their environment (e.g. have it in the default examples so it is prominent). I expect you could easily hit DB resource limits - e.g. a Spark cluster running 500 executors opening connections simultaneously by default. We have hit limits (not Beam related) with various microservices and Hadoop processes (e.g. Sqoop) hammering PostgreSQL. > JdbcIO seems to run out of connections in the connection pool and freezes > pipeline > -- > > Key: BEAM-9629 > URL: https://issues.apache.org/jira/browse/BEAM-9629 > Project: Beam > Issue Type: Bug > Components: io-java-jdbc >Affects Versions: 2.19.0 > Environment: Dataflow, Direct Runner on macOS Catalina. >Reporter: Boris Shilov >Assignee: Ismaël Mejía >Priority: Major > Labels: performance > Fix For: 2.21.0 > > > Greetings, > I am using JdbcIO via the Scala wrappers provided in the Scio project. I am > trying to read a few dozen tables in parallel from MySQL, but above 8 > concurrent SELECT operations the pipeline freezes. With help of the Scio > maintainers we've been able to isolate the issue as likely originating in > JdbcIO running out of connections in the connection pool and idling > indefinitely. The issue occurs both on the Direct Runner and Dataflow. > Please see linked issue for more context: > https://github.com/spotify/scio/issues/2774 -- This message was sent by Atlassian Jira (v8.3.4#803005)
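The trade-off recommended above - a small per-worker connection cap so that hundreds of executors cannot exhaust the database - can be sketched with plain JDK primitives. This is an illustrative toy, not JdbcIO's implementation; the class and method names are invented:

```java
import java.util.concurrent.Semaphore;

// Toy model of a bounded connection pool: a Semaphore caps how many
// "connections" may be held at once. With a low default (1-3) per worker,
// 500 executors cannot collectively open thousands of DB connections.
public class BoundedPool {
    private final Semaphore permits;

    public BoundedPool(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    // Returns true only if a connection slot is free right now.
    public boolean tryAcquire() {
        return permits.tryAcquire();
    }

    public void release() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedPool pool = new BoundedPool(2);
        System.out.println(pool.tryAcquire()); // true
        System.out.println(pool.tryAcquire()); // true
        System.out.println(pool.tryAcquire()); // false: pool exhausted, caller must wait
        pool.release();
        System.out.println(pool.tryAcquire()); // true again after release
    }
}
```

A real pool (e.g. commons-dbcp2's BasicDataSource with a max-total setting) blocks or times out instead of returning false, but the back-pressure idea is the same.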
[jira] [Resolved] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson resolved BEAM-8306. - Fix Version/s: 2.17.0 Resolution: Fixed Thank you [~derek.he] and once again - apologies that your original PR went so long before review. > improve estimation of data byte size reading from source in ElasticsearchIO > --- > > Key: BEAM-8306 > URL: https://issues.apache.org/jira/browse/BEAM-8306 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.14.0 >Reporter: Derek He >Assignee: Derek He >Priority: Major > Fix For: 2.17.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > ElasticsearchIO splits BoundedSource based on the Elasticsearch index size. > We expect it can be more accurate to split it based on query result size. > Currently, we have a big Elasticsearch index, but the query result only > contains a few documents from the index. ElasticsearchIO splits it into up > to 1024 BoundedSources in Google Dataflow. It takes a long time to finish > processing the small number of Elasticsearch documents in Google Dataflow. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
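The over-splitting the report describes can be seen with some hypothetical arithmetic. The helper, the bundle size, and the cap constant below are all invented for this sketch; only the 1024-source figure comes from the report:

```java
// Sketch: split count derived from an estimated byte size, capped at a
// maximum number of sources. Estimating from the whole index rather than
// the query result produces far too many (mostly empty) splits.
public class SplitEstimate {
    static long splits(long estimatedBytes, long desiredBundleBytes, long maxSplits) {
        return Math.min(maxSplits, Math.max(1, estimatedBytes / desiredBundleBytes));
    }

    public static void main(String[] args) {
        long indexBytes = 500L * 1024 * 1024 * 1024; // a 500 GB index (invented)
        long queryBytes = 10L * 1024 * 1024;         // but the query matches ~10 MB
        long bundle = 64L * 1024 * 1024;             // 64 MB desired bundle (invented)
        // Estimating from index size hits the source cap...
        System.out.println(splits(indexBytes, bundle, 1024)); // 1024
        // ...while estimating from the query result needs a single source.
        System.out.println(splits(queryBytes, bundle, 1024)); // 1
    }
}
```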
[jira] [Commented] (BEAM-5192) Support Elasticsearch 7.x
[ https://issues.apache.org/jira/browse/BEAM-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938907#comment-16938907 ] Tim Robertson commented on BEAM-5192: - [~chetaldrich] - help would be greatly appreciated. Please feel free to reassign it to yourself. > Support Elasticsearch 7.x > - > > Key: BEAM-5192 > URL: https://issues.apache.org/jira/browse/BEAM-5192 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Tim Robertson >Priority: Major > > ES v7 is not out yet. But the Elastic team has scheduled a breaking change for ES > 7.0: the removal of the type feature. See > [https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch] > This will require a good amount of changes in the IO. > This ticket is there to track the future work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-3386) Dependency conflict when Calcite is included in a project.
[ https://issues.apache.org/jira/browse/BEAM-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916767#comment-16916767 ] Tim Robertson commented on BEAM-3386: - Same as [~JozoVilcek] - trying to run on Spark and Hive today using Beam 2.15 and this issue remains. > Dependency conflict when Calcite is included in a project. > -- > > Key: BEAM-3386 > URL: https://issues.apache.org/jira/browse/BEAM-3386 > Project: Beam > Issue Type: Bug > Components: dsl-sql >Affects Versions: 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0 >Reporter: Austin Haas >Assignee: Kai Jiang >Priority: Critical > > When Calcite (v. 1.13.0) is included in a project that also includes Beam and > the Beam SQL extension, then the following error is thrown when trying to run > Beam code. > ClassCastException > org.apache.beam.sdk.extensions.sql.impl.planner.BeamRelDataTypeSystem cannot > be cast to org.apache.calcite.rel.type.RelDataTypeSystem > org.apache.calcite.jdbc.CalciteConnectionImpl. > (CalciteConnectionImpl.java:120) > > org.apache.calcite.jdbc.CalciteJdbc41Factory$CalciteJdbc41Connection. 
> (CalciteJdbc41Factory.java:114) > org.apache.calcite.jdbc.CalciteJdbc41Factory.newConnection > (CalciteJdbc41Factory.java:59) > org.apache.calcite.jdbc.CalciteJdbc41Factory.newConnection > (CalciteJdbc41Factory.java:44) > org.apache.calcite.jdbc.CalciteFactory.newConnection > (CalciteFactory.java:53) > org.apache.calcite.avatica.UnregisteredDriver.connect > (UnregisteredDriver.java:138) > java.sql.DriverManager.getConnection (DriverManager.java:664) > java.sql.DriverManager.getConnection (DriverManager.java:208) > > org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.tools.Frameworks.withPrepare > (Frameworks.java:145) > > org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.tools.Frameworks.withPlanner > (Frameworks.java:106) > > org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.prepare.PlannerImpl.ready > (PlannerImpl.java:140) > > org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.prepare.PlannerImpl.parse > (PlannerImpl.java:170) -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-7561) HdfsFileSystem is unable to match a directory
[ https://issues.apache.org/jira/browse/BEAM-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867384#comment-16867384 ] Tim Robertson commented on BEAM-7561: - [~kedin] - this is now merged. Thank you for waiting for this. > HdfsFileSystem is unable to match a directory > - > > Key: BEAM-7561 > URL: https://issues.apache.org/jira/browse/BEAM-7561 > Project: Beam > Issue Type: Bug > Components: io-java-hadoop-file-system >Affects Versions: 2.13.0 >Reporter: David Moravek >Assignee: David Moravek >Priority: Major > Fix For: 2.14.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > [FileSystems.match|https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystems.html#match-java.util.List-] > method should be able to match a directory according to the javadoc. Unlike > _HdfsFileSystems_, _LocalFileSystems_ behaves as I would expect, so I'm assuming this > is a bug on the _HdfsFileSystems_ side. > *Current behavior:* > _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with > empty metadata > *Expected behavior:* > _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with > metadata about the directory > ** -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-7561) HdfsFileSystem is unable to match a directory
[ https://issues.apache.org/jira/browse/BEAM-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson resolved BEAM-7561. - Resolution: Fixed > HdfsFileSystem is unable to match a directory > - > > Key: BEAM-7561 > URL: https://issues.apache.org/jira/browse/BEAM-7561 > Project: Beam > Issue Type: Bug > Components: io-java-hadoop-file-system >Affects Versions: 2.13.0 >Reporter: David Moravek >Assignee: David Moravek >Priority: Major > Fix For: 2.14.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > [FileSystems.match|https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystems.html#match-java.util.List-] > method should be able to match a directory according to the javadoc. Unlike > _HdfsFileSystems_, _LocalFileSystems_ behaves as I would expect, so I'm assuming this > is a bug on the _HdfsFileSystems_ side. > *Current behavior:* > _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with > empty metadata > *Expected behavior:* > _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with > metadata about the directory > ** -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7591) Unify FileSystems tests across implementation
[ https://issues.apache.org/jira/browse/BEAM-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867381#comment-16867381 ] Tim Robertson commented on BEAM-7591: - Copied from PR discussion: Also, there is a ResourceIdTester ( https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/ResourceIdTester.java) which is meant to validate ResourceId objects returned. We may want to collapse the two together into one FileSystemTester. > Unify FileSystems tests across implementation > - > > Key: BEAM-7591 > URL: https://issues.apache.org/jira/browse/BEAM-7591 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: David Moravek >Priority: Minor > > Follow-up for a discussion from [https://github.com/apache/beam/pull/8849] PR. > Currently there is no unified tester for FileSystems implementations, that > would make sure interface is correctly implemented. We should introduce a > single FileSystemsTester and reuse it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7517) Document dynamic destinations for avro
Tim Robertson created BEAM-7517: --- Summary: Document dynamic destinations for avro Key: BEAM-7517 URL: https://issues.apache.org/jira/browse/BEAM-7517 Project: Beam Issue Type: Improvement Components: website Reporter: Tim Robertson I've seen several folk struggle with writing Avro files to dynamic locations which I think might be a good addition to the pipelines patterns. An example showing how it may be done is in [this example|https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81] Notably it is not clear why this is necessary {code}.withDestinationCoder(StringUtf8Coder.of()){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7400) Authentication for ESIO seem to be failing
[ https://issues.apache.org/jira/browse/BEAM-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846521#comment-16846521 ] Tim Robertson commented on BEAM-7400: - Given the complexities around security in the ES distro, we might want to explore whether the Amazon distro ([https://aws.amazon.com/blogs/aws/new-open-distro-for-elasticsearch/]) makes this easier, as it claims to have security enabled by default and "is not a fork". > Authentication for ESIO seem to be failing > -- > > Key: BEAM-7400 > URL: https://issues.apache.org/jira/browse/BEAM-7400 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Priority: Major > > It seems to be failing, at least on Elasticsearch Cloud: > {code} > [WARNING] > org.elasticsearch.client.ResponseException: GET > https://a3626ec04ef549318d444cde10db468d.europe-west1.gcp.cloud.es.io:9243/: > HTTP/1.1 401 Unauthorized > {"error":{"root_cause":[{"type":"security_exception","reason":"action > [cluster:monitor/main] requires > authentication","header":{"WWW-Authenticate":["Bearer > realm=\"security\"","ApiKey","Basic realm=\"security\" > charset=\"UTF-8\""]}}],"type":"security_exception","reason":"action > [cluster:monitor/main] requires > authentication","header":{"WWW-Authenticate":["Bearer > realm=\"security\"","ApiKey","Basic realm=\"security\" > charset=\"UTF-8\""]}},"status":401} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (BEAM-6851) HadoopFileSystem doesn't Support Recursive Globs (**)
[ https://issues.apache.org/jira/browse/BEAM-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson closed BEAM-6851. --- Resolution: Fixed Fix Version/s: 2.12.0 Thanks for opening this and supplying the fix [~winkelman.kyle] > HadoopFileSystem doesn't Support Recursive Globs (**) > - > > Key: BEAM-6851 > URL: https://issues.apache.org/jira/browse/BEAM-6851 > Project: Beam > Issue Type: Improvement > Components: io-java-hadoop >Reporter: Kyle Winkelman >Priority: Minor > Fix For: 2.12.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The LocalFileSystem supports recursive globs, **, through use of the [glob > syntax|https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob] > and it would be nice to similarly allow it in the HadoopFileSystem. > The FileSystem docs say: > {code:java} > All FileSystem implementations should support glob in the final hierarchical > path component of ResourceIdT. This allows SDK libraries to construct file > system agnostic spec. FileSystems can support additional patterns for > user-provided specs. > {code} > This clearly points out that recursive globs aren't required but may be > supported as an additional pattern for user-provided specs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
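The `**` semantics being requested can be demonstrated with the JDK glob matcher referenced by the linked docs. This is a standalone sketch of the pattern behavior, not HadoopFileSystem code:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Demonstrates the difference between "*" (stays within one directory level)
// and "**" (crosses directory boundaries) using java.nio glob syntax.
public class GlobDemo {
    public static boolean matches(String glob, String path) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return m.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // "**" matches the nested 2019/03 directories; "*" does not.
        System.out.println(matches("/data/**/*.avro", "/data/2019/03/part-0.avro")); // true
        System.out.println(matches("/data/*.avro", "/data/2019/03/part-0.avro"));    // false
    }
}
```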
[jira] [Created] (BEAM-6842) Refactor KuduIO (Java) tests to use embedded Kudu
Tim Robertson created BEAM-6842: --- Summary: Refactor KuduIO (Java) tests to use embedded Kudu Key: BEAM-6842 URL: https://issues.apache.org/jira/browse/BEAM-6842 Project: Beam Issue Type: Improvement Components: io-java-utilities Reporter: Tim Robertson Kudu 1.9.0 [introduces|https://kudu.apache.org/docs/release_notes.html] the ability to start an embedded cluster for testing against. Beam unit tests currently mock the Kudu server but the embedded server would be a far better test. The integration tests should continue to use Kudu in Docker. Note that Kudu marks this as experimental, but the Kudu developers will appreciate feedback if issues arise. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-6842) Refactor KuduIO tests to use embedded Kudu
[ https://issues.apache.org/jira/browse/BEAM-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson updated BEAM-6842: Summary: Refactor KuduIO tests to use embedded Kudu (was: Refactor KuduIO (Java) tests to use embedded Kudu) > Refactor KuduIO tests to use embedded Kudu > -- > > Key: BEAM-6842 > URL: https://issues.apache.org/jira/browse/BEAM-6842 > Project: Beam > Issue Type: Improvement > Components: io-java-utilities >Reporter: Tim Robertson >Priority: Minor > > Kudu 1.9.0 [introduces|https://kudu.apache.org/docs/release_notes.html] the > ability to start an embedded cluster for testing against. Beam unit tests > currently mock the Kudu server but the embedded server would be a far better > test. The integration tests should continue to use Kudu in Docker. > Note that Kudu marks this as experimental, but the Kudu developers will > appreciate feedback if issues arise. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5925) Test flake in ElasticsearchIOTest.testWriteFullAddressing
[ https://issues.apache.org/jira/browse/BEAM-5925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677889#comment-16677889 ] Tim Robertson commented on BEAM-5925: - CC [~wouts] too, a new ES contributor who is also looking at timeouts in his environment > Test flake in ElasticsearchIOTest.testWriteFullAddressing > - > > Key: BEAM-5925 > URL: https://issues.apache.org/jira/browse/BEAM-5925 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch >Reporter: Kenneth Knowles >Assignee: Etienne Chauchot >Priority: Critical > > https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_GradleBuild/1789/ > https://scans.gradle.com/s/j42mwdsn5svcs > {code} > org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: > listener timeout after waiting for [3] ms > {code} > Log looks like this: > {code} > [2018-10-31T04:06:07,571][INFO ][o.a.b.s.i.e.ElasticsearchIOTest] > [testWriteFullAddressing]: before test > [2018-10-31T04:06:07,572][INFO ][o.a.b.s.i.e.ElasticsearchIOTest] > [ElasticsearchIOTest#testWriteFullAddressing]: setting up test > [2018-10-31T04:06:07,589][INFO ][o.e.c.m.MetaDataIndexTemplateService] > [node_s0] adding template [random_index_template] for index patterns [*] > [2018-10-31T04:06:07,645][INFO ][o.a.b.s.i.e.ElasticsearchIOTest] > [ElasticsearchIOTest#testWriteFullAddressing]: all set up test > [2018-10-31T04:06:10,536][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [galilei] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:33,963][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [curie] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,034][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [darwin] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > 
[2018-10-31T04:06:34,050][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [copernicus] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,075][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [faraday] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,095][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [bohr] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,113][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [pasteur] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,142][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [einstein] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,205][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [maxwell] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:34,226][INFO ][o.e.c.m.MetaDataCreateIndexService] > [node_s0] [newton] creating index, cause [auto(bulk api)], templates > [random_index_template], shards [6]/[0], mappings [] > [2018-10-31T04:06:36,914][INFO ][o.e.c.r.a.AllocationService] [node_s0] > Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards > started [[galilei][4], [galilei][5]] ...]). 
> [2018-10-31T04:06:36,970][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [galilei/Vn1b8XXVSAmrTb5BVe2IJQ] create_mapping [TYPE_1] > [2018-10-31T04:06:37,137][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [newton/bjnImLt_QguBGEFH9lBJ6Q] create_mapping [TYPE_-1] > [2018-10-31T04:06:37,385][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [maxwell/-RZ32NbRRZWaGaVfaptFIA] create_mapping [TYPE_0] > [2018-10-31T04:06:37,636][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [einstein/2lgF5Vj6Ti2KTS-pYSzv3Q] create_mapping [TYPE_1] > [2018-10-31T04:06:37,806][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [pasteur/832OwzleRSOHsWx85vOH-w] create_mapping [TYPE_0] > [2018-10-31T04:06:38,103][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [bohr/9YTwB1yvTYKf9YjYCmHjwg] create_mapping [TYPE_1] > [2018-10-31T04:06:38,229][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [faraday/vIMYG8vpTQKqNkyajcFOxw] create_mapping [TYPE_0] > [2018-10-31T04:06:38,576][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] > [copernicus/NzCZssInSiOdZKTmLCoXRw] create_mapping [TYPE_1] > [2018-10-31T04:06:38,890][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] >
[jira] [Resolved] (BEAM-5725) ElasticsearchIO RetryConfiguration response parse failure
[ https://issues.apache.org/jira/browse/BEAM-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson resolved BEAM-5725. - Resolution: Fixed Fix Version/s: 2.9.0 Thank you [~wouts] - I believe this was your first contribution? It's great to see people motivated to fix the issues they find. > ElasticsearchIO RetryConfiguration response parse failure > - > > Key: BEAM-5725 > URL: https://issues.apache.org/jira/browse/BEAM-5725 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch >Reporter: Wout Scheepers >Assignee: Wout Scheepers >Priority: Major > Fix For: 2.9.0 > > Time Spent: 1h > Remaining Estimate: 0h > > When using .withRetryConfiguration() for ElasticsearchIO, I get the following > stacktrace: > > > {code:java} > Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: No > content to map due to end-of-input > at [Source: (org.apache.http.nio.entity.ContentInputStream); line: 1, column: > 0] > at > com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59) > at > com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4133) > at > com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3988) > at > com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.parseResponse(ElasticsearchIO.java:173) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.checkForErrors(ElasticsearchIO.java:177) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.flushBatch(ElasticsearchIO.java:1204) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.finishBundle(ElasticsearchIO.java:1175) > {code} > > > Probably the elastic response object's content stream is consumed twice, > resulting in a MismatchedInputException. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson resolved BEAM-5036. - Resolution: Fixed Fix Version/s: 2.9.0 Thanks everyone for sticking with this - merged > Optimize FileBasedSink's WriteOperation.moveToOutput() > -- > > Key: BEAM-5036 > URL: https://issues.apache.org/jira/browse/BEAM-5036 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Affects Versions: 2.5.0 >Reporter: Jozef Vilcek >Assignee: Tim Robertson >Priority: Major > Fix For: 2.9.0 > > Time Spent: 14h 20m > Remaining Estimate: 0h > > moveToOutput() methods in FileBasedSink.WriteOperation implements move by > copy+delete. It would be better to use a rename() which can be much more > effective for some filesystems. > Filesystem must support cross-directory rename. BEAM-4861 is related to this > for the case of HDFS filesystem. > Feature was discussed here: > http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
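The difference between the two move strategies can be sketched with `java.nio` on a local filesystem. This is illustrative only: FileBasedSink goes through Beam's FileSystems abstraction, and on HDFS the win comes from rename being a cheap metadata operation while copy rewrites every byte:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Contrasts a single rename with the copy+delete pair that moveToOutput()
// previously issued. Both end states are identical; the cost is not.
public class MoveDemo {
    public static void moveByRename(Path src, Path dst) {
        try {
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE); // one metadata op
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void moveByCopyDelete(Path src, Path dst) {
        try {
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING); // rewrites the data
            Files.delete(src);                                          // then removes the source
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("movedemo");
        Path src = dir.resolve("tmp-output");
        Files.write(src, "records".getBytes());
        Path dst = dir.resolve("final-output");
        moveByRename(src, dst);
        System.out.println(Files.exists(src));                   // false
        System.out.println(new String(Files.readAllBytes(dst))); // records
    }
}
```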
[jira] [Resolved] (BEAM-4783) Add bundleSize parameter to control splitting of Spark sources (useful for Dynamic Allocation)
[ https://issues.apache.org/jira/browse/BEAM-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson resolved BEAM-4783. - Resolution: Fixed Thanks [~iemejia] > Add bundleSize parameter to control splitting of Spark sources (useful for > Dynamic Allocation) > -- > > Key: BEAM-4783 > URL: https://issues.apache.org/jira/browse/BEAM-4783 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Affects Versions: 2.8.0 >Reporter: Kyle Winkelman >Assignee: Kyle Winkelman >Priority: Major > Fix For: 2.9.0, 2.8.0 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > When the spark-runner is used along with the configuration > spark.dynamicAllocation.enabled=true the SourceRDD does not detect this. It > then falls back to the value calculated in this description: > // when running on YARN/SparkDeploy it's the result of max(totalCores, > 2). > // when running on Mesos it's 8. > // when running local it's the total number of cores (local = 1, > local[N] = N, > // local[*] = estimation of the machine's cores). > // ** the configuration "spark.default.parallelism" takes precedence > over all of the above ** > So in most cases this default is quite small. This is an issue when using a > very large input file as it will only get split in half. > I believe that when Dynamic Allocation is enabled the SourceRDD should use the > DEFAULT_BUNDLE_SIZE and possibly expose a SparkPipelineOptions that allows > you to change this DEFAULT_BUNDLE_SIZE. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
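The effect described - a very large input split only in half under a small default parallelism - can be illustrated with invented numbers. The method names, the 64 MB bundle size, and the rule below are a sketch of the proposal, not the SourceRDD code:

```java
// Sketch of the proposed partitioning rule: with dynamic allocation, size
// bundles by a byte target instead of the (often tiny) default parallelism.
public class BundleEstimator {
    static final long DEFAULT_BUNDLE_SIZE = 64L * 1024 * 1024; // 64 MB, assumed

    static long numPartitions(long totalBytes, int defaultParallelism, boolean dynamicAllocation) {
        if (dynamicAllocation) {
            return Math.max(1, totalBytes / DEFAULT_BUNDLE_SIZE);
        }
        return defaultParallelism; // e.g. max(totalCores, 2) on YARN
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;
        // A 10 GB input on a 2-core default gets split only in half...
        System.out.println(numPartitions(tenGb, 2, false)); // 2
        // ...while bundle-size splitting yields 160 partitions.
        System.out.println(numPartitions(tenGb, 2, true));  // 160
    }
}
```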
[jira] [Commented] (BEAM-5725) ElasticsearchIO RetryConfiguration response parse failure
[ https://issues.apache.org/jira/browse/BEAM-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668334#comment-16668334 ] Tim Robertson commented on BEAM-5725: - Copying from the user@ mailing list: I took a super quick look at the code: # On a retry scenario it calls handleRetry() # Within handleRetry() it gets the DefaultRetryPredicate and calls test(response) - this reads the response stream to JSON # When the retry is successful (no 429 code) the response is returned # The response is then passed in to checkForErrors(...) # This then tries to parse the response by reading the response stream. It was already read in step 2 hence the stream cannot be read again [~wouts] - this is something we also would like to see addressed. 2.9.0 will be cut in 3 weeks - shall we aim for that? Are you still keen to fix it, or would you like us to look? (CC [~aalbatross] for info too). > ElasticsearchIO RetryConfiguration response parse failure > - > > Key: BEAM-5725 > URL: https://issues.apache.org/jira/browse/BEAM-5725 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch >Reporter: Wout Scheepers >Assignee: Wout Scheepers >Priority: Major > > When using .withRetryConfiguration() for ElasticsearchIO, I get the following > stacktrace: > > > {code:java} > Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: No > content to map due to end-of-input > at [Source: (org.apache.http.nio.entity.ContentInputStream); line: 1, column: > 0] > at > com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59) > at > com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4133) > at > com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3988) > at > com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.parseResponse(ElasticsearchIO.java:173) > at > 
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.checkForErrors(ElasticsearchIO.java:177) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.flushBatch(ElasticsearchIO.java:1204) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.finishBundle(ElasticsearchIO.java:1175) > {code} > > > Probably the elastic response object's content stream is consumed twice, > resulting in a MismatchedInputException. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
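The suspected double-consumption is easy to reproduce with any one-shot stream. This is a toy model of the failure mode, not the ElasticsearchIO code path: the retry predicate drains the HTTP entity stream, so the second parse in checkForErrors() sees end-of-input:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Reads a stream twice without resetting it, mirroring what happens when the
// retry predicate parses the response body and checkForErrors() parses again.
public class OneShotStream {
    // Drains the stream to a String, much as ObjectMapper.readValue consumes it.
    static String readAll(InputStream in) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream entity = new ByteArrayInputStream("{\"errors\":false}".getBytes());
        String first = readAll(entity);   // the retry predicate reads the body...
        String second = readAll(entity);  // ...so the second reader gets nothing
        System.out.println(first);            // {"errors":false}
        System.out.println(second.isEmpty()); // true: "No content to map due to end-of-input"
    }
}
```

Buffering the body once (or wrapping the entity so it can be re-read) is the usual shape of a fix for this class of bug.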
[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660399#comment-16660399 ] Tim Robertson commented on BEAM-5036: - [~JozoVilcek] - I anticipate fixing it within a week or so if you can wait that long (sorry been super busy)? We've been running the PR I wrote for a while though in batch pipelines ([we backported it|https://github.com/gbif/pipelines/tree/master/pipelines/beam-common/src/main/java/org/apache/beam/sdk/io]) as without it Beam is mostly unusable on HDFS. > Optimize FileBasedSink's WriteOperation.moveToOutput() > -- > > Key: BEAM-5036 > URL: https://issues.apache.org/jira/browse/BEAM-5036 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Affects Versions: 2.5.0 >Reporter: Jozef Vilcek >Assignee: Tim Robertson >Priority: Major > Time Spent: 11h > Remaining Estimate: 0h > > moveToOutput() methods in FileBasedSink.WriteOperation implements move by > copy+delete. It would be better to use a rename() which can be much more > effective for some filesystems. > Filesystem must support cross-directory rename. BEAM-4861 is related to this > for the case of HDFS filesystem. > Feature was discussed here: > http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658891#comment-16658891 ] Tim Robertson commented on BEAM-5036: - Hi [~JozoVilcek] The solution caused confusion which in itself is an issue - if reviewers don't understand it, then it will be a problem for us to maintain. {quote}...to make change only in HDFS filesystem implementation, how do you propose to do it? {quote} Move the retry behaviour from the original PR into the HDFS implementation and stop throwing FileAlreadyExistsException. {quote}The WriteOperation.moveToOutput() is now using `FileSystems.copy`, this would need to change to `rename` anyway, right? {quote} Yes > Optimize FileBasedSink's WriteOperation.moveToOutput() > -- > > Key: BEAM-5036 > URL: https://issues.apache.org/jira/browse/BEAM-5036 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Affects Versions: 2.5.0 >Reporter: Jozef Vilcek >Assignee: Tim Robertson >Priority: Major > Time Spent: 11h > Remaining Estimate: 0h > > moveToOutput() methods in FileBasedSink.WriteOperation implements move by > copy+delete. It would be better to use a rename() which can be much more > effective for some filesystems. > Filesystem must support cross-directory rename. BEAM-4861 is related to this > for the case of HDFS filesystem. > Feature was discussed here: > http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)