[jira] [Commented] (BEAM-9629) JdbcIO seems to run out of connections in the connection pool and freezes pipeline

2020-03-30 Thread Tim Robertson (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070806#comment-17070806
 ] 

Tim Robertson commented on BEAM-9629:
-

I'd recommend setting the default to something low (1-3) but documenting that 
clearly and recommending people increase it to a sensible limit for their 
environment (e.g. have it in the default examples so it is prominent).

I expect you could easily hit DB resource limits - e.g. a Spark cluster running 
500 executors opening connections simultaneously by default. We have hit limits 
(not Beam related) with various microservices and Hadoop processes (e.g. Sqoop) 
hammering PostgreSQL.
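
For reference, a rough sketch of how a pipeline can cap the per-worker pool today, 
assuming commons-dbcp2's BasicDataSource is on the classpath and a Beam version 
that exposes withDataSourceProviderFn; the driver, URL and query are placeholders:

{code:java}
// Sketch only: cap the per-worker pool so hundreds of workers cannot
// exhaust the database's connection limit.
SerializableFunction<Void, DataSource> dataSourceProvider =
    ignored -> {
      BasicDataSource ds = new BasicDataSource();
      ds.setDriverClassName("com.mysql.cj.jdbc.Driver"); // placeholder driver
      ds.setUrl("jdbc:mysql://db-host/mydb");            // placeholder URL
      ds.setMaxTotal(2);                                 // small, documented default
      return ds;
    };

pipeline.apply(
    JdbcIO.<KV<Integer, String>>read()
        .withDataSourceProviderFn(dataSourceProvider)
        .withQuery("SELECT id, name FROM my_table")      // placeholder query
        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
        .withRowMapper(rs -> KV.of(rs.getInt(1), rs.getString(2))));
{code}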

> JdbcIO seems to run out of connections in the connection pool and freezes 
> pipeline
> --
>
> Key: BEAM-9629
> URL: https://issues.apache.org/jira/browse/BEAM-9629
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-jdbc
>Affects Versions: 2.19.0
> Environment: Dataflow, Direct Runner on macOS Catalina.
>Reporter: Boris Shilov
>Assignee: Ismaël Mejía
>Priority: Major
>  Labels: performance
> Fix For: 2.21.0
>
>
> Greetings,
> I am using JdbcIO via the Scala wrappers provided in the Scio project. I am 
> trying to read a few dozen tables in parallel from MySQL, but above 8 
> concurrent SELECT operations the pipeline freezes. With the help of the Scio 
> maintainers we've been able to isolate the issue as likely originating in 
> JdbcIO running out of connections in the connection pool and idling 
> indefinitely. The issue occurs both on the Direct Runner and Dataflow.
> Please see linked issue for more context: 
> https://github.com/spotify/scio/issues/2774



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8306) improve estimation of data byte size reading from source in ElasticsearchIO

2019-10-31 Thread Tim Robertson (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-8306.
-
Fix Version/s: 2.17.0
   Resolution: Fixed

Thank you [~derek.he] and, once again, apologies that your original PR went so long 
before review.

> improve estimation of data byte size reading from source in ElasticsearchIO
> ---
>
> Key: BEAM-8306
> URL: https://issues.apache.org/jira/browse/BEAM-8306
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Affects Versions: 2.14.0
>Reporter: Derek He
>Assignee: Derek He
>Priority: Major
> Fix For: 2.17.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> ElasticsearchIO splits BoundedSource based on the Elasticsearch index size. 
> We expect it would be more accurate to split based on the query result size.
> Currently, we have a big Elasticsearch index, but the query result only 
> contains a few of the documents in the index. ElasticsearchIO splits it into 
> up to 1024 BoundedSources in Google Dataflow, and it takes a long time to 
> finish processing the small number of Elasticsearch documents.
>  
>  
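
To make the scenario concrete, the read in question is the query-scoped form of 
ElasticsearchIO - roughly as below, where the host, index, type and query are 
placeholders:

{code:java}
// Sketch: a query that matches only a small slice of a large index.
ElasticsearchIO.ConnectionConfiguration conn =
    ElasticsearchIO.ConnectionConfiguration.create(
        new String[] {"http://es-host:9200"}, "big-index", "_doc");

PCollection<String> jsonDocs =
    pipeline.apply(
        ElasticsearchIO.read()
            .withConnectionConfiguration(conn)
            // Splitting on the whole index size ignores how selective this query is.
            .withQuery("{\"query\": {\"term\": {\"status\": \"NEW\"}}}"));
{code}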



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-5192) Support Elasticsearch 7.x

2019-09-26 Thread Tim Robertson (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938907#comment-16938907
 ] 

Tim Robertson commented on BEAM-5192:
-

[~chetaldrich] - help would be greatly appreciated. 
Please feel free to reassign it to yourself.


> Support Elasticsearch 7.x
> -
>
> Key: BEAM-5192
> URL: https://issues.apache.org/jira/browse/BEAM-5192
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Reporter: Etienne Chauchot
>Assignee: Tim Robertson
>Priority: Major
>
> ES v7 is not out yet, but the Elastic team has scheduled a breaking change for 
> ES 7.0: the removal of the type feature. See 
> [https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch]
> This will require a good amount of changes in the IO. 
> This ticket is here to track the future work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-3386) Dependency conflict when Calcite is included in a project.

2019-08-27 Thread Tim Robertson (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916767#comment-16916767
 ] 

Tim Robertson commented on BEAM-3386:
-

Same as [~JozoVilcek] - trying to run on Spark and Hive today using Beam 2.15 
and this issue remains.

> Dependency conflict when Calcite is included in a project.
> --
>
> Key: BEAM-3386
> URL: https://issues.apache.org/jira/browse/BEAM-3386
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Affects Versions: 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0
>Reporter: Austin Haas
>Assignee: Kai Jiang
>Priority: Critical
>
> When Calcite (v. 1.13.0) is included in a project that also includes Beam and 
> the Beam SQL extension, then the following error is thrown when trying to run 
> Beam code.
> ClassCastException 
> org.apache.beam.sdk.extensions.sql.impl.planner.BeamRelDataTypeSystem cannot 
> be cast to org.apache.calcite.rel.type.RelDataTypeSystem
> org.apache.calcite.jdbc.CalciteConnectionImpl. 
> (CalciteConnectionImpl.java:120)
> 
> org.apache.calcite.jdbc.CalciteJdbc41Factory$CalciteJdbc41Connection. 
> (CalciteJdbc41Factory.java:114)
> org.apache.calcite.jdbc.CalciteJdbc41Factory.newConnection 
> (CalciteJdbc41Factory.java:59)
> org.apache.calcite.jdbc.CalciteJdbc41Factory.newConnection 
> (CalciteJdbc41Factory.java:44)
> org.apache.calcite.jdbc.CalciteFactory.newConnection 
> (CalciteFactory.java:53)
> org.apache.calcite.avatica.UnregisteredDriver.connect 
> (UnregisteredDriver.java:138)
> java.sql.DriverManager.getConnection (DriverManager.java:664)
> java.sql.DriverManager.getConnection (DriverManager.java:208)
> 
> org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.tools.Frameworks.withPrepare
>  (Frameworks.java:145)
> 
> org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.tools.Frameworks.withPlanner
>  (Frameworks.java:106)
> 
> org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.prepare.PlannerImpl.ready
>  (PlannerImpl.java:140)
> 
> org.apache.beam.sdks.java.extensions.sql.repackaged.org.apache.calcite.prepare.PlannerImpl.parse
>  (PlannerImpl.java:170)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-7561) HdfsFileSystem is unable to match a directory

2019-06-19 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867384#comment-16867384
 ] 

Tim Robertson commented on BEAM-7561:
-

[~kedin] - this is now merged. Thank you for waiting for this.

> HdfsFileSystem is unable to match a directory
> -
>
> Key: BEAM-7561
> URL: https://issues.apache.org/jira/browse/BEAM-7561
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop-file-system
>Affects Versions: 2.13.0
>Reporter: David Moravek
>Assignee: David Moravek
>Priority: Major
> Fix For: 2.14.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> [FileSystems.match|https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystems.html#match-java.util.List-]
>  method should be able to match a directory according to its javadoc. Unlike 
> _HdfsFileSystems_, _LocalFileSystems_ behaves as I would expect, so I'm assuming 
> this is a bug on the _HdfsFileSystems_ side.
> *Current behavior:*
> _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with an 
> empty metadata
> *Expected behavior:*
> _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with 
> metadata about the directory
> **
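
A minimal reproduction (inside a method that can throw IOException; the path is a 
placeholder):

{code:java}
// Sketch: match a directory spec and inspect the returned metadata.
MatchResult result = FileSystems.match("hdfs:///tmp/dir");
for (MatchResult.Metadata metadata : result.metadata()) {
  // Expected: one entry describing the directory itself; observed on
  // HadoopFileSystem before the fix: an empty list.
  System.out.println(metadata.resourceId() + " " + metadata.sizeBytes());
}
{code}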



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-7561) HdfsFileSystem is unable to match a directory

2019-06-19 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-7561.
-
Resolution: Fixed

> HdfsFileSystem is unable to match a directory
> -
>
> Key: BEAM-7561
> URL: https://issues.apache.org/jira/browse/BEAM-7561
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop-file-system
>Affects Versions: 2.13.0
>Reporter: David Moravek
>Assignee: David Moravek
>Priority: Major
> Fix For: 2.14.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> [FileSystems.match|https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystems.html#match-java.util.List-]
>  method should be able to match a directory according to its javadoc. Unlike 
> _HdfsFileSystems_, _LocalFileSystems_ behaves as I would expect, so I'm assuming 
> this is a bug on the _HdfsFileSystems_ side.
> *Current behavior:*
> _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with an 
> empty metadata
> *Expected behavior:*
> _HadoopFileSystem.match("hdfs:///tmp/dir")_ returns a _MatchResult_ with 
> metadata about the directory
> **



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-7591) Unify FileSystems tests across implementation

2019-06-19 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867381#comment-16867381
 ] 

Tim Robertson commented on BEAM-7591:
-

Copied from PR discussion:

Also, there is a ResourceIdTester (
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/ResourceIdTester.java)
which is meant to validate the ResourceId objects returned. We may want to
collapse the two into one FileSystemTester.
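
For discussion, the kind of shape it could take (FileSystemTester here is 
hypothetical - only ResourceIdTester exists today):

{code:java}
// Hypothetical sketch of a shared conformance test, not an existing class.
public abstract class FileSystemTester {

  /** Each filesystem module supplies a scratch directory it can write to. */
  protected abstract String scratchDir() throws IOException;

  @Test
  public void matchReturnsDirectoryMetadata() throws IOException {
    MatchResult result = FileSystems.match(scratchDir());
    assertFalse(result.metadata().isEmpty());
  }
}
{code}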

> Unify FileSystems tests across implementation
> -
>
> Key: BEAM-7591
> URL: https://issues.apache.org/jira/browse/BEAM-7591
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: David Moravek
>Priority: Minor
>
> Follow-up to a discussion from the [https://github.com/apache/beam/pull/8849] PR.
> Currently there is no unified tester for FileSystems implementations that 
> would make sure the interface is correctly implemented. We should introduce a 
> single FileSystemsTester and reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7517) Document dynamic destinations for avro

2019-06-08 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-7517:
---

 Summary: Document dynamic destinations for avro 
 Key: BEAM-7517
 URL: https://issues.apache.org/jira/browse/BEAM-7517
 Project: Beam
  Issue Type: Improvement
  Components: website
Reporter: Tim Robertson


I've seen several folks struggle with writing Avro files to dynamic locations, 
which I think might be a good addition to the pipeline patterns. 
 
An example showing how it may be done is in [this 
example|https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81]

Notably, it is not clear why this is necessary: 
{code}.withDestinationCoder(StringUtf8Coder.of()){code} 
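
For the record, the pattern in that example boils down to roughly the snippet 
below (the schema and the "country" grouping field are placeholders, and it 
assumes a Beam version where AvroIO.sink(Schema) is available). The destination 
coder is needed because the destinations chosen in by() are grouped internally, 
and Beam cannot always infer a coder for them from a lambda:

{code:java}
// Assumes a PCollection<GenericRecord> "records" and its Avro Schema "schema"
// are already in scope; "country" is a placeholder field used as the destination.
records.apply(
    FileIO.<String, GenericRecord>writeDynamic()
        .by(r -> String.valueOf(r.get("country")))
        // Coder for the destination type picked in by() - see the question above.
        .withDestinationCoder(StringUtf8Coder.of())
        .via(AvroIO.sink(schema))
        .to("hdfs:///exports")  // placeholder output directory
        .withNaming(country -> FileIO.Write.defaultNaming("export-" + country, ".avro")));
{code}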



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-7400) Authentication for ESIO seems to be failing

2019-05-23 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846521#comment-16846521
 ] 

Tim Robertson commented on BEAM-7400:
-

Given the complexities around security in the ES distro, we might want to 
explore if the Amazon distro 
([https://aws.amazon.com/blogs/aws/new-open-distro-for-elasticsearch/]) makes 
this easier, as it claims to have security enabled by default and "is not a 
fork". 

> Authentication for ESIO seems to be failing
> --
>
> Key: BEAM-7400
> URL: https://issues.apache.org/jira/browse/BEAM-7400
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-elasticsearch
>Reporter: Etienne Chauchot
>Priority: Major
>
> It seems to be failing, at least on Elasticsearch Cloud:
> {code}
> [WARNING] 
> org.elasticsearch.client.ResponseException: GET 
> https://a3626ec04ef549318d444cde10db468d.europe-west1.gcp.cloud.es.io:9243/: 
> HTTP/1.1 401 Unauthorized
> {"error":{"root_cause":[{"type":"security_exception","reason":"action 
> [cluster:monitor/main] requires 
> authentication","header":{"WWW-Authenticate":["Bearer 
> realm=\"security\"","ApiKey","Basic realm=\"security\" 
> charset=\"UTF-8\""]}}],"type":"security_exception","reason":"action 
> [cluster:monitor/main] requires 
> authentication","header":{"WWW-Authenticate":["Bearer 
> realm=\"security\"","ApiKey","Basic realm=\"security\" 
> charset=\"UTF-8\""]}},"status":401}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-6851) HadoopFileSystem doesn't Support Recursive Globs (**)

2019-03-19 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson closed BEAM-6851.
---
   Resolution: Fixed
Fix Version/s: 2.12.0

Thanks for opening this and supplying the fix [~winkelman.kyle]

> HadoopFileSystem doesn't Support Recursive Globs (**)
> -
>
> Key: BEAM-6851
> URL: https://issues.apache.org/jira/browse/BEAM-6851
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-hadoop
>Reporter: Kyle Winkelman
>Priority: Minor
> Fix For: 2.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The LocalFileSystem supports recursive globs, **, through use of the [glob 
> syntax|https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob]
>  and it would be nice to similarly allow it in the HadoopFileSystem.
> The FileSystem docs say:
> {code:java}
> All FileSystem implementations should support glob in the final hierarchical 
> path component of ResourceIdT. This allows SDK libraries to construct file 
> system agnostic spec. FileSystems can support additional patterns for 
> user-provided specs.
> {code}
> This clearly points out that recursive globs aren't required but may be 
> supported as an additional pattern for user-provided specs.
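
For illustration, the kind of spec being asked for (the path is a placeholder):

{code:java}
// Sketch: a recursive glob that LocalFileSystem accepts today and that
// HadoopFileSystem supports once this change is in.
PCollection<MatchResult.Metadata> files =
    pipeline.apply(FileIO.match().filepattern("hdfs:///logs/**/*.gz"));
{code}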



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-6842) Refactor KuduIO (Java) tests to use embedded Kudu

2019-03-15 Thread Tim Robertson (JIRA)
Tim Robertson created BEAM-6842:
---

 Summary: Refactor KuduIO (Java) tests to use embedded Kudu
 Key: BEAM-6842
 URL: https://issues.apache.org/jira/browse/BEAM-6842
 Project: Beam
  Issue Type: Improvement
  Components: io-java-utilities
Reporter: Tim Robertson


Kudu 1.9.0 [introduces|https://kudu.apache.org/docs/release_notes.html] the 
ability to start an embedded cluster for testing against. Beam unit tests 
currently mock the Kudu server but the embedded server would be a far better 
test. The integration tests should continue to use Kudu in Docker.

Note that Kudu marks this as experimental, but the Kudu developers will appreciate 
feedback if issues arise.
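
For reference, Kudu ships this as a JUnit rule in the kudu-test-utils artifact - a 
rough sketch of what the refactored test setup could look like (the harness class 
and method names are taken from the Kudu 1.9.0 docs; treat them as an assumption):

{code:java}
// Sketch: replace the mocked Kudu client with an embedded mini cluster.
// Requires org.apache.kudu:kudu-test-utils (plus the kudu-binary artifact).
public class KuduIOEmbeddedTest {

  @Rule public KuduTestHarness harness = new KuduTestHarness();

  @Test
  public void readsFromEmbeddedCluster() throws Exception {
    KuduClient client = harness.getClient();
    // Create a table and write a few rows with the client, then run the
    // KuduIO read pipeline against harness.getMasterAddressesAsString().
  }
}
{code}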



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-6842) Refactor KuduIO tests to use embedded Kudu

2019-03-15 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated BEAM-6842:

Summary: Refactor KuduIO tests to use embedded Kudu  (was: Refactor KuduIO 
(Java) tests to use embedded Kudu)

> Refactor KuduIO tests to use embedded Kudu
> --
>
> Key: BEAM-6842
> URL: https://issues.apache.org/jira/browse/BEAM-6842
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-utilities
>Reporter: Tim Robertson
>Priority: Minor
>
> Kudu 1.9.0 [introduces|https://kudu.apache.org/docs/release_notes.html] the 
> ability to start an embedded cluster for testing against. Beam unit tests 
> currently mock the Kudu server but the embedded server would be a far better 
> test. The integration tests should continue to use Kudu in Docker.
> Note that Kudu marks this as experimental, but the Kudu developers will 
> appreciate feedback if issues arise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5925) Test flake in ElasticsearchIOTest.testWriteFullAddressing

2018-11-07 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677889#comment-16677889
 ] 

Tim Robertson commented on BEAM-5925:
-

CC [~wouts] too, a new ES contributor who is also looking at timeouts in his 
environment 

> Test flake in ElasticsearchIOTest.testWriteFullAddressing
> -
>
> Key: BEAM-5925
> URL: https://issues.apache.org/jira/browse/BEAM-5925
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-elasticsearch
>Reporter: Kenneth Knowles
>Assignee: Etienne Chauchot
>Priority: Critical
>
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_GradleBuild/1789/
> https://scans.gradle.com/s/j42mwdsn5svcs
> {code}
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: 
> listener timeout after waiting for [3] ms
> {code}
> Log looks like this:
> {code}
> [2018-10-31T04:06:07,571][INFO ][o.a.b.s.i.e.ElasticsearchIOTest] 
> [testWriteFullAddressing]: before test
> [2018-10-31T04:06:07,572][INFO ][o.a.b.s.i.e.ElasticsearchIOTest] 
> [ElasticsearchIOTest#testWriteFullAddressing]: setting up test
> [2018-10-31T04:06:07,589][INFO ][o.e.c.m.MetaDataIndexTemplateService] 
> [node_s0] adding template [random_index_template] for index patterns [*]
> [2018-10-31T04:06:07,645][INFO ][o.a.b.s.i.e.ElasticsearchIOTest] 
> [ElasticsearchIOTest#testWriteFullAddressing]: all set up test
> [2018-10-31T04:06:10,536][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [galilei] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:33,963][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [curie] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,034][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [darwin] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,050][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [copernicus] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,075][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [faraday] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,095][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [bohr] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,113][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [pasteur] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,142][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [einstein] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,205][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [maxwell] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:34,226][INFO ][o.e.c.m.MetaDataCreateIndexService] 
> [node_s0] [newton] creating index, cause [auto(bulk api)], templates 
> [random_index_template], shards [6]/[0], mappings []
> [2018-10-31T04:06:36,914][INFO ][o.e.c.r.a.AllocationService] [node_s0] 
> Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards 
> started [[galilei][4], [galilei][5]] ...]).
> [2018-10-31T04:06:36,970][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [galilei/Vn1b8XXVSAmrTb5BVe2IJQ] create_mapping [TYPE_1]
> [2018-10-31T04:06:37,137][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [newton/bjnImLt_QguBGEFH9lBJ6Q] create_mapping [TYPE_-1]
> [2018-10-31T04:06:37,385][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [maxwell/-RZ32NbRRZWaGaVfaptFIA] create_mapping [TYPE_0]
> [2018-10-31T04:06:37,636][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [einstein/2lgF5Vj6Ti2KTS-pYSzv3Q] create_mapping [TYPE_1]
> [2018-10-31T04:06:37,806][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [pasteur/832OwzleRSOHsWx85vOH-w] create_mapping [TYPE_0]
> [2018-10-31T04:06:38,103][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [bohr/9YTwB1yvTYKf9YjYCmHjwg] create_mapping [TYPE_1]
> [2018-10-31T04:06:38,229][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [faraday/vIMYG8vpTQKqNkyajcFOxw] create_mapping [TYPE_0]
> [2018-10-31T04:06:38,576][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> [copernicus/NzCZssInSiOdZKTmLCoXRw] create_mapping [TYPE_1]
> [2018-10-31T04:06:38,890][INFO ][o.e.c.m.MetaDataMappingService] [node_s0] 
> 

[jira] [Resolved] (BEAM-5725) ElasticsearchIO RetryConfiguration response parse failure

2018-11-07 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-5725.
-
   Resolution: Fixed
Fix Version/s: 2.9.0

Thank you [~wouts] - I believe this was your first contribution? It's great to 
see people motivated to fix the issues they find. 

> ElasticsearchIO RetryConfiguration response parse failure
> -
>
> Key: BEAM-5725
> URL: https://issues.apache.org/jira/browse/BEAM-5725
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-elasticsearch
>Reporter: Wout Scheepers
>Assignee: Wout Scheepers
>Priority: Major
> Fix For: 2.9.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When using .withRetryConfiguration() for ElasticsearchIO, I get the following 
> stacktrace:
>  
>  
> {code:java}
> Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: No 
> content to map due to end-of-input
> at [Source: (org.apache.http.nio.entity.ContentInputStream); line: 1, column: 
> 0]
> at 
> com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59)
> at 
> com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4133)
> at 
> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3988)
> at 
> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.parseResponse(ElasticsearchIO.java:173)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.checkForErrors(ElasticsearchIO.java:177)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.flushBatch(ElasticsearchIO.java:1204)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.finishBundle(ElasticsearchIO.java:1175)
> {code}
>  
>  
> Probably the elastic response object's content stream is consumed twice, 
> resulting in a MismatchedInputException.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-10-30 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-5036.
-
   Resolution: Fixed
Fix Version/s: 2.9.0

Thanks everyone for sticking with this - merged

 

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
> Fix For: 2.9.0
>
>  Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-4783) Add bundleSize parameter to control splitting of Spark sources (useful for Dynamic Allocation)

2018-10-30 Thread Tim Robertson (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved BEAM-4783.
-
Resolution: Fixed

Thanks [~iemejia] 

> Add bundleSize parameter to control splitting of Spark sources (useful for 
> Dynamic Allocation)
> --
>
> Key: BEAM-4783
> URL: https://issues.apache.org/jira/browse/BEAM-4783
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Affects Versions: 2.8.0
>Reporter: Kyle Winkelman
>Assignee: Kyle Winkelman
>Priority: Major
> Fix For: 2.9.0, 2.8.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> When the spark-runner is used along with the configuration 
> spark.dynamicAllocation.enabled=true the SourceRDD does not detect this. It 
> then falls back to the value calculated in this description:
>   // when running on YARN/SparkDeploy it's the result of max(totalCores, 
> 2).
>   // when running on Mesos it's 8.
>   // when running local it's the total number of cores (local = 1, 
> local[N] = N,
>   // local[*] = estimation of the machine's cores).
>   // ** the configuration "spark.default.parallelism" takes precedence 
> over all of the above **
> So in most cases this default is quite small. This is an issue when using a 
> very large input file as it will only get split in half.
> I believe that when Dynamic Allocation is enabled the SourceRDD should use the 
> DEFAULT_BUNDLE_SIZE and possibly expose a SparkPipelineOptions setting that 
> allows you to change this DEFAULT_BUNDLE_SIZE.
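
Assuming the option lands as a bundleSize field (in bytes) on SparkPipelineOptions, 
usage would be along these lines:

{code:java}
// Sketch: request 64 MB bundles instead of the parallelism-derived default.
SparkPipelineOptions options =
    PipelineOptionsFactory.fromArgs("--runner=SparkRunner", "--bundleSize=67108864")
        .as(SparkPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
{code}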



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5725) ElasticsearchIO RetryConfiguration response parse failure

2018-10-30 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668334#comment-16668334
 ] 

Tim Robertson commented on BEAM-5725:
-

Copying from the user@ mailing list:

I took a super quick look at the code:
 # On a retry scenario it calls handleRetry()
 # Within handleRetry() it gets the DefaultRetryPredicate and calls 
test(response) - this reads the response stream to JSON
 # When the retry is successful (no 429 code) the response is returned
 # The response is then passed in to checkForErrors(...)
 # This then tries to parse the response by reading the response stream, which was 
already read in step 2, so the stream cannot be read again (see the sketch below)

 
[~wouts] - this is something we also would like to see addressed. 2.9.0 will be 
cut in 3 weeks - shall we aim for that? 
Are you still keen to fix it, or would you like us to look? (CC [~aalbatross] 
for info too).
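
If the double read is indeed the cause, one way to address it is to buffer the 
entity once so that both the retry predicate and checkForErrors() read from 
memory - a sketch only (response here is the low-level REST client Response), not 
necessarily the fix that lands:

{code:java}
// Sketch: BufferedHttpEntity is repeatable, so getContent() can be consumed
// by the retry predicate (step 2) and again by checkForErrors() (step 5).
HttpEntity buffered = new BufferedHttpEntity(response.getEntity());
ObjectMapper mapper = new ObjectMapper();
JsonNode forRetryCheck = mapper.readTree(buffered.getContent());
JsonNode forErrorCheck = mapper.readTree(buffered.getContent());
{code}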

> ElasticsearchIO RetryConfiguration response parse failure
> -
>
> Key: BEAM-5725
> URL: https://issues.apache.org/jira/browse/BEAM-5725
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-elasticsearch
>Reporter: Wout Scheepers
>Assignee: Wout Scheepers
>Priority: Major
>
> When using .withRetryConfiguration() for ElasticsearchIO, I get the following 
> stacktrace:
>  
>  
> {code:java}
> Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: No 
> content to map due to end-of-input
> at [Source: (org.apache.http.nio.entity.ContentInputStream); line: 1, column: 
> 0]
> at 
> com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59)
> at 
> com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4133)
> at 
> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3988)
> at 
> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.parseResponse(ElasticsearchIO.java:173)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.checkForErrors(ElasticsearchIO.java:177)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.flushBatch(ElasticsearchIO.java:1204)
> at 
> org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.finishBundle(ElasticsearchIO.java:1175)
> {code}
>  
>  
> Probably the elastic response object's content stream is consumed twice, 
> resulting in a MismatchedInputException.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-10-23 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660399#comment-16660399
 ] 

Tim Robertson commented on BEAM-5036:
-

[~JozoVilcek] - I anticipate fixing it within a week or so, if you can wait that 
long (sorry, I've been super busy). We've been running the PR I wrote for a while 
in batch pipelines ([we backported 
it|https://github.com/gbif/pipelines/tree/master/pipelines/beam-common/src/main/java/org/apache/beam/sdk/io]),
 as without it Beam is mostly unusable on HDFS. 

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-10-22 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658891#comment-16658891
 ] 

Tim Robertson commented on BEAM-5036:
-

Hi [~JozoVilcek]

The solution caused confusion, which in itself is an issue - if reviewers don't 
understand it, it will be a problem for us to maintain.
{quote}...to make change only in HDFS filesystem implementation, how do you 
propose to do it? 
{quote}
Move the retry behaviour from the original PR into the HDFS implementation and 
stop throwing FileAlreadyExistsException.
{quote}The WriteOperation.moveToOutput() is now using `FileSystems.copy`, this 
would need to change to `rename` anyway, right?
{quote}
Yes
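
Concretely, the change in moveToOutput() amounts to something like the following, 
assuming srcFiles and destFiles are the List<ResourceId> pairs already built there 
(a sketch against the FileSystems API, not the merged patch):

{code:java}
// Before: move implemented as copy + delete (rewrites every byte).
FileSystems.copy(srcFiles, destFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
FileSystems.delete(srcFiles, StandardMoveOptions.IGNORE_MISSING_FILES);

// After: a single rename, which filesystems such as HDFS can perform as a
// cheap metadata operation.
FileSystems.rename(srcFiles, destFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
{code}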

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-files
>Affects Versions: 2.5.0
>Reporter: Jozef Vilcek
>Assignee: Tim Robertson
>Priority: Major
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements move as 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)