[jira] [Commented] (BEAM-14064) ElasticSearchIO#Write buffering and outputting across windows
[ https://issues.apache.org/jira/browse/BEAM-14064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503640#comment-17503640 ]

Etienne Chauchot commented on BEAM-14064:
-----------------------------------------

[~egalpin] don't worry, you did a great job fixing it! But obviously, as the PR was merged before the ticket creation, you could not have referenced it. It is only a pity that I did not know it was linked to a bug, otherwise I would have reviewed it quicker and included it in 2.37.0 :)

> ElasticSearchIO#Write buffering and outputting across windows
> --------------------------------------------------------------
>
>                 Key: BEAM-14064
>                 URL: https://issues.apache.org/jira/browse/BEAM-14064
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch
>    Affects Versions: 2.35.0, 2.36.0
>            Reporter: Luke Cwik
>            Assignee: Evan Galpin
>            Priority: P2
>
> Source: https://lists.apache.org/thread/mtwtno2o88lx3zl12jlz7o5w1lcgm2db
> Bug PR: https://github.com/apache/beam/pull/15381
>
> ElasticsearchIO is collecting results from elements in window X and then
> trying to output them in window Y when flushing the batch. This exposed a
> bug where buffered elements were being output as part of a different window
> than the window that produced them.
> This became visible because validation was added recently to ensure that
> when the pipeline is processing elements in window X, output with a
> timestamp must be valid for window X. Note that this validation only occurs
> in *@ProcessElement*, since output there is associated with the current
> window of the input element being processed.
> It is ok to do this in *@FinishBundle*, since there is no existing windowing
> context and, when you output, that element is assigned to an appropriate
> window.
> *Further Context*
> We’ve bisected it to being introduced in 2.35.0, and I’m reasonably certain
> it’s this PR: https://github.com/apache/beam/pull/15381
> Our scenario is pretty trivial: we read off Pubsub and write to Elastic in a
> streaming job. The config for the source and sink is, respectively,
> {noformat}
> pipeline.apply(
>     PubsubIO.readStrings().fromSubscription(subscription)
> ).apply(ParseJsons.of(OurObject::class.java))
>     .setCoder(KryoCoder.of())
> {noformat}
> and
> {noformat}
> ElasticsearchIO.write()
>     .withUseStatefulBatches(true)
>     .withMaxParallelRequestsPerWindow(1)
>     .withMaxBufferingDuration(Duration.standardSeconds(30))
>     // 5 bytes -> KiB -> MiB, so 5 MiB
>     .withMaxBatchSizeBytes(5L * 1024 * 1024)
>     // # of docs
>     .withMaxBatchSize(1000)
>     .withConnectionConfiguration(
>         ElasticsearchIO.ConnectionConfiguration.create(
>             arrayOf(host),
>             "fubar",
>             "_doc"
>         ).withConnectTimeout(5000)
>             .withSocketTimeout(3)
>     )
>     .withRetryConfiguration(
>         ElasticsearchIO.RetryConfiguration.create(
>             10,
>             // the duration is wall clock, against the connection and
>             // socket timeouts specified above. I.e., 10 x 30s is gonna be
>             // more than 3 minutes, so if we're getting 10 socket timeouts
>             // in a row, this would ignore the "10" part and terminate
>             // after 6. The idea is that in a mixed failure mode, you'd get
>             // different timeouts of different durations, and on average
>             // 10 x fails < 4m.
>             // That said, 4m is arbitrary, so adjust as and when needed.
>             Duration.standardMinutes(4)
>         )
>     )
>     .withIdFn { f: JsonNode -> f["id"].asText() }
>     .withIndexFn { f: JsonNode -> f["schema_name"].asText() }
>     .withIsDeleteFn { f: JsonNode -> f["_action"].asText("noop") == "delete" }
> {noformat}
> We recently tried upgrading 2.33 to 2.36 and immediately hit a bug in the
> consumer, due to alleged time skew, specifically
> {noformat}
> 2022-03-07 10:48:37.886 GMT Error message from worker:
> java.lang.IllegalArgumentException: Cannot output with timestamp
> 2022-03-07T10:43:38.640Z. Output timestamps must be no earlier than the
> timestamp of the current input (2022-03-07T10:43:43.562Z) minus the allowed
> skew (0 milliseconds) and no later than 294247-01-10T04:00:54.775Z. See the
> DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed
> skew.
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.checkTimestamp(SimpleDoFnRunner.java:446)
> org.apache.beam.runners.dataflow.w
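To make the *@ProcessElement* / *@FinishBundle* distinction above concrete, here is a minimal sketch of the buffering pattern (hypothetical names, not the actual ElasticsearchIO fix): each element is buffered together with the timestamp and window it arrived in, and the flush happens in @FinishBundle, whose context requires naming both explicitly.

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

// Hypothetical buffering DoFn; illustrative only.
class BufferingWriteFn extends DoFn<String, String> {

  // Holds an element together with the context it arrived in.
  private static class BufferedDoc {
    final String doc;
    final Instant timestamp;
    final BoundedWindow window;

    BufferedDoc(String doc, Instant timestamp, BoundedWindow window) {
      this.doc = doc;
      this.timestamp = timestamp;
      this.window = window;
    }
  }

  private transient List<BufferedDoc> buffer;

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    // Capture the element's own timestamp and window so the flush cannot
    // leak it into whatever window is current at flush time.
    buffer.add(new BufferedDoc(c.element(), c.timestamp(), window));
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // FinishBundleContext.output takes an explicit timestamp and window,
    // which is why flushing here is safe, while outputting stale-timestamped
    // elements from @ProcessElement trips the skew validation quoted above.
    for (BufferedDoc d : buffer) {
      c.output(d.doc, d.timestamp, d.window);
    }
    buffer.clear();
  }
}
{code}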
[jira] [Commented] (BEAM-14064) ElasticSearchIO#Write buffering and outputting across windows
[ https://issues.apache.org/jira/browse/BEAM-14064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503622#comment-17503622 ]

Etienne Chauchot commented on BEAM-14064:
-----------------------------------------

[~lcwik] thanks for opening this issue and [~egalpin] thanks for the fix! That being said, 2.37.0 is being deployed, so I guess it is too late: this change will go into 2.38, as 2.37 was cut before the change was merged. Adding the current ticket to the name of the PR/commits would have helped in merging it quicker (as the PR was submitted before the 2.37.0 cut) by making it obvious that it was linked to a (blocker) bug.

> ElasticSearchIO#Write buffering and outputting across windows
> --------------------------------------------------------------
>
>                 Key: BEAM-14064
>                 URL: https://issues.apache.org/jira/browse/BEAM-14064
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch
>    Affects Versions: 2.35.0, 2.36.0
>            Reporter: Luke Cwik
>            Assignee: Evan Galpin
>            Priority: P2
>
> Source: https://lists.apache.org/thread/mtwtno2o88lx3zl12jlz7o5w1lcgm2db
> Bug PR: https://github.com/apache/beam/pull/15381
>
> ElasticsearchIO is collecting results from elements in window X and then
> trying to output them in window Y when flushing the batch. This exposed a
> bug where buffered elements were being output as part of a different window
> than the window that produced them.
> This became visible because validation was added recently to ensure that
> when the pipeline is processing elements in window X, output with a
> timestamp must be valid for window X. Note that this validation only occurs
> in *@ProcessElement*, since output there is associated with the current
> window of the input element being processed.
> It is ok to do this in *@FinishBundle*, since there is no existing windowing
> context and, when you output, that element is assigned to an appropriate
> window.
> *Further Context*
> We’ve bisected it to being introduced in 2.35.0, and I’m reasonably certain
> it’s this PR: https://github.com/apache/beam/pull/15381
> Our scenario is pretty trivial: we read off Pubsub and write to Elastic in a
> streaming job. The config for the source and sink is, respectively,
> {noformat}
> pipeline.apply(
>     PubsubIO.readStrings().fromSubscription(subscription)
> ).apply(ParseJsons.of(OurObject::class.java))
>     .setCoder(KryoCoder.of())
> {noformat}
> and
> {noformat}
> ElasticsearchIO.write()
>     .withUseStatefulBatches(true)
>     .withMaxParallelRequestsPerWindow(1)
>     .withMaxBufferingDuration(Duration.standardSeconds(30))
>     // 5 bytes -> KiB -> MiB, so 5 MiB
>     .withMaxBatchSizeBytes(5L * 1024 * 1024)
>     // # of docs
>     .withMaxBatchSize(1000)
>     .withConnectionConfiguration(
>         ElasticsearchIO.ConnectionConfiguration.create(
>             arrayOf(host),
>             "fubar",
>             "_doc"
>         ).withConnectTimeout(5000)
>             .withSocketTimeout(3)
>     )
>     .withRetryConfiguration(
>         ElasticsearchIO.RetryConfiguration.create(
>             10,
>             // the duration is wall clock, against the connection and
>             // socket timeouts specified above. I.e., 10 x 30s is gonna be
>             // more than 3 minutes, so if we're getting 10 socket timeouts
>             // in a row, this would ignore the "10" part and terminate
>             // after 6. The idea is that in a mixed failure mode, you'd get
>             // different timeouts of different durations, and on average
>             // 10 x fails < 4m.
>             // That said, 4m is arbitrary, so adjust as and when needed.
>             Duration.standardMinutes(4)
>         )
>     )
>     .withIdFn { f: JsonNode -> f["id"].asText() }
>     .withIndexFn { f: JsonNode -> f["schema_name"].asText() }
>     .withIsDeleteFn { f: JsonNode -> f["_action"].asText("noop") == "delete" }
> {noformat}
> We recently tried upgrading 2.33 to 2.36 and immediately hit a bug in the
> consumer, due to alleged time skew, specifically
> {noformat}
> 2022-03-07 10:48:37.886 GMT Error message from worker:
> java.lang.IllegalArgumentException: Cannot output with timestamp
> 2022-03-07T10:43:38.640Z. Output timestamps must be no earlier than the
> timestamp of the current input (2022-03-07T10:43:43.562Z) minus the allowed
> skew (0 milliseconds) and no later than 294247-01-10T04:00:54.775Z. See the
> DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed
> skew.
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runn
[jira] [Updated] (BEAM-13956) Serialize/deserialize StsClient when serializing/deserializing StsAssumeRoleCredentialsProvider
[ https://issues.apache.org/jira/browse/BEAM-13956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Etienne Chauchot updated BEAM-13956:
------------------------------------
    Summary: Serialize/deserialize StsClient when serializing/deserializing StsAssumeRoleCredentialsProvider  (was: Serialize/deserialize used StsClient when serializing/deserializing StsAssumeRoleCredentialsProvider)

> Serialize/deserialize StsClient when serializing/deserializing
> StsAssumeRoleCredentialsProvider
> ---------------------------------------------------------------
>
>                 Key: BEAM-13956
>                 URL: https://issues.apache.org/jira/browse/BEAM-13956
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-aws
>            Reporter: Igor Maravić
>            Priority: P3
>              Labels: aws, aws-sdk-v2
>
> To use _StsAssumeRoleCredentialsProvider_ from an environment that doesn't
> have access to the AWS default credentials, one needs to provide a
> configured _StsClient_ to _StsAssumeRoleCredentialsProvider_.
> If we don't serialize and consequently deserialize the _StsClient_ that was
> provided to _StsAssumeRoleCredentialsProvider_, we're not going to be able
> to use _StsAssumeRoleCredentialsProvider_ from the Beam pipeline.
> The goal of this ticket is to introduce this functionality.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
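A rough sketch of one direction this ticket could take, assuming AWS SDK v2 (the wrapper class and its fields are invented, not Beam code): keep only the serializable role configuration, and rebuild the StsClient and the wrapped provider lazily on first use after deserialization. A real fix would also need to carry the StsClient's own configuration across serialization, which this sketch glosses over with StsClient.create().

{code:java}
import java.io.Serializable;
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;

// Hypothetical wrapper: carries only serializable role configuration.
class SerializableAssumeRoleProvider implements AwsCredentialsProvider, Serializable {
  private final String roleArn;
  private final String sessionName;
  private transient volatile StsAssumeRoleCredentialsProvider delegate;

  SerializableAssumeRoleProvider(String roleArn, String sessionName) {
    this.roleArn = roleArn;
    this.sessionName = sessionName;
  }

  @Override
  public AwsCredentials resolveCredentials() {
    if (delegate == null) {
      synchronized (this) {
        if (delegate == null) {
          // Rebuilt on the worker after deserialization. A real fix would
          // also restore the StsClient's configuration here.
          delegate =
              StsAssumeRoleCredentialsProvider.builder()
                  .stsClient(StsClient.create())
                  .refreshRequest(
                      AssumeRoleRequest.builder()
                          .roleArn(roleArn)
                          .roleSessionName(sessionName)
                          .build())
                  .build();
        }
      }
    }
    return delegate.resolveCredentials();
  }
}
{code}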
[jira] [Commented] (BEAM-10838) Add a withSSLContext builder method to ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485634#comment-17485634 ]

Etienne Chauchot commented on BEAM-10838:
-----------------------------------------

[~clandry94] SSL tests were deferred to ITests because they needed the shield plugin installed in an ES cluster. But if, with your change, shield is no longer needed, then I'd prefer to test SSL in a UTest.

> Add a withSSLContext builder method to ElasticsearchIO
> -------------------------------------------------------
>
>                 Key: BEAM-10838
>                 URL: https://issues.apache.org/jira/browse/BEAM-10838
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>            Reporter: Conor Landry
>            Assignee: Conor Landry
>            Priority: P2
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, the ElasticsearchIO transforms have only one way to securely
> read/write Elasticsearch: using the withKeystorePath builder method and
> providing the location of a keystore containing a client key in jks format.
> This is a bit limiting, especially for Elasticsearch users not depending on
> shield to secure their clusters. I'd like to propose the addition of the
> builder method withSSLContext(SSLContext sslContext), which delegates to
> `httpClientBuilder.setSSLContext`.
> If this is (y), I can start working on it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
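For reference, the delegation the proposal describes would plausibly look like the following against the Elasticsearch low-level RestClient (a sketch, not the eventual Beam API; the helper name is invented): setSSLContext lives on the Apache HttpAsyncClientBuilder that setHttpClientConfigCallback exposes.

{code:java}
import javax.net.ssl.SSLContext;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

class SslContextExample {
  // Hypothetical helper showing where the delegation would land.
  static RestClientBuilder withSslContext(String host, int port, SSLContext sslContext) {
    return RestClient.builder(new HttpHost(host, port, "https"))
        .setHttpClientConfigCallback(
            httpClientBuilder -> httpClientBuilder.setSSLContext(sslContext));
  }
}
{code}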
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482538#comment-17482538 ]

Etienne Chauchot commented on BEAM-13137:
-----------------------------------------

Ticket [BEAM-5172|https://issues.apache.org/jira/browse/BEAM-5172] closed.

> make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
> deterministic
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13137
>                 URL: https://issues.apache.org/jira/browse/BEAM-13137
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Assignee: Evan Galpin
>            Priority: P2
>             Fix For: 2.37.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Index size estimation is statistical in ES and varies. But it is the basis
> for splitting, so it needs to be more deterministic: the variance causes
> flakiness in the UTests _testSplit_ and _testSizes_, and may entail
> sub-optimal splitting in production in some cases.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Updated] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Etienne Chauchot updated BEAM-5172:
-----------------------------------
    Fix Version/s: 2.37.0
       Resolution: Fixed
           Status: Resolved  (was: Open)

[~egalpin] I agree, closing this ticket.

> org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
> ------------------------------------------------------------------
>
>                 Key: BEAM-5172
>                 URL: https://issues.apache.org/jira/browse/BEAM-5172
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch, test-failures
>    Affects Versions: 2.31.0
>            Reporter: Valentyn Tymofieiev
>            Priority: P1
>              Labels: flake, sickbay
>             Fix For: 2.37.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In a recent PostCommit build,
> https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/
> failed with:
> Error Message
> java.lang.AssertionError: Count/Flatten.PCollections.out:
> Expected: <400L>
> but: was <470L>
> Stacktrace
> java.lang.AssertionError: Count/Flatten.PCollections.out:
> Expected: <400L>
> but: was <470L>
> at org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168)
> at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413)
> at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404)
> at org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124)
> at org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
> at org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
> at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
> at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
> at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
> at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.j
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482316#comment-17482316 ]

Etienne Chauchot commented on BEAM-13137:
-----------------------------------------

[~egalpin] if you have taken a look at all the ES tests for flakiness, and there was none during the past day as well, I think we can close the ticket.

> make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
> deterministic
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13137
>                 URL: https://issues.apache.org/jira/browse/BEAM-13137
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Assignee: Evan Galpin
>            Priority: P2
>             Fix For: 2.37.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Index size estimation is statistical in ES and varies. But it is the basis
> for splitting, so it needs to be more deterministic: the variance causes
> flakiness in the UTests _testSplit_ and _testSizes_, and may entail
> sub-optimal splitting in production in some cases.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-13738) Renable ignored test in SqsUnboundedReaderTest
[ https://issues.apache.org/jira/browse/BEAM-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481744#comment-17481744 ]

Etienne Chauchot commented on BEAM-13738:
-----------------------------------------

Thanks, I assigned it to you to keep track, as you'll be pinged on the future Elasticmq fix.

> Renable ignored test in SqsUnboundedReaderTest
> -----------------------------------------------
>
>                 Key: BEAM-13738
>                 URL: https://issues.apache.org/jira/browse/BEAM-13738
>             Project: Beam
>          Issue Type: Task
>          Components: io-java-aws
>            Reporter: Moritz Mack
>            Assignee: Moritz Mack
>            Priority: P3
>              Labels: SQS, aws-sdk-v2, testing
>
> Elasticmq is used for testing SQS; unfortunately, a bug prevents testing
> batch extension of message visibility if the request contains an invalid
> receipt handle:
> [https://github.com/softwaremill/elasticmq/issues/632]
> {quote}A ChangeMessageVisibilityBatchRequest that contains an entry with an
> invalid receipt handle fails entirely with status code 400
> (InvalidParameterValue; see the SQS docs), rather than returning partial
> success containing a BatchResultErrorEntry with code ReceiptHandleIsInvalid
> for that particular entry. DeleteMessageBatchRequest behaves as expected.
> {quote}
> Re-enable *SqsUnboundedReaderTest.testExtendDeletedMessage* once fixed.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
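For context on what the ignored test exercises, a sketch against the AWS SDK v2 SQS client (queue URL and receipt handles are placeholders): real SQS reports a bad receipt handle as a per-entry failure on the response, while Elasticmq currently fails the whole request with a 400.

{code:java}
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.BatchResultErrorEntry;
import software.amazon.awssdk.services.sqs.model.ChangeMessageVisibilityBatchRequest;
import software.amazon.awssdk.services.sqs.model.ChangeMessageVisibilityBatchRequestEntry;
import software.amazon.awssdk.services.sqs.model.ChangeMessageVisibilityBatchResponse;

class VisibilityBatchExample {
  static void extendVisibility(
      SqsClient sqs, String queueUrl, String goodHandle, String staleHandle) {
    ChangeMessageVisibilityBatchResponse response =
        sqs.changeMessageVisibilityBatch(
            ChangeMessageVisibilityBatchRequest.builder()
                .queueUrl(queueUrl)
                .entries(
                    ChangeMessageVisibilityBatchRequestEntry.builder()
                        .id("a").receiptHandle(goodHandle).visibilityTimeout(60).build(),
                    ChangeMessageVisibilityBatchRequestEntry.builder()
                        .id("b").receiptHandle(staleHandle).visibilityTimeout(60).build())
                .build());
    // Against real SQS, entry "b" surfaces here (code ReceiptHandleIsInvalid)
    // instead of aborting the whole batch.
    for (BatchResultErrorEntry failed : response.failed()) {
      System.out.println(failed.id() + ": " + failed.code());
    }
  }
}
{code}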
[jira] [Commented] (BEAM-13738) Renable ignored test in SqsUnboundedReaderTest
[ https://issues.apache.org/jira/browse/BEAM-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481743#comment-17481743 ]

Etienne Chauchot commented on BEAM-13738:
-----------------------------------------

Good catch, Moritz!

> Renable ignored test in SqsUnboundedReaderTest
> -----------------------------------------------
>
>                 Key: BEAM-13738
>                 URL: https://issues.apache.org/jira/browse/BEAM-13738
>             Project: Beam
>          Issue Type: Task
>          Components: io-java-aws
>            Reporter: Moritz Mack
>            Assignee: Moritz Mack
>            Priority: P3
>              Labels: SQS, aws-sdk-v2, testing
>
> Elasticmq is used for testing SQS; unfortunately, a bug prevents testing
> batch extension of message visibility if the request contains an invalid
> receipt handle:
> [https://github.com/softwaremill/elasticmq/issues/632]
> {quote}A ChangeMessageVisibilityBatchRequest that contains an entry with an
> invalid receipt handle fails entirely with status code 400
> (InvalidParameterValue; see the SQS docs), rather than returning partial
> success containing a BatchResultErrorEntry with code ReceiptHandleIsInvalid
> for that particular entry. DeleteMessageBatchRequest behaves as expected.
> {quote}
> Re-enable *SqsUnboundedReaderTest.testExtendDeletedMessage* once fixed.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Assigned] (BEAM-13738) Renable ignored test in SqsUnboundedReaderTest
[ https://issues.apache.org/jira/browse/BEAM-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Etienne Chauchot reassigned BEAM-13738:
---------------------------------------
    Assignee: Moritz Mack

> Renable ignored test in SqsUnboundedReaderTest
> -----------------------------------------------
>
>                 Key: BEAM-13738
>                 URL: https://issues.apache.org/jira/browse/BEAM-13738
>             Project: Beam
>          Issue Type: Task
>          Components: io-java-aws
>            Reporter: Moritz Mack
>            Assignee: Moritz Mack
>            Priority: P3
>              Labels: SQS, aws-sdk-v2, testing
>
> Elasticmq is used for testing SQS; unfortunately, a bug prevents testing
> batch extension of message visibility if the request contains an invalid
> receipt handle:
> [https://github.com/softwaremill/elasticmq/issues/632]
> {quote}A ChangeMessageVisibilityBatchRequest that contains an entry with an
> invalid receipt handle fails entirely with status code 400
> (InvalidParameterValue; see the SQS docs), rather than returning partial
> success containing a BatchResultErrorEntry with code ReceiptHandleIsInvalid
> for that particular entry. DeleteMessageBatchRequest behaves as expected.
> {quote}
> Re-enable *SqsUnboundedReaderTest.testExtendDeletedMessage* once fixed.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-13136) Clean leftovers of old ElasticSearchIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481673#comment-17481673 ]

Etienne Chauchot commented on BEAM-13136:
-----------------------------------------

[~egalpin] I just added another leftover, found yesterday. See the updated description. Feel free to add me as a reviewer on the ongoing PR.

> Clean leftovers of old ElasticSearchIO versions / test mechanism
> -----------------------------------------------------------------
>
>                 Key: BEAM-13136
>                 URL: https://issues.apache.org/jira/browse/BEAM-13136
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Priority: P2
>
> While auditing flakiness issues, I saw leftovers in the test code and
> production code of the IO. This list is not comprehensive but spots some
> places:
> * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split
> the shards (missing API), so the splits are fixed to the number of shards
> (5 by default in the test). As ES2 support was removed, this special case
> in the test can be removed
> * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the
> jarHell test in the ES test framework. This framework is no longer used
> (testContainers), so we could remove the hack in all places
> * if (backendVersion == 2) in prod code + remove _shardPreference_ from
> _BoundedElasticsearchSource_
> * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test
> framework that is no longer used
> * remove module elasticsearch-tests-2
> * please audit for other places that could need cleaning.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Updated] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Etienne Chauchot updated BEAM-13137:
------------------------------------
    Fix Version/s: 2.37.0
       Resolution: Fixed
           Status: Resolved  (was: Open)

> make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
> deterministic
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13137
>                 URL: https://issues.apache.org/jira/browse/BEAM-13137
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Assignee: Evan Galpin
>            Priority: P2
>             Fix For: 2.37.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Index size estimation is statistical in ES and varies. But it is the basis
> for splitting, so it needs to be more deterministic: the variance causes
> flakiness in the UTests _testSplit_ and _testSizes_, and may entail
> sub-optimal splitting in production in some cases.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481647#comment-17481647 ]

Etienne Chauchot commented on BEAM-13137:
-----------------------------------------

[~egalpin] thanks for the update. Knowing the activity on the Beam project, I guess one week is enough to monitor flakiness; closing the ticket.

> make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
> deterministic
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13137
>                 URL: https://issues.apache.org/jira/browse/BEAM-13137
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Assignee: Evan Galpin
>            Priority: P2
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Index size estimation is statistical in ES and varies. But it is the basis
> for splitting, so it needs to be more deterministic: the variance causes
> flakiness in the UTests _testSplit_ and _testSizes_, and may entail
> sub-optimal splitting in production in some cases.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-10945) ElasticsearchIO performs 0 division on DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480935#comment-17480935 ]

Etienne Chauchot commented on BEAM-10945:
-----------------------------------------

Well, an important parameter that we have not discussed here is _desiredBundleSizeBytes_, which the runner sends to the source so that a source's data is not bigger than the engine memory available for it. See in the Spark SS runner: [https://github.com/apache/beam/blob/737be255d874af17f3cbf40f7c03fc66b3896b13/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/DatasetSourceBatch.java#L100]

If nbBundles ends up being too high on the source side, it means that either the index size is too high (unlikely) or the _desiredBundleSizeBytes_ in the runner is too low, meaning that the runner's _numPartitions_ (parallelism configured in the runner) is too high. We should use a debugger to understand why, and then see if everything is legitimate and move accordingly.

> ElasticsearchIO performs 0 division on DirectRunner
> ----------------------------------------------------
>
>                 Key: BEAM-10945
>                 URL: https://issues.apache.org/jira/browse/BEAM-10945
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch, runner-direct
>    Affects Versions: 2.23.0
>         Environment: Beam 2.23
> Java 1.8.0_265
> Ubuntu 16.04
> Elastic version of cluster 7.9.1, cross cluster setup
> Parallelism of direct runner 8
>            Reporter: Milan Nikl
>            Priority: P3
>
> h1. Environment configuration
> In my company we use [Elasticsearch cross cluster|https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html#ccs-supported-apis] setup for search. Cluster version is 7.9.1.
> I intended to use ElasticsearchIO for reading application logs and subsequently producing some aggregated data.
> h1. Problem description
> # In a cross cluster ES setup, there is no {{//_stats}} API available, so it is not possible to compute [ElasticsearchIO#getEstimatedSizeBytes|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L692] properly.
> # {{statsJson}} returned by the cluster looks like this:
> {quote}Unknown macro: \{ "_shards" }
> ,
> "_all" :
> Unknown macro: \{ "primaries" }
> ,
> "total" : \{ }
> },
> "indices" : \{ }
> }
> {quote}
> # That means that the {{totalCount}} value cannot be parsed from the json and is thus set to {{0}}.
> # Which means that the {{estimatedByteSize}} value [is set to 1|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L707] (which itself is a workaround for a similar issue).
> # {{ElasticsearchIO#getEstimatedSizeBytes}} is used in [BoundedReadEvaluatorFactory#getInitialInputs|https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/BoundedReadEvaluatorFactory.java#L212], which does not check the value and performs division of two {{long}} values, which of course results in {{0}} for any {{targetParallelism > 1}}.
> # Then [ElasticsearchIO#split|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L665] is called with {{indexSize = 1}} and {{desiredBundleSizeBytes = 1}}, which sets the {{nbBundlesFloat}} value to infinity.
> # Even though the number of bundles is ceiled at {{1024}}, reading from 1024 BoundedElasticsearchSources concurrently makes the ElasticsearchIO virtually impossible to use on the direct runner.
> h1. Resolution suggestion
> I still haven't tested reading from ElasticsearchIO on a proper runner (we use flink 1.10.2), so I can neither confirm nor deny its functionality on our elastic setup. At the moment I'm just suggesting a few checks of input values so the zero division and unnecessary parallelism problems are eliminated on the direct runner.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
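A sketch of the kind of defensive check the reporter is suggesting (the method and names mirror the description above, not Beam's actual code): clamp both inputs before dividing, so a degenerate size estimate or a desired bundle size driven to zero by integer division cannot blow up into the 1024-source ceiling.

{code:java}
class BundleMath {
  // Hypothetical guard: with estimatedIndexBytes = 1 and a desired bundle
  // size of 0 or 1, this yields a single bundle instead of
  // "infinity, ceiled to 1024".
  static int safeBundleCount(long estimatedIndexBytes, long desiredBundleSizeBytes) {
    long size = Math.max(estimatedIndexBytes, 1L);
    long desired = Math.max(desiredBundleSizeBytes, 1L);
    long bundles = (size + desired - 1) / desired; // ceiling division
    return (int) Math.min(bundles, 1024L); // ES slice API cap
  }
}
{code}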
[jira] [Updated] (BEAM-13136) Clean leftovers of old ElasticSearchIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Etienne Chauchot updated BEAM-13136:
------------------------------------
    Description:
While auditing flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but spots some places:
 * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API), so the splits are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed
 * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (testContainers), so we could remove the hack in all places
 * if (backendVersion == 2) in prod code + remove _shardPreference_ from _BoundedElasticsearchSource_
 * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework that is no longer used
 * remove module elasticsearch-tests-2
 * please audit for other places that could need cleaning.

    was:
While auditing flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but spots some places:
 * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API), so the splits are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed
 * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (testContainers), so we could remove the hack in all places
 * if (backendVersion == 2) in prod code
 * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework that is no longer used
 * remove module elasticsearch-tests-2
 * please audit for other places that could need cleaning.

> Clean leftovers of old ElasticSearchIO versions / test mechanism
> -----------------------------------------------------------------
>
>                 Key: BEAM-13136
>                 URL: https://issues.apache.org/jira/browse/BEAM-13136
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Priority: P2
>
> While auditing flakiness issues, I saw leftovers in the test code and
> production code of the IO. This list is not comprehensive but spots some
> places:
> * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split
> the shards (missing API), so the splits are fixed to the number of shards
> (5 by default in the test). As ES2 support was removed, this special case
> in the test can be removed
> * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the
> jarHell test in the ES test framework. This framework is no longer used
> (testContainers), so we could remove the hack in all places
> * if (backendVersion == 2) in prod code + remove _shardPreference_ from
> _BoundedElasticsearchSource_
> * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test
> framework that is no longer used
> * remove module elasticsearch-tests-2
> * please audit for other places that could need cleaning.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-10838) Add a withSSLContext builder method to ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479263#comment-17479263 ]

Etienne Chauchot commented on BEAM-10838:
-----------------------------------------

[~kenn] can you give [~clandry94] the right privileges for assignment, to avoid parallel work?

> Add a withSSLContext builder method to ElasticsearchIO
> -------------------------------------------------------
>
>                 Key: BEAM-10838
>                 URL: https://issues.apache.org/jira/browse/BEAM-10838
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>            Reporter: Conor Landry
>            Priority: P2
>
> Currently, the ElasticsearchIO transforms have only one way to securely
> read/write Elasticsearch: using the withKeystorePath builder method and
> providing the location of a keystore containing a client key in jks format.
> This is a bit limiting, especially for Elasticsearch users not depending on
> shield to secure their clusters. I'd like to propose the addition of the
> builder method withSSLContext(SSLContext sslContext), which delegates to
> `httpClientBuilder.setSSLContext`.
> If this is (y), I can start working on it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-10838) Add a withSSLContext builder method to ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479261#comment-17479261 ]

Etienne Chauchot commented on BEAM-10838:
-----------------------------------------

[~clandry94] or lazy init (init just before use, after deserialization)?

> Add a withSSLContext builder method to ElasticsearchIO
> -------------------------------------------------------
>
>                 Key: BEAM-10838
>                 URL: https://issues.apache.org/jira/browse/BEAM-10838
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>            Reporter: Conor Landry
>            Priority: P2
>
> Currently, the ElasticsearchIO transforms have only one way to securely
> read/write Elasticsearch: using the withKeystorePath builder method and
> providing the location of a keystore containing a client key in jks format.
> This is a bit limiting, especially for Elasticsearch users not depending on
> shield to secure their clusters. I'd like to propose the addition of the
> builder method withSSLContext(SSLContext sslContext), which delegates to
> `httpClientBuilder.setSSLContext`.
> If this is (y), I can start working on it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
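The "lazy init" idea floated above, sketched (the holder class and keystore handling are illustrative assumptions, since SSLContext itself is not Serializable): serialize only the keystore coordinates, mark the SSLContext transient, and build it on first use after deserialization.

{code:java}
import java.io.InputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import org.apache.http.ssl.SSLContexts;

// Hypothetical holder: only the path and password cross serialization.
class LazySslContextHolder implements Serializable {
  private final String keystorePath;
  private final String keystorePassword;
  private transient SSLContext sslContext;

  LazySslContextHolder(String keystorePath, String keystorePassword) {
    this.keystorePath = keystorePath;
    this.keystorePassword = keystorePassword;
  }

  SSLContext get() throws Exception {
    if (sslContext == null) {
      // Built on first use, i.e. after deserialization on the worker.
      KeyStore keyStore = KeyStore.getInstance("JKS");
      try (InputStream is = Files.newInputStream(Paths.get(keystorePath))) {
        keyStore.load(is, keystorePassword.toCharArray());
      }
      sslContext = SSLContexts.custom().loadTrustMaterial(keyStore, null).build();
    }
    return sslContext;
  }
}
{code}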
[jira] [Comment Edited] (BEAM-10838) Add a withSSLContext builder method to ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478389#comment-17478389 ]

Etienne Chauchot edited comment on BEAM-10838 at 1/20/22, 10:52 AM:
--------------------------------------------------------------------

[~kenn] [~clandry94] +1 on this. [~clandry94] thanks for offering your help! I'll assign the ticket to you

was (Author: echauchot):
[~kenn] [~clandry94] +1 on this. [~clandry94] thanks for offering your help! I assigned the ticket to you

> Add a withSSLContext builder method to ElasticsearchIO
> -------------------------------------------------------
>
>                 Key: BEAM-10838
>                 URL: https://issues.apache.org/jira/browse/BEAM-10838
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>            Reporter: Conor Landry
>            Priority: P2
>
> Currently, the ElasticsearchIO transforms have only one way to securely
> read/write Elasticsearch: using the withKeystorePath builder method and
> providing the location of a keystore containing a client key in jks format.
> This is a bit limiting, especially for Elasticsearch users not depending on
> shield to secure their clusters. I'd like to propose the addition of the
> builder method withSSLContext(SSLContext sslContext), which delegates to
> `httpClientBuilder.setSSLContext`.
> If this is (y), I can start working on it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-10945) ElasticsearchIO performs 0 division on DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479258#comment-17479258 ]

Etienne Chauchot commented on BEAM-10945:
-----------------------------------------

1024 is the ES slice API limitation.

> ElasticsearchIO performs 0 division on DirectRunner
> ----------------------------------------------------
>
>                 Key: BEAM-10945
>                 URL: https://issues.apache.org/jira/browse/BEAM-10945
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch, runner-direct
>    Affects Versions: 2.23.0
>         Environment: Beam 2.23
> Java 1.8.0_265
> Ubuntu 16.04
> Elastic version of cluster 7.9.1, cross cluster setup
> Parallelism of direct runner 8
>            Reporter: Milan Nikl
>            Priority: P3
>
> h1. Environment configuration
> In my company we use [Elasticsearch cross cluster|https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html#ccs-supported-apis] setup for search. Cluster version is 7.9.1.
> I intended to use ElasticsearchIO for reading application logs and subsequently producing some aggregated data.
> h1. Problem description
> # In a cross cluster ES setup, there is no {{//_stats}} API available, so it is not possible to compute [ElasticsearchIO#getEstimatedSizeBytes|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L692] properly.
> # {{statsJson}} returned by the cluster looks like this:
> {quote}Unknown macro: \{ "_shards" }
> ,
> "_all" :
> Unknown macro: \{ "primaries" }
> ,
> "total" : \{ }
> },
> "indices" : \{ }
> }
> {quote}
> # That means that the {{totalCount}} value cannot be parsed from the json and is thus set to {{0}}.
> # Which means that the {{estimatedByteSize}} value [is set to 1|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L707] (which itself is a workaround for a similar issue).
> # {{ElasticsearchIO#getEstimatedSizeBytes}} is used in [BoundedReadEvaluatorFactory#getInitialInputs|https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/BoundedReadEvaluatorFactory.java#L212], which does not check the value and performs division of two {{long}} values, which of course results in {{0}} for any {{targetParallelism > 1}}.
> # Then [ElasticsearchIO#split|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L665] is called with {{indexSize = 1}} and {{desiredBundleSizeBytes = 1}}, which sets the {{nbBundlesFloat}} value to infinity.
> # Even though the number of bundles is ceiled at {{1024}}, reading from 1024 BoundedElasticsearchSources concurrently makes the ElasticsearchIO virtually impossible to use on the direct runner.
> h1. Resolution suggestion
> I still haven't tested reading from ElasticsearchIO on a proper runner (we use flink 1.10.2), so I can neither confirm nor deny its functionality on our elastic setup. At the moment I'm just suggesting a few checks of input values so the zero division and unnecessary parallelism problems are eliminated on the direct runner.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
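For readers wondering where the 1024 comes from: Elasticsearch's sliced scroll rejects a slice {{max}} above the index setting index.max_slices_per_scroll, which defaults to 1024. A sketch with the low-level REST client (the index name and scroll window are placeholders):

{code:java}
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

class SlicedScrollExample {
  // Each parallel reader asks for one slice id out of totalSlices;
  // totalSlices must stay at or below index.max_slices_per_scroll (1024).
  static Response readSlice(RestClient client, int sliceId, int totalSlices)
      throws Exception {
    Request request = new Request("GET", "/my-index/_search");
    request.addParameter("scroll", "1m");
    request.setJsonEntity(
        "{\"slice\": {\"id\": " + sliceId + ", \"max\": " + totalSlices + "},"
            + " \"query\": {\"match_all\": {}}}");
    return client.performRequest(request);
  }
}
{code}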
[jira] [Commented] (BEAM-10945) ElasticsearchIO performs 0 division on DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479257#comment-17479257 ]

Etienne Chauchot commented on BEAM-10945:
-----------------------------------------

[~egalpin] allowing the user to specify the number of bundles would be an implementation leak and a violation of Beam portability. What happens is that it is the native technology (Spark, Flink, Dataflow ...) that chooses the parallelism, based on the one configured for the cluster. For example, for a single cluster with 4 cores, an operator would configure Spark parallelism to 4. When the Beam source is translated to a Spark source, this configured parallelism is read through the Spark API by Beam, and the corresponding number of splits (4) is created. "Split" is the source-side name for bundles. If no SDF is used, and no reparallelise transform either, this parallelism of 4 will be the one used throughout the whole pipeline for parallel operations such as map.

> ElasticsearchIO performs 0 division on DirectRunner
> ----------------------------------------------------
>
>                 Key: BEAM-10945
>                 URL: https://issues.apache.org/jira/browse/BEAM-10945
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch, runner-direct
>    Affects Versions: 2.23.0
>         Environment: Beam 2.23
> Java 1.8.0_265
> Ubuntu 16.04
> Elastic version of cluster 7.9.1, cross cluster setup
> Parallelism of direct runner 8
>            Reporter: Milan Nikl
>            Priority: P3
>
> h1. Environment configuration
> In my company we use [Elasticsearch cross cluster|https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html#ccs-supported-apis] setup for search. Cluster version is 7.9.1.
> I intended to use ElasticsearchIO for reading application logs and subsequently producing some aggregated data.
> h1. Problem description
> # In a cross cluster ES setup, there is no {{//_stats}} API available, so it is not possible to compute [ElasticsearchIO#getEstimatedSizeBytes|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L692] properly.
> # {{statsJson}} returned by the cluster looks like this:
> {quote}Unknown macro: \{ "_shards" }
> ,
> "_all" :
> Unknown macro: \{ "primaries" }
> ,
> "total" : \{ }
> },
> "indices" : \{ }
> }
> {quote}
> # That means that the {{totalCount}} value cannot be parsed from the json and is thus set to {{0}}.
> # Which means that the {{estimatedByteSize}} value [is set to 1|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L707] (which itself is a workaround for a similar issue).
> # {{ElasticsearchIO#getEstimatedSizeBytes}} is used in [BoundedReadEvaluatorFactory#getInitialInputs|https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/BoundedReadEvaluatorFactory.java#L212], which does not check the value and performs division of two {{long}} values, which of course results in {{0}} for any {{targetParallelism > 1}}.
> # Then [ElasticsearchIO#split|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L665] is called with {{indexSize = 1}} and {{desiredBundleSizeBytes = 1}}, which sets the {{nbBundlesFloat}} value to infinity.
> # Even though the number of bundles is ceiled at {{1024}}, reading from 1024 BoundedElasticsearchSources concurrently makes the ElasticsearchIO virtually impossible to use on the direct runner.
> h1. Resolution suggestion
> I still haven't tested reading from ElasticsearchIO on a proper runner (we use flink 1.10.2), so I can neither confirm nor deny its functionality on our elastic setup. At the moment I'm just suggesting a few checks of input values so the zero division and unnecessary parallelism problems are eliminated on the direct runner.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
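The relationship described in the comment above can be sketched against Beam's actual BoundedSource API (the wrapper method itself is hypothetical): the runner derives a per-split byte budget from its configured parallelism, and the source then splits itself against that budget.

{code:java}
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

class RunnerSplitSketch {
  // e.g. a cluster configured with parallelism 4 ends up asking the source
  // for roughly 4 splits (bundles).
  static <T> List<? extends BoundedSource<T>> splitForRunner(
      BoundedSource<T> source, int configuredParallelism, PipelineOptions options)
      throws Exception {
    long estimatedBytes = source.getEstimatedSizeBytes(options);
    long desiredBundleSizeBytes = Math.max(1L, estimatedBytes / configuredParallelism);
    return source.split(desiredBundleSizeBytes, options);
  }
}
{code}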
[jira] [Updated] (BEAM-13503) BulkIO public constructor: Missing required property: throwWriteErrors
[ https://issues.apache.org/jira/browse/BEAM-13503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Etienne Chauchot updated BEAM-13503:
------------------------------------
    Fix Version/s: 2.37.0
       Resolution: Fixed
           Status: Resolved  (was: Open)

> BulkIO public constructor: Missing required property: throwWriteErrors
> -----------------------------------------------------------------------
>
>                 Key: BEAM-13503
>                 URL: https://issues.apache.org/jira/browse/BEAM-13503
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch
>    Affects Versions: 2.35.0
>            Reporter: Kyle Hersey
>            Assignee: Kyle Hersey
>            Priority: P1
>              Labels: elasticsearch
>             Fix For: 2.37.0
>
>   Original Estimate: 1h
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Using the latest 2.35.0 RC to debug an issue while writing to Elasticsearch,
> but ran into a build issue:
> {code:java}
> java.lang.IllegalStateException: Missing required properties: throwWriteErrors {code}
> Tracing this error through: throwWriteErrors is not Nullable and is
> therefore required in the AutoValue_ElasticsearchIO_BulkIO$Builder.
> Should be an easy fix, just need to set a default value for it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
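A framework-free sketch of the fix shape the reporter suggests (all names are illustrative; the real class is an AutoValue builder inside ElasticsearchIO, and the actual default may differ): give the non-nullable flag a default when the builder is created, so build() never trips the missing-property check.

{code:java}
class BulkOptions {
  final boolean throwWriteErrors;

  private BulkOptions(boolean throwWriteErrors) {
    this.throwWriteErrors = throwWriteErrors;
  }

  static Builder builder() {
    // Defaulting up front mirrors setting the property in the IO's public
    // entry point, so callers who never touch the flag still get a value.
    return new Builder().setThrowWriteErrors(true);
  }

  static class Builder {
    private Boolean throwWriteErrors;

    Builder setThrowWriteErrors(boolean value) {
      this.throwWriteErrors = value;
      return this;
    }

    BulkOptions build() {
      if (throwWriteErrors == null) {
        // The AutoValue-style failure from the report.
        throw new IllegalStateException("Missing required properties: throwWriteErrors");
      }
      return new BulkOptions(throwWriteErrors);
    }
  }
}
{code}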
[jira] [Commented] (BEAM-10723) SSL authentication key set to trustMaterial instead of keyMaterial
[ https://issues.apache.org/jira/browse/BEAM-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478427#comment-17478427 ]

Etienne Chauchot commented on BEAM-10723:
-----------------------------------------

[~kenn] I took a look at the PR; it still lacks tests. I pinged [~marek.simunek] in the PR.

> SSL authentication key set to trustMaterial instead of keyMaterial
> -------------------------------------------------------------------
>
>                 Key: BEAM-10723
>                 URL: https://issues.apache.org/jira/browse/BEAM-10723
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch
>    Affects Versions: 2.19.0
>            Reporter: Marek Simunek
>            Assignee: Marek Simunek
>            Priority: P2
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> If I set ElasticsearchIO.ConnectionConfiguration#withKeystorePath, the
> keystore is set to trustMaterial, which I think is wrong, because this
> keystore is supposed to be a truststore for certificates.
> So if I use keyStoreKey instead of username and pass:
> {code:java}
> ElasticsearchIO.write()
>     .withConnectionConfiguration(
>         ElasticsearchIO.ConnectionConfiguration
>             .create(config.addresses().toArray(new String[0]), config.index(), config.type())
>             .withKeystorePath(config.keystorePath())
>             .withKeystorePassword("somepassword")
>             .withTrustSelfSignedCerts(true));
> {code}
> I cannot authenticate. I got
> {code:java}
> Caused by: javax.net.ssl.SSLException: Received fatal alert: bad_certificate
> {code}
> because the authentication key is set to trustMaterial instead of keyMaterial
> {code:java}
> SSLContexts.custom().loadTrustMaterial(keyStore, trustStrategy).build();
> {code}
> via [code|https://github.com/apache/beam/blob/release-2.19.0/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L439]
> I am working on the fix.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
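The distinction the reporter is pointing at, sketched with Apache HttpClient's SSLContexts (keystore loading omitted; the method is illustrative, not the Beam fix itself): a client keystore belongs in loadKeyMaterial, while loadTrustMaterial is for the truststore of accepted server certificates.

{code:java}
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
import org.apache.http.ssl.SSLContexts;

class ClientAuthSslExample {
  static SSLContext clientAuthContext(
      KeyStore clientKeyStore, char[] keyPassword, KeyStore trustStore) throws Exception {
    return SSLContexts.custom()
        // Client key for authenticating *us* to the cluster.
        .loadKeyMaterial(clientKeyStore, keyPassword)
        // Truststore deciding which server certificates *we* accept.
        .loadTrustMaterial(trustStore, new TrustSelfSignedStrategy())
        .build();
  }
}
{code}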
[jira] [Commented] (BEAM-10945) ElasticsearchIO performs 0 division on DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478416#comment-17478416 ]

Etienne Chauchot commented on BEAM-10945:
-----------------------------------------

[~kenn] [~DraCzech] indeed, this setup was never tested because it did not exist at the time the IO was made. That being said, we must protect against its particularity of not being able to estimate the size of an index (I guess because the index is spread across several clusters with their own stats). [~kenn] I bet it is still happening, as there has been no change in the size estimation in production code for ages. [~DraCzech] if you'd like to contribute, you can submit whatever is needed to implement these protections. If you need to submit to both the DR and ESIO, it needs to be 2 separate PRs with links to each other. I could then review the ES part with [~egalpin] if needed. You can test on Flink, but the DR is the reference runner for tests, so it will need to be changed if it lacks defensive code. [~kenn] can you give [~DraCzech] the contributor role to assign the ticket?

> ElasticsearchIO performs 0 division on DirectRunner
> ----------------------------------------------------
>
>                 Key: BEAM-10945
>                 URL: https://issues.apache.org/jira/browse/BEAM-10945
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-elasticsearch, runner-direct
>    Affects Versions: 2.23.0
>         Environment: Beam 2.23
> Java 1.8.0_265
> Ubuntu 16.04
> Elastic version of cluster 7.9.1, cross cluster setup
> Parallelism of direct runner 8
>            Reporter: Milan Nikl
>            Priority: P3
>
> h1. Environment configuration
> In my company we use [Elasticsearch cross cluster|https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html#ccs-supported-apis] setup for search. Cluster version is 7.9.1.
> I intended to use ElasticsearchIO for reading application logs and subsequently producing some aggregated data.
> h1. Problem description
> # In a cross cluster ES setup, there is no {{//_stats}} API available, so it is not possible to compute [ElasticsearchIO#getEstimatedSizeBytes|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L692] properly.
> # {{statsJson}} returned by the cluster looks like this:
> {quote}Unknown macro: \{ "_shards" }
> ,
> "_all" :
> Unknown macro: \{ "primaries" }
> ,
> "total" : \{ }
> },
> "indices" : \{ }
> }
> {quote}
> # That means that the {{totalCount}} value cannot be parsed from the json and is thus set to {{0}}.
> # Which means that the {{estimatedByteSize}} value [is set to 1|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L707] (which itself is a workaround for a similar issue).
> # {{ElasticsearchIO#getEstimatedSizeBytes}} is used in [BoundedReadEvaluatorFactory#getInitialInputs|https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/BoundedReadEvaluatorFactory.java#L212], which does not check the value and performs division of two {{long}} values, which of course results in {{0}} for any {{targetParallelism > 1}}.
> # Then [ElasticsearchIO#split|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L665] is called with {{indexSize = 1}} and {{desiredBundleSizeBytes = 1}}, which sets the {{nbBundlesFloat}} value to infinity.
> # Even though the number of bundles is ceiled at {{1024}}, reading from 1024 BoundedElasticsearchSources concurrently makes the ElasticsearchIO virtually impossible to use on the direct runner.
> h1. Resolution suggestion
> I still haven't tested reading from ElasticsearchIO on a proper runner (we use flink 1.10.2), so I can neither confirm nor deny its functionality on our elastic setup. At the moment I'm just suggesting a few checks of input values so the zero division and unnecessary parallelism problems are eliminated on the direct runner.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-10838) Add a withSSLContext builder method to ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478403#comment-17478403 ]

Etienne Chauchot commented on BEAM-10838:
-----------------------------------------

[~kenn] can you give Conor the contributor role so that he can be assigned the ticket?

> Add a withSSLContext builder method to ElasticsearchIO
> -------------------------------------------------------
>
>                 Key: BEAM-10838
>                 URL: https://issues.apache.org/jira/browse/BEAM-10838
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>            Reporter: Conor Landry
>            Priority: P2
>
> Currently, the ElasticsearchIO transforms have only one way to securely
> read/write Elasticsearch: using the withKeystorePath builder method and
> providing the location of a keystore containing a client key in jks format.
> This is a bit limiting, especially for Elasticsearch users not depending on
> shield to secure their clusters. I'd like to propose the addition of the
> builder method withSSLContext(SSLContext sslContext), which delegates to
> `httpClientBuilder.setSSLContext`.
> If this is (y), I can start working on it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-10838) Add a withSSLContext builder method to ElasticsearchIO
[ https://issues.apache.org/jira/browse/BEAM-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478389#comment-17478389 ]

Etienne Chauchot commented on BEAM-10838:
-----------------------------------------

[~kenn] [~clandry94] +1 on this. [~clandry94] thanks for offering your help! I assigned the ticket to you

> Add a withSSLContext builder method to ElasticsearchIO
> -------------------------------------------------------
>
>                 Key: BEAM-10838
>                 URL: https://issues.apache.org/jira/browse/BEAM-10838
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-elasticsearch
>            Reporter: Conor Landry
>            Priority: P2
>
> Currently, the ElasticsearchIO transforms have only one way to securely
> read/write Elasticsearch: using the withKeystorePath builder method and
> providing the location of a keystore containing a client key in jks format.
> This is a bit limiting, especially for Elasticsearch users not depending on
> shield to secure their clusters. I'd like to propose the addition of the
> builder method withSSLContext(SSLContext sslContext), which delegates to
> `httpClientBuilder.setSSLContext`.
> If this is (y), I can start working on it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477044#comment-17477044 ]

Etienne Chauchot commented on BEAM-13137:
-----------------------------------------

Merged through [8c2e1fd|https://github.com/apache/beam/commit/8c2e1fd6109871496457c45ddcbfb10747c3cdcf]. Waiting for 2 weeks to monitor flakiness before closing the ticket. 100 local runs showed no flakiness; let's see with the load on the Jenkins servers.

> make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
> deterministic
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13137
>                 URL: https://issues.apache.org/jira/browse/BEAM-13137
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Assignee: Evan Galpin
>            Priority: P2
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Index size estimation is statistical in ES and varies. But it is the basis
> for splitting, so it needs to be more deterministic: the variance causes
> flakiness in the UTests _testSplit_ and _testSizes_, and may entail
> sub-optimal splitting in production in some cases.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476243#comment-17476243 ]

Etienne Chauchot commented on BEAM-13137:
-----------------------------------------

Yes, I saw it in the PR. Better indeed to make it explicit.

> make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
> deterministic
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13137
>                 URL: https://issues.apache.org/jira/browse/BEAM-13137
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-elasticsearch
>            Reporter: Etienne Chauchot
>            Assignee: Evan Galpin
>            Priority: P2
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Index size estimation is statistical in ES and varies. But it is the basis
> for splitting, so it needs to be more deterministic: the variance causes
> flakiness in the UTests _testSplit_ and _testSizes_, and may entail
> sub-optimal splitting in production in some cases.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (BEAM-13136) Clean leftovers of old ElasticSearchIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476104#comment-17476104 ] Etienne Chauchot commented on BEAM-13136: - [~egalpin] This one is not urgent (compared to the flakiness one you sent a PR for). And by the way, I added a leftover that I found yesterday to the description > Clean leftovers of old ElasticSearchIO versions / test mechanism > > > Key: BEAM-13136 > URL: https://issues.apache.org/jira/browse/BEAM-13136 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P2 > Labels: stale-assigned > > While auditing the flakiness issues, I saw leftovers in the test code and > production code of the IO. This list is not comprehensive but points to some > places: > * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split > the shards (missing API) so the splits are fixed to the number of shards (5 by > default in the test). As ES2 support was removed, this special case in the > test can be removed > * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell > test in the ES test framework. This framework is no longer used (replaced by > testContainers) so we could remove the hack in all places > * if (backendVersion == 2) in prod code > * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework > that is no longer used > * remove module elasticsearch-tests-2 > * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476093#comment-17476093 ] Etienne Chauchot commented on BEAM-13137: - [~egalpin] Thanks for taking a look ! There used to be stats-specific parameters in the tests that were removed when we moved to testContainers: - _settings.put("index.store.stats_refresh_interval", 0)_ in the index settings in the test. - _request.addParameters(Collections.singletonMap("refresh", "wait_for"));_ Maybe that is the cause of the increased flakiness in the stats-related features (testSplit and testSizes). They were only used in tests, so they did not impact users' production clusters. > make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes > deterministic > --- > > Key: BEAM-13137 > URL: https://issues.apache.org/jira/browse/BEAM-13137 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P2 > Time Spent: 0.5h > Remaining Estimate: 0h > > Index size estimation is statistical in ES and varies. But it is the base for > splitting so it needs to be more deterministic because that causes flakiness > in the UTests in _testSplit_ and _testSizes_ and maybe entails sub-optimal > splitting in production in some cases. -- This message was sent by Atlassian Jira (v8.20.1#820001)
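As a reference for restoring those two test-only settings, here is a sketch using the Elasticsearch low-level REST client that the tests rely on; the {{index}}, {{restClient}} and {{bulkPayload}} names are placeholders:

{code:java}
// Create the test index with always-fresh store stats so size estimation is stable.
Request createIndex = new Request("PUT", "/" + index);
createIndex.setJsonEntity("{\"settings\": {\"index.store.stats_refresh_interval\": \"0s\"}}");
restClient.performRequest(createIndex);

// Make bulk writes visible (and counted in the stats) before the request returns.
Request bulk = new Request("POST", "/" + index + "/_bulk");
bulk.addParameters(Collections.singletonMap("refresh", "wait_for"));
bulk.setJsonEntity(bulkPayload);
restClient.performRequest(bulk);
{code}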
[jira] [Updated] (BEAM-13136) Clean leftovers of old ElasticSearchIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-13136: Description: While auditing the flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but points to some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the splits are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (replaced by testContainers) so we could remove the hack in all places * if (backendVersion == 2) in prod code * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework that is no longer used * remove module elasticsearch-tests-2 * please audit for other places that could need cleaning. was: While auditing about flakiness issues, I saw lefovers on the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the split are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable jarHell test in ES test framework. This framework is no more used (testContainers) so we could remove the hack in all places * if (backendVersion == 2) in prod code * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) come from es test framework that is no longer used * please audit for other places that could need cleaning. > Clean leftovers of old ElasticSearchIO versions / test mechanism > > > Key: BEAM-13136 > URL: https://issues.apache.org/jira/browse/BEAM-13136 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P2 > Labels: stale-assigned > > While auditing the flakiness issues, I saw leftovers in the test code and > production code of the IO. This list is not comprehensive but points to some > places: > * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split > the shards (missing API) so the splits are fixed to the number of shards (5 by > default in the test). As ES2 support was removed, this special case in the > test can be removed > * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell > test in the ES test framework. This framework is no longer used (replaced by > testContainers) so we could remove the hack in all places > * if (backendVersion == 2) in prod code > * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework > that is no longer used > * remove module elasticsearch-tests-2 > * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-3926) Support MetricsPusher in Dataflow Runner
[ https://issues.apache.org/jira/browse/BEAM-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474507#comment-17474507 ] Etienne Chauchot commented on BEAM-3926: [~kenn] still no MetricsPusher support in the Dataflow runner > Support MetricsPusher in Dataflow Runner > > > Key: BEAM-3926 > URL: https://issues.apache.org/jira/browse/BEAM-3926 > Project: Beam > Issue Type: Sub-task > Components: runner-dataflow >Reporter: Scott Wegner >Priority: P3 > Time Spent: 5h 50m > Remaining Estimate: 0h > > See [relevant email > thread|https://lists.apache.org/thread.html/2e87f0adcdf8d42317765f298e3e6fdba72917a72d4a12e71e67e4b5@%3Cdev.beam.apache.org%3E]. > From [~echauchot]: > > _AFAIK Dataflow being a cloud hosted engine, the related runner is very > different from the others. It just submits a job to the cloud hosted engine. > So, no access to metrics container etc... from the runner. So I think that > the MetricsPusher (component responsible for merging metrics and pushing them > to a sink backend) must not be instantiated in DataflowRunner otherwise it > would be more a client (driver) piece of code and we would lose all the > interest of being close to the execution engine (among other things > instrumentation of the execution of the pipelines). I think that the > MetricsPusher needs to be instantiated in the actual Dataflow engine._ > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-4088) ExecutorServiceParallelExecutorTest#ensureMetricsThreadDoesntLeak in PR #4965 does not pass in gradle
[ https://issues.apache.org/jira/browse/BEAM-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474504#comment-17474504 ] Etienne Chauchot commented on BEAM-4088: [~kenn] still disabled, not fixed > ExecutorServiceParallelExecutorTest#ensureMetricsThreadDoesntLeak in PR #4965 > does not pass in gradle > - > > Key: BEAM-4088 > URL: https://issues.apache.org/jira/browse/BEAM-4088 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Etienne Chauchot >Priority: P3 > Labels: gradle > Fix For: Not applicable > > Time Spent: 3.5h > Remaining Estimate: 0h > > This is a new test being added to ensure threads don't leak. The failure > seems to indicate that threads do leak. > This test fails using gradle but previously passed using maven > PR: https://github.com/apache/beam/pull/4965 > Presubmit Failure: > * https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4059/ > * > https://scans.gradle.com/s/grha56432j3t2/tests/jqhvlvf72f7pg-ipde5etqqejoa?openStackTraces=WzBd -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474503#comment-17474503 ] Etienne Chauchot commented on BEAM-5172: [~kenn] ongoing related tests > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 50m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch
[jira] [Commented] (BEAM-3310) Push metrics to a backend in an runner agnostic way
[ https://issues.apache.org/jira/browse/BEAM-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474501#comment-17474501 ] Etienne Chauchot commented on BEAM-3310: [~kenn] it still lacks support for the other runners (see subtasks) > Push metrics to a backend in an runner agnostic way > --- > > Key: BEAM-3310 > URL: https://issues.apache.org/jira/browse/BEAM-3310 > Project: Beam > Issue Type: New Feature > Components: runner-extensions-metrics, sdk-java-core >Reporter: Etienne Chauchot >Priority: P3 > Labels: Clarified > Time Spent: 18h 50m > Remaining Estimate: 0h > > The idea is to avoid relying on the runners to provide access to the metrics > (either at the end of the pipeline or while it runs) because they do not all > have the same capabilities regarding metrics (e.g. the spark runner configures sinks > like csv, graphite or in-memory sinks using the spark engine conf). The > target is to push the metrics from the common runner code so that, no matter the > chosen runner, a user can get their metrics out of Beam. > Here is the link to the discussion thread on the dev ML: > https://lists.apache.org/thread.html/01a80d62f2df6b84bfa41f05e15fda900178f882877c294fed8be91e@%3Cdev.beam.apache.org%3E > And the design doc: > https://s.apache.org/runner_independent_metrics_extraction -- This message was sent by Atlassian Jira (v8.20.1#820001)
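As a usage illustration of the pushing mechanism described above, here is a sketch of how it is configured through pipeline options (assuming the metrics extension module and its MetricsHttpSink; the backend URL and push period are placeholders):

{code:java}
import org.apache.beam.runners.extensions.metrics.MetricsHttpSink;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.metrics.MetricsOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

MetricsOptions options = PipelineOptionsFactory.fromArgs(args).as(MetricsOptions.class);
options.setMetricsSink(MetricsHttpSink.class);                // sink that pushes metrics over HTTP
options.setMetricsHttpSinkUrl("http://metrics-backend:8080"); // placeholder backend URL
options.setMetricsPushPeriod(5L);                             // push period, in seconds
Pipeline pipeline = Pipeline.create(options);
{code}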
[jira] [Commented] (BEAM-3816) [nexmark] Something is slightly off with Query 6
[ https://issues.apache.org/jira/browse/BEAM-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474499#comment-17474499 ] Etienne Chauchot commented on BEAM-3816: AFAICT the PR just put the put the test to sickbay so it is not resolving the issue, we should not close the ticket IMHO > [nexmark] Something is slightly off with Query 6 > > > Key: BEAM-3816 > URL: https://issues.apache.org/jira/browse/BEAM-3816 > Project: Beam > Issue Type: Bug > Components: testing-nexmark >Reporter: Andrew Pilloud >Priority: P3 > Labels: easyfix, newbie, nexmark, test > Time Spent: 50m > Remaining Estimate: 0h > > java.lang.AssertionError: Query6/Query6.Stamp/ParMultiDo(Anonymous).output: > wrong pipeline output Expected: <[\{"seller":1048,"price":83609648}, > \{"seller":1052,"price":61788353}, \{"seller":1086,"price":33744823}, > \{"seller":1078,"price":19876735}, \{"seller":1058,"price":50692833}, > \{"seller":1044,"price":6719489}, \{"seller":1096,"price":31287415}, > \{"seller":1095,"price":37004879}, \{"seller":1082,"price":22528654}, > \{"seller":1006,"price":57288736}, \{"seller":1051,"price":3967261}, > \{"seller":1084,"price":6394160}, \{"seller":1020,"price":3871757}, > \{"seller":1007,"price":185293}, \{"seller":1031,"price":11840889}, > \{"seller":1080,"price":26896442}, \{"seller":1030,"price":294928}, > \{"seller":1066,"price":26839191}, \{"seller":1000,"price":28257749}, > \{"seller":1055,"price":17087173}, \{"seller":1072,"price":45662210}, > \{"seller":1057,"price":4568399}, \{"seller":1025,"price":29008970}, > \{"seller":1064,"price":85810641}, \{"seller":1040,"price":99819658}, > \{"seller":1014,"price":11256690}, \{"seller":1098,"price":97259323}, > \{"seller":1011,"price":20447800}, \{"seller":1092,"price":77520938}, > \{"seller":1010,"price":53323687}, \{"seller":1060,"price":70032044}, > \{"seller":1062,"price":29076960}, \{"seller":1075,"price":19451464}, > \{"seller":1087,"price":27669185}, \{"seller":1009,"price":22951354}, > \{"seller":1065,"price":71875611}, \{"seller":1063,"price":87596779}, > \{"seller":1021,"price":62918185}, \{"seller":1034,"price":18472448}, > \{"seller":1028,"price":68556008}, \{"seller":1070,"price":92550447}]> but: > was <[\{"seller":1048,"price":83609648}, \{"seller":1052,"price":61788353}, > \{"seller":1086,"price":33744823}, \{"seller":1078,"price":19876735}, > \{"seller":1058,"price":50692833}, \{"seller":1044,"price":6719489}, > \{"seller":1096,"price":31287415}, \{"seller":1095,"price":37004879}, > \{"seller":1082,"price":22528654}, \{"seller":1006,"price":57288736}, > \{"seller":1051,"price":3967261}, \{"seller":1084,"price":6394160}, > \{"seller":1000,"price":34395558}, \{"seller":1020,"price":3871757}, > \{"seller":1007,"price":185293}, \{"seller":1031,"price":11840889}, > \{"seller":1080,"price":26896442}, \{"seller":1030,"price":294928}, > \{"seller":1066,"price":26839191}, \{"seller":1055,"price":17087173}, > \{"seller":1072,"price":45662210}, \{"seller":1057,"price":4568399}, > \{"seller":1025,"price":29008970}, \{"seller":1064,"price":85810641}, > \{"seller":1040,"price":99819658}, \{"seller":1014,"price":11256690}, > \{"seller":1098,"price":97259323}, \{"seller":1011,"price":20447800}, > \{"seller":1092,"price":77520938}, \{"seller":1010,"price":53323687}, > \{"seller":1060,"price":70032044}, \{"seller":1062,"price":29076960}, > \{"seller":1075,"price":19451464}, \{"seller":1087,"price":27669185}, > \{"seller":1009,"price":22951354}, \{"seller":1065,"price":71875611}, > 
\{"seller":1063,"price":87596779}, \{"seller":1021,"price":62918185}, > \{"seller":1034,"price":18472448}, \{"seller":1028,"price":68556008}, > \{"seller":1070,"price":92550447}]> -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (BEAM-3816) [nexmark] Something is slightly off with Query 6
[ https://issues.apache.org/jira/browse/BEAM-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474499#comment-17474499 ] Etienne Chauchot edited comment on BEAM-3816 at 1/12/22, 12:55 PM: --- AFAICT the PR just put the test to sickbay so it is not resolving the issue, we should not close the ticket IMHO was (Author: echauchot): AFAICT the PR just put the put the test to sickbay so it is not resolving the issue, we should not close the ticket IMHO > [nexmark] Something is slightly off with Query 6 > > > Key: BEAM-3816 > URL: https://issues.apache.org/jira/browse/BEAM-3816 > Project: Beam > Issue Type: Bug > Components: testing-nexmark >Reporter: Andrew Pilloud >Priority: P3 > Labels: easyfix, newbie, nexmark, test > Time Spent: 50m > Remaining Estimate: 0h > > java.lang.AssertionError: Query6/Query6.Stamp/ParMultiDo(Anonymous).output: > wrong pipeline output Expected: <[\{"seller":1048,"price":83609648}, > \{"seller":1052,"price":61788353}, \{"seller":1086,"price":33744823}, > \{"seller":1078,"price":19876735}, \{"seller":1058,"price":50692833}, > \{"seller":1044,"price":6719489}, \{"seller":1096,"price":31287415}, > \{"seller":1095,"price":37004879}, \{"seller":1082,"price":22528654}, > \{"seller":1006,"price":57288736}, \{"seller":1051,"price":3967261}, > \{"seller":1084,"price":6394160}, \{"seller":1020,"price":3871757}, > \{"seller":1007,"price":185293}, \{"seller":1031,"price":11840889}, > \{"seller":1080,"price":26896442}, \{"seller":1030,"price":294928}, > \{"seller":1066,"price":26839191}, \{"seller":1000,"price":28257749}, > \{"seller":1055,"price":17087173}, \{"seller":1072,"price":45662210}, > \{"seller":1057,"price":4568399}, \{"seller":1025,"price":29008970}, > \{"seller":1064,"price":85810641}, \{"seller":1040,"price":99819658}, > \{"seller":1014,"price":11256690}, \{"seller":1098,"price":97259323}, > \{"seller":1011,"price":20447800}, \{"seller":1092,"price":77520938}, > \{"seller":1010,"price":53323687}, \{"seller":1060,"price":70032044}, > \{"seller":1062,"price":29076960}, \{"seller":1075,"price":19451464}, > \{"seller":1087,"price":27669185}, \{"seller":1009,"price":22951354}, > \{"seller":1065,"price":71875611}, \{"seller":1063,"price":87596779}, > \{"seller":1021,"price":62918185}, \{"seller":1034,"price":18472448}, > \{"seller":1028,"price":68556008}, \{"seller":1070,"price":92550447}]> but: > was <[\{"seller":1048,"price":83609648}, \{"seller":1052,"price":61788353}, > \{"seller":1086,"price":33744823}, \{"seller":1078,"price":19876735}, > \{"seller":1058,"price":50692833}, \{"seller":1044,"price":6719489}, > \{"seller":1096,"price":31287415}, \{"seller":1095,"price":37004879}, > \{"seller":1082,"price":22528654}, \{"seller":1006,"price":57288736}, > \{"seller":1051,"price":3967261}, \{"seller":1084,"price":6394160}, > \{"seller":1000,"price":34395558}, \{"seller":1020,"price":3871757}, > \{"seller":1007,"price":185293}, \{"seller":1031,"price":11840889}, > \{"seller":1080,"price":26896442}, \{"seller":1030,"price":294928}, > \{"seller":1066,"price":26839191}, \{"seller":1055,"price":17087173}, > \{"seller":1072,"price":45662210}, \{"seller":1057,"price":4568399}, > \{"seller":1025,"price":29008970}, \{"seller":1064,"price":85810641}, > \{"seller":1040,"price":99819658}, \{"seller":1014,"price":11256690}, > \{"seller":1098,"price":97259323}, \{"seller":1011,"price":20447800}, > \{"seller":1092,"price":77520938}, \{"seller":1010,"price":53323687}, > \{"seller":1060,"price":70032044}, \{"seller":1062,"price":29076960}, > 
\{"seller":1075,"price":19451464}, \{"seller":1087,"price":27669185}, > \{"seller":1009,"price":22951354}, \{"seller":1065,"price":71875611}, > \{"seller":1063,"price":87596779}, \{"seller":1021,"price":62918185}, > \{"seller":1034,"price":18472448}, \{"seller":1028,"price":68556008}, > \{"seller":1070,"price":92550447}]> -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13136) Clean leftovers of old ElasticSearchIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474494#comment-17474494 ] Etienne Chauchot commented on BEAM-13136: - [~egalpin] would you be able to work on this ? > Clean leftovers of old ElasticSearchIO versions / test mechanism > > > Key: BEAM-13136 > URL: https://issues.apache.org/jira/browse/BEAM-13136 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P2 > Labels: stale-assigned > > While auditing the flakiness issues, I saw leftovers in the test code and > production code of the IO. This list is not comprehensive but points to some > places: > * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split > the shards (missing API) so the splits are fixed to the number of shards (5 by > default in the test). As ES2 support was removed, this special case in the > test can be removed > * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell > test in the ES test framework. This framework is no longer used (replaced by > testContainers) so we could remove the hack in all places > * if (backendVersion == 2) in prod code > * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework > that is no longer used > * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot reassigned BEAM-13137: --- Assignee: Evan Galpin > make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes > deterministic > --- > > Key: BEAM-13137 > URL: https://issues.apache.org/jira/browse/BEAM-13137 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P2 > > Index size estimation is statistical in ES and varies. But it is the base for > splitting so it needs to be more deterministic because that causes flakiness > in the UTests in _testSplit_ and _testSizes_ and maybe entails sub-optimal > splitting in production in some cases. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
[ https://issues.apache.org/jira/browse/BEAM-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474493#comment-17474493 ] Etienne Chauchot commented on BEAM-13137: - [~egalpin] would you be able to work on this ? > make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes > deterministic > --- > > Key: BEAM-13137 > URL: https://issues.apache.org/jira/browse/BEAM-13137 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Priority: P2 > > Index size estimation is statistical in ES and varies. But it is the base for > splitting so it needs to be more deterministic because that causes flakiness > in the UTests in _testSplit_ and _testSizes_ and maybe entails sub-optimal > splitting in production in some cases. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453863#comment-17453863 ] Etienne Chauchot edited comment on BEAM-10430 at 12/6/21, 9:03 AM: --- Also, the important thing is to keep the Flink versions in total sync: the Flink version packaged with Beam must be exactly the same (even the minor version) as the Flink version packaged in the EMR environment was (Author: echauchot): Also the important is to be in total sync of Flink versions: Flink packaged with Beam must me exactly (even minor version) the same as Flink packaged in the EMR environment > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Assignee: Etienne Chauchot >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent by Atlassian Jira (v8.20.1#820001)
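As an illustration of that version pinning on the Beam side (a sketch; the version property is a placeholder), the Flink runner artifact carries the Flink minor version in its name, so it can be matched to the cluster's Flink, e.g. Flink 1.10 on EMR 5.30.0:

{code:java}
<!-- Maven: choose the runner artifact built against the cluster's exact Flink minor version -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-flink-1.10</artifactId>
  <version>${beam.version}</version>
  <scope>runtime</scope>
</dependency>
{code}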
[jira] [Commented] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453863#comment-17453863 ] Etienne Chauchot commented on BEAM-10430: - Also, the important thing is to keep the Flink versions in total sync: Flink packaged with Beam must be exactly the same (even the minor version) as Flink packaged in the EMR environment > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Assignee: Etienne Chauchot >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent 
by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453863#comment-17453863 ] Etienne Chauchot edited comment on BEAM-10430 at 12/6/21, 9:03 AM: --- Also, the important thing is to keep the Flink versions in total sync: Flink packaged with Beam must be exactly the same (even the minor version) as Flink packaged in the EMR environment was (Author: echauchot): Also the important is to be in total sync of Flink versions Flink packaged with Beam must me exactly (even minor version) the same as Flink packaged in the EMR environment > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Assignee: Etienne Chauchot >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-10430: Assignee: Etienne Chauchot Resolution: Not A Problem Status: Resolved (was: Open) > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Assignee: Etienne Chauchot >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453005#comment-17453005 ] Etienne Chauchot commented on BEAM-10430: - you might want to add also: <dependency><groupId>com.fasterxml.jackson.datatype</groupId><artifactId>jackson-datatype-jsr310</artifactId><version>${jackson.version}</version></dependency> > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453005#comment-17453005 ] Etienne Chauchot edited comment on BEAM-10430 at 12/3/21, 1:27 PM: --- you might want to add also: {code:java} <dependency> <groupId>com.fasterxml.jackson.datatype</groupId> <artifactId>jackson-datatype-jsr310</artifactId> <version>${jackson.version}</version> </dependency> {code} was (Author: echauchot): you might want to add also: <dependency><groupId>com.fasterxml.jackson.datatype</groupId><artifactId>jackson-datatype-jsr310</artifactId><version>${jackson.version}</version></dependency> > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at 
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-10430) Can't run WordCount on EMR With Flink Runner via YARN
[ https://issues.apache.org/jira/browse/BEAM-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452971#comment-17452971 ] Etienne Chauchot commented on BEAM-10430: - you need to add to your project: {code:java} <dependency> <groupId>com.fasterxml.jackson.module</groupId> <artifactId>jackson-module-jaxb-annotations</artifactId> <version>${jackson.version}</version> </dependency> {code} > Can't run WordCount on EMR With Flink Runner via YARN > - > > Key: BEAM-10430 > URL: https://issues.apache.org/jira/browse/BEAM-10430 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-flink >Affects Versions: 2.22.0 > Environment: AWS EMR 5.30.0 running Spark 2.4.5, Flink 1.10.0 >Reporter: Shashi >Priority: P3 > Labels: Clarified > Time Spent: 4h 10m > Remaining Estimate: 0h > > 1) I setup WordCount project as detailed on Beam website.. > {{mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=2.22.0 \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false}} > 2) mvn clean package -Pflink-runner > 3) Ran the application on AWS EMR 5.30.0 with Flink 1.10.0 > flink run -m yarn-cluster -yid -p 4 -c > org.apache.beam.examples.WordCount word-count-beam-bundled-0.1.jar > –runner=FlinkRunner --inputFile --output > > 4) Launch failed with the following exception stack trace > java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: > Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype > at java.util.ServiceLoader.fail(ServiceLoader.java:239) > at java.util.ServiceLoader.access$300(ServiceLoader.java:185) > at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376) > at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > at > com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1054) > at > org.apache.beam.sdk.options.PipelineOptionsFactory.(PipelineOptionsFactory.java:471) > at org.apache.beam.examples.WordCount.main(WordCount.java:190) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework
[ https://issues.apache.org/jira/browse/BEAM-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443867#comment-17443867 ] Etienne Chauchot commented on BEAM-8470: Hi [~yunta], we can run Nexmark, but only in batch mode (it is run as part of the post-commit tests). I'm no longer working on this topic, but if you want to contribute the streaming mode, I'd be happy to help > Create a new Spark runner based on Spark Structured streaming framework > --- > > Key: BEAM-8470 > URL: https://issues.apache.org/jira/browse/BEAM-8470 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > Labels: structured-streaming > Fix For: 2.18.0 > > Time Spent: 16h > Remaining Estimate: 0h > > h1. Why is it worth creating a new runner based on structured streaming: > Because this new framework brings: > * Unified batch and streaming semantics: > * no more RDD/DStream distinction, as in Beam (only PCollection) > * Better state management: > * incremental state instead of saving everything each time > * No more synchronous saving delaying computation: per batch and partition > delta file saved asynchronously + in-memory hashmap synchronous put/get > * Schemas in datasets: > * The dataset knows the structure of the data (fields) and can optimize > later on > * Schemas in PCollection in Beam > * New Source API > * Very close to Beam bounded and unbounded sources > h1. Why make a new runner from scratch? > * The Structured Streaming framework is very different from the RDD/DStream > framework > h1. We hope to gain > * More up-to-date runner in terms of libraries: leverage new features > * Leverage practices learnt from the previous runners > * Better performance thanks to the DAG optimizer (catalyst) and by > simplifying the code. > * Simplify the code and ease maintenance > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13203) Potential data loss when using SnsIO.writeAsync
[ https://issues.apache.org/jira/browse/BEAM-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441190#comment-17441190 ] Etienne Chauchot commented on BEAM-13203: - yes, a proper async mechanism with retries, a limited number of buffered requests, and a timed wait for the async responses seems to be necessary. If the write is backpressured, there is no way to slow down the source (because its watermark will go forward), so the records will indeed be buffered in the write operation > Potential data loss when using SnsIO.writeAsync > --- > > Key: BEAM-13203 > URL: https://issues.apache.org/jira/browse/BEAM-13203 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > > This needs to be investigated, reading the code suggests we might be losing > data under certain conditions e.g. when terminating the pipeline. The async > processing model here is far too simplistic. > The bundle won't ever know about pending writes and won't block to wait for > any such operation. The same way exceptions are thrown into nowhere. Test > cases don't capture this as they operate on completed futures only (so > exceptions in the callbacks get thrown on the thread of processElement). > {code:java} > client.publish(publishRequest).whenComplete((response, ex) -> { > if (ex == null) { > SnsResponse snsResponse = SnsResponse.of(context.element(), response); > context.output(snsResponse); > } else { > LOG.error("Error while publishing request to SNS", ex); > throw new SnsWriteException("Error while publishing request to SNS", ex); > } > }); {code} > Also, this entirely removes backpressure from a stream. When used with a much > faster source we will continue to accumulate more and more memory as the > number of concurrent pending async operations is not limited. > Spotify's scio contains a > [JavaAsyncDoFn|https://github.com/spotify/scio/blob/main/scio-core/src/main/java/com/spotify/scio/transforms/JavaAsyncDoFn.java] > that illustrates how it can be done. -- This message was sent by Atlassian Jira (v8.20.1#820001)
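To make that pattern concrete, here is a minimal sketch (not SnsIO's actual implementation; the client setup, the in-flight limit, and the timeout are assumptions) of an async write DoFn that caps buffered requests and does a timed wait for all pending responses in @FinishBundle, so failures fail the bundle instead of disappearing:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.transforms.DoFn;
import software.amazon.awssdk.services.sns.SnsAsyncClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

class BoundedAsyncSnsWriteFn extends DoFn<PublishRequest, Void> {
  private transient SnsAsyncClient client;
  private transient Semaphore inFlight;
  private transient List<CompletableFuture<?>> pending;

  @Setup
  public void setup() {
    client = SnsAsyncClient.create();
    inFlight = new Semaphore(50); // limit on buffered requests (arbitrary)
    pending = Collections.synchronizedList(new ArrayList<>());
  }

  @ProcessElement
  public void processElement(@Element PublishRequest request) throws InterruptedException {
    inFlight.acquire(); // blocks when too many writes are pending: back-pressures the bundle
    pending.add(client.publish(request).whenComplete((resp, ex) -> inFlight.release()));
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    // Timed wait for all pending responses; any failure or timeout fails the bundle.
    CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).get(2, TimeUnit.MINUTES);
    pending.clear();
  }

  @Teardown
  public void teardown() {
    client.close();
  }
}
{code}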
[jira] [Commented] (BEAM-13209) DynamoDBIO silently drops unprocessed items
[ https://issues.apache.org/jira/browse/BEAM-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441188#comment-17441188 ] Etienne Chauchot commented on BEAM-13209: - We should implement retries as was done in ElasticsearchIO. There are classes for retry purposes in the SDK > DynamoDBIO silently drops unprocessed items > --- > > Key: BEAM-13209 > URL: https://issues.apache.org/jira/browse/BEAM-13209 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > Labels: aws, data-loss, dynamodb > > `[batchWriteItem|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html]` > might fail partially and return unprocessed items. Such partial failures are > not handled and result in a data loss. > If hitting DynamoDB at scale it's rather likely to run into this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
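For illustration, a rough sketch of the retry loop suggested above, reusing the SDK's FluentBackoff utilities (the same classes ElasticsearchIO uses for its retry logic); the {{client}} and {{requestItems}} names, the retry count, and the initial backoff are assumptions:

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.util.BackOff;
import org.apache.beam.sdk.util.BackOffUtils;
import org.apache.beam.sdk.util.FluentBackoff;
import org.apache.beam.sdk.util.Sleeper;
import org.joda.time.Duration;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemResponse;
import software.amazon.awssdk.services.dynamodb.model.WriteRequest;

static void writeWithRetries(DynamoDbClient client, Map<String, List<WriteRequest>> requestItems)
    throws IOException, InterruptedException {
  BackOff backoff =
      FluentBackoff.DEFAULT
          .withMaxRetries(5)
          .withInitialBackoff(Duration.standardSeconds(1))
          .backoff();
  Map<String, List<WriteRequest>> remaining = requestItems;
  while (true) {
    // Resubmit only the items DynamoDB reported as unprocessed on the previous attempt.
    BatchWriteItemResponse response =
        client.batchWriteItem(BatchWriteItemRequest.builder().requestItems(remaining).build());
    remaining = response.unprocessedItems();
    if (remaining.isEmpty()) {
      return; // everything was eventually written
    }
    if (!BackOffUtils.next(Sleeper.DEFAULT, backoff)) {
      // Retries exhausted: fail loudly instead of silently dropping the remaining items.
      throw new IOException(remaining.size() + " table(s) still have unprocessed items after retries");
    }
  }
}
{code}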
[jira] [Commented] (BEAM-13209) DynamoDBIO silently drops unprocessed items
[ https://issues.apache.org/jira/browse/BEAM-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441185#comment-17441185 ] Etienne Chauchot commented on BEAM-13209: - regarding the data-loss comment, I'd raise the priority > DynamoDBIO silently drops unprocessed items > --- > > Key: BEAM-13209 > URL: https://issues.apache.org/jira/browse/BEAM-13209 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > Labels: aws, data-loss, dynamodb > > `[batchWriteItem|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html]` > might fail partially and return unprocessed items. Such partial failures are > not handled and result in a data loss. > If hitting DynamoDB at scale it's rather likely to run into this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-13209) DynamoDBIO silently drops unprocessed items
[ https://issues.apache.org/jira/browse/BEAM-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-13209: Priority: P0 (was: P2) > DynamoDBIO silently drops unprocessed items > --- > > Key: BEAM-13209 > URL: https://issues.apache.org/jira/browse/BEAM-13209 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > Labels: aws, data-loss, dynamodb > > `[batchWriteItem|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html]` > might fail partially and return unprocessed items. Such partial failures are > not handled and result in a data loss. > If hitting DynamoDB at scale it's rather likely to run into this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13209) DynamoDBIO silently drops unprocessed items
[ https://issues.apache.org/jira/browse/BEAM-13209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441186#comment-17441186 ] Etienne Chauchot commented on BEAM-13209: - Thanks for identifying this [~mosche] ! > DynamoDBIO silently drops unprocessed items > --- > > Key: BEAM-13209 > URL: https://issues.apache.org/jira/browse/BEAM-13209 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > Labels: aws, data-loss, dynamodb > > `[batchWriteItem|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html]` > might fail partially and return unprocessed items. Such partial failures are > not handled and result in a data loss. > If hitting DynamoDB at scale it's rather likely to run into this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13203) Potential data loss when using SnsIO.writeAsync
[ https://issues.apache.org/jira/browse/BEAM-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441184#comment-17441184 ] Etienne Chauchot commented on BEAM-13203: - Thanks for identifying this [~mosche] ! > Potential data loss when using SnsIO.writeAsync > --- > > Key: BEAM-13203 > URL: https://issues.apache.org/jira/browse/BEAM-13203 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > > This needs to be investigated, reading the code suggests we might be losing > data under certain conditions e.g. when terminating the pipeline. The async > processing model here is far too simplistic. > The bundle won't ever know about pending writes and won't block to wait for > any such operation. The same way exceptions are thrown into nowhere. Test > cases don't capture this as they operate on completed futures only (so > exceptions in the callbacks get thrown on the thread of processElement). > {code:java} > client.publish(publishRequest).whenComplete((response, ex) -> { > if (ex == null) { > SnsResponse snsResponse = SnsResponse.of(context.element(), response); > context.output(snsResponse); > } else { > LOG.error("Error while publishing request to SNS", ex); > throw new SnsWriteException("Error while publishing request to SNS", ex); > } > }); {code} > Also, this entirely removes backpressure from a stream. When used with a much > faster source we will continue to accumulate more and more memory as the > number of concurrent pending async operations is not limited. > Spotify's scio contains a > [JavaAsyncDoFn|https://github.com/spotify/scio/blob/main/scio-core/src/main/java/com/spotify/scio/transforms/JavaAsyncDoFn.java] > that illustrates how it can be done. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13203) Potential data loss when using SnsIO.writeAsync
[ https://issues.apache.org/jira/browse/BEAM-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441183#comment-17441183 ] Etienne Chauchot commented on BEAM-13203: - regarding the data-loss comment, I'd raise the priority > Potential data loss when using SnsIO.writeAsync > --- > > Key: BEAM-13203 > URL: https://issues.apache.org/jira/browse/BEAM-13203 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > > This needs to be investigated, reading the code suggests we might be losing > data under certain conditions e.g. when terminating the pipeline. The async > processing model here is far too simplistic. > The bundle won't ever know about pending writes and won't block to wait for > any such operation. The same way exceptions are thrown into nowhere. Test > cases don't capture this as they operate on completed futures only (so > exceptions in the callbacks get thrown on the thread of processElement). > {code:java} > client.publish(publishRequest).whenComplete((response, ex) -> { > if (ex == null) { > SnsResponse snsResponse = SnsResponse.of(context.element(), response); > context.output(snsResponse); > } else { > LOG.error("Error while publishing request to SNS", ex); > throw new SnsWriteException("Error while publishing request to SNS", ex); > } > }); {code} > Also, this entirely removes backpressure from a stream. When used with a much > faster source we will continue to accumulate more and more memory as the > number of concurrent pending async operations is not limited. > Spotify's scio contains a > [JavaAsyncDoFn|https://github.com/spotify/scio/blob/main/scio-core/src/main/java/com/spotify/scio/transforms/JavaAsyncDoFn.java] > that illustrates how it can be done. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-13203) Potential data loss when using SnsIO.writeAsync
[ https://issues.apache.org/jira/browse/BEAM-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-13203: Priority: P0 (was: P1) > Potential data loss when using SnsIO.writeAsync > --- > > Key: BEAM-13203 > URL: https://issues.apache.org/jira/browse/BEAM-13203 > Project: Beam > Issue Type: Bug > Components: io-java-aws >Reporter: Moritz Mack >Priority: P0 > > This needs to be investigated, reading the code suggests we might be losing > data under certain conditions e.g. when terminating the pipeline. The async > processing model here is far too simplistic. > The bundle won't ever know about pending writes and won't block to wait for > any such operation. The same way exceptions are thrown into nowhere. Test > cases don't capture this as they operate on completed futures only (so > exceptions in the callbacks get thrown on the thread of processElement). > {code:java} > client.publish(publishRequest).whenComplete((response, ex) -> { > if (ex == null) { > SnsResponse snsResponse = SnsResponse.of(context.element(), response); > context.output(snsResponse); > } else { > LOG.error("Error while publishing request to SNS", ex); > throw new SnsWriteException("Error while publishing request to SNS", ex); > } > }); {code} > Also, this entirely removes backpressure from a stream. When used with a much > faster source we will continue to accumulate more and more memory as the > number of concurrent pending async operations is not limited. > Spotify's scio contains a > [JavaAsyncDoFn|https://github.com/spotify/scio/blob/main/scio-core/src/main/java/com/spotify/scio/transforms/JavaAsyncDoFn.java] > that illustrates how it can be done. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441170#comment-17441170 ] Etienne Chauchot commented on BEAM-5172: [~ibzib] ok fair enough > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 50m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(Refle
[jira] [Commented] (BEAM-12730) Add custom delimiters to Python TextIO reads
[ https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438866#comment-17438866 ] Etienne Chauchot commented on BEAM-12730: - done: assigned to 2.35 and removed the labels, as this PR requires parallel-splitting knowledge, so it is indeed not a newbie task. > Add custom delimiters to Python TextIO reads > > > Key: BEAM-12730 > URL: https://issues.apache.org/jira/browse/BEAM-12730 > Project: Beam > Issue Type: New Feature > Components: io-py-common, io-py-files >Reporter: Daniel Oliveira >Assignee: Dmitrii Kuzin >Priority: P2 > Fix For: 2.35.0 > > Time Spent: 25h 50m > Remaining Estimate: 0h > > A common request by users is to be able to separate a text files read by > TextIO with delimiters other than newline. The Java SDK already supports this > feature. > The current delimiter code is [located > here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236] > and defaults to newlines. This function could easily be modified to also > handle custom delimiters. Changing this would also necessitate changing the > API for the various TextIO.Read methods and adding documentation. > This seems like a good starter bug for making more in-depth contributions to > Beam Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12730) Add custom delimiters to Python TextIO reads
[ https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12730: Labels: (was: beginner newbie starter) > Add custom delimiters to Python TextIO reads > > > Key: BEAM-12730 > URL: https://issues.apache.org/jira/browse/BEAM-12730 > Project: Beam > Issue Type: New Feature > Components: io-py-common, io-py-files >Reporter: Daniel Oliveira >Assignee: Dmitrii Kuzin >Priority: P2 > Fix For: 2.35.0 > > Time Spent: 25h 50m > Remaining Estimate: 0h > > A common request by users is to be able to separate a text files read by > TextIO with delimiters other than newline. The Java SDK already supports this > feature. > The current delimiter code is [located > here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236] > and defaults to newlines. This function could easily be modified to also > handle custom delimiters. Changing this would also necessitate changing the > API for the various TextIO.Read methods and adding documentation. > This seems like a good starter bug for making more in-depth contributions to > Beam Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12730) Add custom delimiters to Python TextIO reads
[ https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12730: Fix Version/s: (was: 2.34.0) 2.35.0 > Add custom delimiters to Python TextIO reads > > > Key: BEAM-12730 > URL: https://issues.apache.org/jira/browse/BEAM-12730 > Project: Beam > Issue Type: New Feature > Components: io-py-common, io-py-files >Reporter: Daniel Oliveira >Assignee: Dmitrii Kuzin >Priority: P2 > Labels: beginner, newbie, starter > Fix For: 2.35.0 > > Time Spent: 25h 50m > Remaining Estimate: 0h > > A common request by users is to be able to separate a text files read by > TextIO with delimiters other than newline. The Java SDK already supports this > feature. > The current delimiter code is [located > here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236] > and defaults to newlines. This function could easily be modified to also > handle custom delimiters. Changing this would also necessitate changing the > API for the various TextIO.Read methods and adding documentation. > This seems like a good starter bug for making more in-depth contributions to > Beam Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438613#comment-17438613 ] Etienne Chauchot commented on BEAM-5172: No worries about the lack of bandwidth, and no hurry about the fix. This test suite has been flaky for months, so I guess it is fine considering that the flakiness frequency has dropped. I would rather leave the ticket as it is, as a reminder to fix it (but no hurry), than ignore it and possibly forget about it. > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 40m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodA
[jira] [Comment Edited] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435498#comment-17435498 ] Etienne Chauchot edited comment on BEAM-5172 at 11/4/21, 10:12 AM: --- Took a look at the 3 flaky tests: 1. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/19333/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/] : root cause is that there are no splits on ES6. The only possible cause is that the estimated index size is < desiredBundleSizeBytes (2000 in the test) 2. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4189/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSizes/]once again the culprit is the index size estimation: the estimate is 209 whereas the average doc size is 25 and there are 40 docs. => I recall that size estimation is statistical in ES and might vary over time, hence the error margin included in the tests. I think we should give the tests more margin but also fix the size estimation. I opened this ticket for that. I also opened a cleaning ticket while reviewing the split code. 3. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/]this one worries me more: if no scientist matches Einstein out of the 40 docs inserted (with 10 different scientists), then either the insert was not done or not indexed (too much load and docs not yet inserted, or not enough wait before indexing time), or the match query was not properly passed by the queryconfigurer in the unit test. [~egalpin] can you take a look with these hints in mind? was (Author: echauchot): Took a look at the 3 flakiness: 1. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/19333/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/] : root cause is that there is no splits on ES6. The only possible cause is estimated index size is < desiredBundleSizeBytes (2000 in the test) 2. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4189/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSizes/]once again the guilty one is the index size estimation estimation is 209 whereas average size is 25 and 40 docs. => I recall size estimation is statistical in ES and might vary over time hence the error margin included in the tests. I think we should give the tests more margin but also fix size estimation. I opened [this|https://issues.apache.org/jira/browse/BEAM-13137] ticket for that. I also opened a [cleaning ticket|https://issues.apache.org/jira/browse/BEAM-13136] while reviewing the split code. 3. 
[https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/]this once worries me more: if there is no scientist matching Einstein out of 40 docs inserted (with 10 different scientists) it is either that the insert was not done or indexed (too much load and docs not yet inserted or not enough wait before indexing time) or the match query was not properly passed by the queryconfigurer in the unit test. [~egalpin] can you take a look with this hints in mind ? > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 40m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Ex
[jira] [Commented] (BEAM-12730) Add custom delimiters to Python TextIO reads
[ https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435918#comment-17435918 ] Etienne Chauchot commented on BEAM-12730: - Hi [~EugeneNikolaiev]. It's been years since I coded that, so I had to take a look at the code. Yes, a bundle split can happen at any byte, so it can cut a multi-byte delimiter; we therefore have to go backward in the parsing to be sure we get the whole delimiter and preserve data completeness. Regarding self overlap, take {{||}} for example: how should we interpret {{abc|||xyz}} - as {{abc|}}, {{xyz}} - or as {{abc}}, {{|xyz}}? And how do we consistently enforce this interpretation if the file is split by the runner into bundles differently each time? At parsing time I cannot rewind the offset by {{(separator.size + 1)}} to allow only one byte of overlap, nor can I rewind the offset by {{(2*separator.size)}} to allow maximum overlap, because it might produce duplicate records if {{record.size < separator.size}}. Nor can I catch anything to state that the file format is wrong in case of overlap, because I will get no exception, just flaky record content depending on the runner / source split point. So we decided to disallow overlapping delimiters. I advise disallowing overlapping delimiters for Python as well, for the same reasons plus consistency with the Java SDK > Add custom delimiters to Python TextIO reads > > > Key: BEAM-12730 > URL: https://issues.apache.org/jira/browse/BEAM-12730 > Project: Beam > Issue Type: New Feature > Components: io-py-common, io-py-files >Reporter: Daniel Oliveira >Assignee: Dmitrii Kuzin >Priority: P2 > Labels: beginner, newbie, starter > Time Spent: 18h 10m > Remaining Estimate: 0h > > A common request by users is to be able to separate a text files read by > TextIO with delimiters other than newline. The Java SDK already supports this > feature. > The current delimiter code is [located > here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236] > and defaults to newlines. This function could easily be modified to also > handle custom delimiters. Changing this would also necessitate changing the > API for the various TextIO.Read methods and adding documentation. > This seems like a good starter bug for making more in-depth contributions to > Beam Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
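A tiny standalone illustration of the ambiguity described in that comment (plain Java, not Beam code; the input is the {{abc|||xyz}} example from the comment):

{code:java}
// With the self-overlapping delimiter "||", the same bytes admit two
// consistent parses, and which one you get depends on where the scan
// starts, i.e. on where the runner happened to split the bundle.
public class OverlapDemo {
  public static void main(String[] args) {
    String input = "abc|||xyz";
    // Scanning from offset 0 finds "||" at index 3: records "abc" and "|xyz".
    System.out.println(input.substring(0, input.indexOf("||"))); // abc
    // A scan resuming at offset 4 finds "||" at index 4: records "abc|" and "xyz".
    System.out.println(input.substring(0, input.indexOf("||", 4))); // abc|
  }
}
{code}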
[jira] [Comment Edited] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435498#comment-17435498 ] Etienne Chauchot edited comment on BEAM-5172 at 10/29/21, 9:12 AM: --- Took a look at the 3 flaky tests: 1. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/19333/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/] : root cause is that there are no splits on ES6. The only possible cause is that the estimated index size is < desiredBundleSizeBytes (2000 in the test) 2. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4189/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSizes/]once again the culprit is the index size estimation: the estimate is 209 whereas the average doc size is 25 and there are 40 docs. => I recall that size estimation is statistical in ES and might vary over time, hence the error margin included in the tests. I think we should give the tests more margin but also fix the size estimation. I opened [this|https://issues.apache.org/jira/browse/BEAM-13137] ticket for that. I also opened a [cleaning ticket|https://issues.apache.org/jira/browse/BEAM-13136] while reviewing the split code. 3. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/]this one worries me more: if no scientist matches Einstein out of the 40 docs inserted (with 10 different scientists), then either the insert was not done or not indexed (too much load and docs not yet inserted, or not enough wait before indexing time), or the match query was not properly passed by the queryconfigurer in the unit test. [~egalpin] can you take a look with these hints in mind? was (Author: echauchot): Took a look at the 3 flakiness: 1. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/19333/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/] : root cause is that there is no splits on ES6. The only possible cause is estimated index size is < desiredBundleSizeBytes (2000 in the test) 2. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4189/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSizes/]once again the guilty one is the index size estimation estimation is 209 whereas average size is 25 and 40 docs. => I recall size estimation is statistical in ES and might vary over time hence the error margin included in the tests. I think we should give the tests more margin but also fix size estimation. I opened this ticket for that. I also opened a cleaning ticket while reviewing the split code. 3. 
[https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/]this once worries me more: if there is no scientist matching Einstein out of 40 docs inserted (with 10 different scientists) it is either that the insert was not done or indexed (too much load and docs not yet inserted or not enough wait before indexing time) or the match query was not properly passed by the queryconfigurer in the unit test. [~egalpin] can you take a look with this hints in mind ? > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 40m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > E
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435498#comment-17435498 ] Etienne Chauchot commented on BEAM-5172: Took a look at the 3 flaky tests: 1. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/19333/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/] : root cause is that there are no splits on ES6. The only possible cause is that the estimated index size is < desiredBundleSizeBytes (2000 in the test) 2. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4189/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSizes/]once again the culprit is the index size estimation: the estimate is 209 whereas the average doc size is 25 and there are 40 docs. => I recall that size estimation is statistical in ES and might vary over time, hence the error margin included in the tests. I think we should give the tests more margin but also fix the size estimation. I opened this ticket for that. I also opened a cleaning ticket while reviewing the split code. 3. [https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/ |https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/4147/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testReadWithQueryString/]this one worries me more: if no scientist matches Einstein out of the 40 docs inserted (with 10 different scientists), then either the insert was not done or not indexed (too much load and docs not yet inserted, or not enough wait before indexing time), or the match query was not properly passed by the queryconfigurer in the unit test. [~egalpin] can you take a look with these hints in mind? 
> org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 40m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRu
[jira] [Updated] (BEAM-13136) Clean leftovers of old ESIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-13136: Description: While auditing the flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the splits are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (replaced by testContainers) so we could remove the hack in all places * if (backendVersion == 2) in prod code * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework that is no longer used * please audit for other places that could need cleaning. was: While auditing about flakiness issues, I saw lefovers on the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the split are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable jarHell test in ES test framework. This framework is no more used (testContainers) so we could remove the hack in all places * if (backendVersion == 2) in prod code * please audit for other places that could need cleaning. > Clean leftovers of old ESIO versions / test mechanism > - > > Key: BEAM-13136 > URL: https://issues.apache.org/jira/browse/BEAM-13136 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P0 > > While auditing the flakiness issues, I saw leftovers in the test code and > production code of the IO. This list is not comprehensive but spots some > places: > * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split > the shards (missing API) so the splits are fixed to the number of shards (5 by > default in the test). As ES2 support was removed, this special case in the > test can be removed > * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell > test in the ES test framework. This framework is no longer used (replaced by > testContainers) so we could remove the hack in all places > * if (backendVersion == 2) in prod code > * @ThreadLeakScope(ThreadLeakScope.Scope.NONE) comes from the ES test framework > that is no longer used > * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-13137) make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic
Etienne Chauchot created BEAM-13137: --- Summary: make ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic Key: BEAM-13137 URL: https://issues.apache.org/jira/browse/BEAM-13137 Project: Beam Issue Type: Improvement Components: io-java-elasticsearch Reporter: Etienne Chauchot Assignee: Evan Galpin Index size estimation is statistical in ES and varies over time. But it is the basis for splitting, so it needs to be more deterministic: the variance causes flakiness in the unit tests _testSplit_ and _testSizes_, and it may entail sub-optimal splitting in production in some cases. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-13136) Clean leftovers of old ESIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-13136: Description: While auditing the flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the splits are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (replaced by testContainers) so we could remove the hack in all places * if (backendVersion == 2) in prod code * please audit for other places that could need cleaning. was: While auditing about flakiness issues, I saw lefovers on the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the split are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable jarHell test in ES test framework. This framework is no more used (testContainers) so we could remove the hack in all places * please audit for other places that could need cleaning. > Clean leftovers of old ESIO versions / test mechanism > - > > Key: BEAM-13136 > URL: https://issues.apache.org/jira/browse/BEAM-13136 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P0 > > While auditing the flakiness issues, I saw leftovers in the test code and > production code of the IO. This list is not comprehensive but spots some > places: > * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split > the shards (missing API) so the splits are fixed to the number of shards (5 by > default in the test). As ES2 support was removed, this special case in the > test can be removed > * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell > test in the ES test framework. This framework is no longer used (replaced by > testContainers) so we could remove the hack in all places > * if (backendVersion == 2) in prod code > * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-13136) Clean leftovers of old ESIO versions / test mechanism
[ https://issues.apache.org/jira/browse/BEAM-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-13136: Description: While auditing the flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the splits are fixed to the number of shards (5 by default in the test). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (replaced by testContainers) so we could remove the hack in all places * please audit for other places that could need cleaning. was: While auditing about flakiness issues, I saw lefovers on the test code and production code of the IO. This list is not comprehensive but stops some places: * In ESTestCommon in testSplit there is a refence to ES2: ES2 cannot split the shards (missing API) so the split are fixed to the number of shards (5). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable jarHell test in ES test framework. This framework is no more used (testContainers) so we could remove the hack in all places * please audit for other places that could need cleaning. > Clean leftovers of old ESIO versions / test mechanism > - > > Key: BEAM-13136 > URL: https://issues.apache.org/jira/browse/BEAM-13136 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Etienne Chauchot >Assignee: Evan Galpin >Priority: P0 > > While auditing the flakiness issues, I saw leftovers in the test code and > production code of the IO. This list is not comprehensive but spots some > places: > * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split > the shards (missing API) so the splits are fixed to the number of shards (5 by > default in the test). As ES2 support was removed, this special case in the > test can be removed > * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell > test in the ES test framework. This framework is no longer used (replaced by > testContainers) so > we could remove the hack in all places > * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-13136) Clean leftovers of old ESIO versions / test mechanism
Etienne Chauchot created BEAM-13136: --- Summary: Clean leftovers of old ESIO versions / test mechanism Key: BEAM-13136 URL: https://issues.apache.org/jira/browse/BEAM-13136 Project: Beam Issue Type: Improvement Components: io-java-elasticsearch Reporter: Etienne Chauchot Assignee: Evan Galpin While auditing the flakiness issues, I saw leftovers in the test code and production code of the IO. This list is not comprehensive but spots some places: * In ESTestCommon#testSplit there is a reference to ES2: ES2 cannot split the shards (missing API) so the splits are fixed to the number of shards (5). As ES2 support was removed, this special case in the test can be removed * org.elasticsearch.bootstrap.JarHell was a hacky fix to disable the jarHell test in the ES test framework. This framework is no longer used (replaced by testContainers) so we could remove the hack in all places * please audit for other places that could need cleaning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17431031#comment-17431031 ] Etienne Chauchot commented on BEAM-5172: [This|https://github.com/apache/beam/commit/a05aa45dac5e7295a56634f0916b7019bc6782e5] commit should solve the issue. [~egalpin] can you monitor the ES unit tests status for a week (checking just once within the week is enough) so that we know the flake has disappeared, and then close the ticket? > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake, sickbay > Time Spent: 2h 40m > Remaining Estimate: 0h > > In a recent PostCommit builld, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
[jira] [Updated] (BEAM-10990) Elasticsearch IO Infinite loop with write Error when the pipeline job streaming mode
[ https://issues.apache.org/jira/browse/BEAM-10990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-10990: Fix Version/s: 2.35.0 Resolution: Fixed Status: Resolved (was: Triage Needed) 2.34.0 was cut on 10/06. As this ticket is not a blocker, I won't cherry-pick it and will plan it for 2.35.0 > Elasticsearch IO Infinite loop with write Error when the pipeline job > streaming mode > - > > Key: BEAM-10990 > URL: https://issues.apache.org/jira/browse/BEAM-10990 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Affects Versions: 2.24.0 >Reporter: Steven Gaunt >Assignee: Evan Galpin >Priority: P3 > Fix For: 2.35.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > When streaming messages from PubsubIO, the pipeline is in streaming mode. > If for some reason the ElasticsearchIO.Write() gets an error response from the > Elasticsearch index API, the WriteFn will throw an IOException. Since this > exception is part of the Write transform, it becomes an unhandled error. > This will then inherently cause the job pipeline to infinitely > retry that error. > _*snippet from beam website*_ > The Dataflow service retries failed tasks up to 4 times in batch mode, and an > unlimited number of times in streaming mode. In batch mode, your job will > fail; in streaming, it may stall indefinitely. > > This is the ElasticsearchIO.write transform. > {code:java} > public PDone expand(PCollection input) { > ElasticsearchIO.ConnectionConfiguration connectionConfiguration = > this.getConnectionConfiguration(); > Preconditions.checkState(connectionConfiguration != null, > "withConnectionConfiguration() is required"); >* input.apply(ParDo.of(new ElasticsearchIO.Write.WriteFn(this)));* > return PDone.in(input.getPipeline()); > } > {code} > The ParDo function (WriteFn) finishBundle step will call the > ElasticsearchIO.checkForErrors helper method, which will throw an exception if > the HTTP response from Elasticsearch has an error in the JSON response. 
> {code:java} > // Some comments here > public void finishBundle(DoFn.FinishBundleContext context) > throws IOException, InterruptedException { > this.flushBatch(); > } > private void flushBatch() throws IOException, > InterruptedException { > if (!this.batch.isEmpty()) { > StringBuilder bulkRequest = new StringBuilder(); > Iterator var2 = this.batch.iterator(); > while(var2.hasNext()) { > String json = (String)var2.next(); > bulkRequest.append(json); > } > this.batch.clear(); > this.currentBatchSizeBytes = 0L; > String endPoint = String.format("/%s/%s/_bulk", > this.spec.getConnectionConfiguration().getIndex(), > this.spec.getConnectionConfiguration().getType()); > HttpEntity requestBody = new > NStringEntity(bulkRequest.toString(), ContentType.APPLICATION_JSON); > Request request = new Request("POST", endPoint); > request.addParameters(Collections.emptyMap()); > request.setEntity(requestBody); > Response response = > this.restClient.performRequest(request); > HttpEntity responseEntity = new > BufferedHttpEntity(response.getEntity()); > if (this.spec.getRetryConfiguration() != null && > this.spec.getRetryConfiguration().getRetryPredicate().test(responseEntity)) { > responseEntity = this.handleRetry("POST", endPoint, > Collections.emptyMap(), requestBody); > } > > ElasticsearchIO.checkForErrors((HttpEntity)responseEntity, > this.backendVersion, this.spec.getUsePartialUpdate()); > } > } > static void checkForErrors(HttpEntity responseEntity, int backendVersion, > boolean partialUpdate) throws IOException { > JsonNode searchResult = parseResponse(responseEntity); > boolean errors = searchResult.path("errors").asBoolean(); > if (errors) { > StringBuilder errorMessages = new StringBuilder("Error writing to > Elasticsearch, some elements could not be inserted:"); > JsonNode items = searchResult.path("items"); > Iterator var7 = items.iterator(); > while(var7.hasNext()) { > JsonNode item = (JsonNode)var7.next(); > String errorRootName = ""; > if (partialUpdate) { >
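For streaming jobs, a common way to avoid that indefinite stall is a dead-letter output: catch the write failure and route the offending documents to a secondary output instead of throwing. The sketch below is a generic illustration of that pattern under assumed names, not the actual fix that was merged for this ticket:

{code:java}
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

// Generic dead-letter pattern: failed writes go to a side output that can be
// logged or persisted for replay, so the bundle succeeds and the streaming
// job does not retry the same poison documents forever.
class WriteWithDeadLetterFn extends DoFn<String, String> {
  static final TupleTag<String> SUCCESS = new TupleTag<String>() {};
  static final TupleTag<String> FAILED = new TupleTag<String>() {};

  @ProcessElement
  public void processElement(@Element String doc, MultiOutputReceiver out) {
    try {
      writeToElasticsearch(doc); // placeholder for the real bulk write
      out.get(SUCCESS).output(doc);
    } catch (Exception e) {
      // Instead of throwing (which means unbounded retries in streaming),
      // emit the document to the dead-letter output.
      out.get(FAILED).output(doc);
    }
  }

  private void writeToElasticsearch(String doc) throws Exception {
    // Hypothetical single-document write; the real IO batches in
    // finishBundle, which is what makes throwing there so disruptive.
  }
}
{code}

Wiring this up would use ParDo.of(new WriteWithDeadLetterFn()).withOutputTags(WriteWithDeadLetterFn.SUCCESS, TupleTagList.of(WriteWithDeadLetterFn.FAILED)).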
[jira] [Commented] (BEAM-4383) Enable block size support in ParquetIO
[ https://issues.apache.org/jira/browse/BEAM-4383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429971#comment-17429971 ] Etienne Chauchot commented on BEAM-4383: [~ŁukaszG] is it not already implemented in ParquetIO? I see a Sink#withRowGroupSize() > Enable block size support in ParquetIO > -- > > Key: BEAM-4383 > URL: https://issues.apache.org/jira/browse/BEAM-4383 > Project: Beam > Issue Type: Improvement > Components: io-ideas, io-java-parquet >Reporter: Lukasz Gajowy >Priority: P3 > > The Parquet API allows setting the block size, which can improve IO performance when > working with Parquet files. Currently, ParquetIO does not support it at > all, so it looks like room for improvement for this IO. > A good intro to this topic: [https://www.dremio.com/tuning-parquet/] -- This message was sent by Atlassian Jira (v8.3.4#803005)
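For reference, the Sink#withRowGroupSize() mentioned in the comment above is used through FileIO; a minimal write sketch, assuming an Avro schema is at hand (the collection, schema variable and output path are illustrative):
{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<GenericRecord> records = ...;
records.apply(
    "WriteParquet",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(schema).withRowGroupSize(128 * 1024 * 1024)) // 128 MiB row groups
        .to("gs://my-bucket/parquet/")); // hypothetical output location
{code}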
[jira] [Updated] (BEAM-12849) Structured Streaming runner: Migrate metrics sources
[ https://issues.apache.org/jira/browse/BEAM-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12849: Assignee: Etienne Chauchot Resolution: Not A Problem Status: Resolved (was: Triage Needed) > Structured Streaming runner: Migrate metrics sources > > > Key: BEAM-12849 > URL: https://issues.apache.org/jira/browse/BEAM-12849 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > Labels: structured-streaming > Time Spent: 3.5h > Remaining Estimate: 0h > > java.lang.ClassNotFoundException: org.apache.spark.metrics.source.Source no > longer exists in Spark 3. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12909) Add spark3 to nexmark runs
[ https://issues.apache.org/jira/browse/BEAM-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12909: Fix Version/s: Not applicable Resolution: Fixed Status: Resolved (was: Triage Needed) > Add spark3 to nexmark runs > -- > > Key: BEAM-12909 > URL: https://issues.apache.org/jira/browse/BEAM-12909 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > Fix For: Not applicable > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-7093) Support Spark 3 in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-7093: --- Resolution: Fixed Status: Resolved (was: Open) > Support Spark 3 in Spark runner > --- > > Key: BEAM-7093 > URL: https://issues.apache.org/jira/browse/BEAM-7093 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Etienne Chauchot >Priority: P3 > Fix For: 2.34.0 > > Time Spent: 28h 50m > Remaining Estimate: 0h > > Spark 3 is the next release of Spark, we need to fix some issues before we > are ready to support it. See subtasks for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-12909) Add spark3 to nexmark runs
[ https://issues.apache.org/jira/browse/BEAM-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot reassigned BEAM-12909: --- Assignee: Etienne Chauchot > Add spark3 to nexmark runs > -- > > Key: BEAM-12909 > URL: https://issues.apache.org/jira/browse/BEAM-12909 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-12909) Add spark3 to nexmark runs
Etienne Chauchot created BEAM-12909: --- Summary: Add spark3 to nexmark runs Key: BEAM-12909 URL: https://issues.apache.org/jira/browse/BEAM-12909 Project: Beam Issue Type: Sub-task Components: runner-spark Reporter: Etienne Chauchot -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-12849) Structured Streaming runner: Migrate metrics sources
[ https://issues.apache.org/jira/browse/BEAM-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416141#comment-17416141 ] Etienne Chauchot commented on BEAM-12849: - it seems to be a Gradle configuration issue: the Gradle *provided* scope does not put the Spark libs on the runtime classpath. I saw that by running Nexmark (it is a main) which uses a local (in-memory) Spark cluster > Structured Streaming runner: Migrate metrics sources > > > Key: BEAM-12849 > URL: https://issues.apache.org/jira/browse/BEAM-12849 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Priority: P2 > Labels: structured-streaming > > java.lang.ClassNotFoundException: org.apache.spark.metrics.source.Source no > longer exists in Spark 3. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12153) OOM on GBK in SparkRunner and SparkStructuredStreamingRunner
[ https://issues.apache.org/jira/browse/BEAM-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12153: Fix Version/s: Not applicable Resolution: Not A Problem Status: Resolved (was: Open) > OOM on GBK in SparkRunner and SparkStructuredStreamingRunner > > > Key: BEAM-12153 > URL: https://issues.apache.org/jira/browse/BEAM-12153 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P3 > Fix For: Not applicable > > Time Spent: 10h 10m > Remaining Estimate: 0h > > We have been experiencing OOM errors on GroupByKey in batch mode in both > Spark runners, even though behind the scenes Spark spills data to disk in such > cases: taking a look at the translation in the two runners, it might be due > to using ReduceFnRunner for merging windows in the GBK translation. > ReduceFnRunner.processElements expects to have all elements available to merge the > windows between each other: > RDD Spark runner: > https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkGroupAlsoByWindowViaOutputBufferFn.java#L99 > Spark Structured Streaming runner: > [https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L74|https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L106] > Even replacing the Iterable with an Iterator in ReduceFnRunner to avoid > materialization does not work because deep in > ReduceFnRunner.processElements(), the collection is iterated twice. > It could be better to do what the Flink runner does and translate GBK as > CombinePerKey with a Concatenate combine fn and thus avoid element > materialization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
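A sketch of the Concatenate combine fn the description proposes, modeled on the Flink runner's approach (names assumed; BEAM-12727 below tracks extracting such a class to a shared util):
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;

/** Concatenates all input values into a single list, combining incrementally. */
class Concatenate<T> extends Combine.CombineFn<T, List<T>, List<T>> {
  @Override
  public List<T> createAccumulator() {
    return new ArrayList<>();
  }

  @Override
  public List<T> addInput(List<T> accumulator, T input) {
    accumulator.add(input);
    return accumulator;
  }

  @Override
  public List<T> mergeAccumulators(Iterable<List<T>> accumulators) {
    List<T> result = createAccumulator();
    for (List<T> accumulator : accumulators) {
      result.addAll(accumulator);
    }
    return result;
  }

  @Override
  public List<T> extractOutput(List<T> accumulator) {
    return accumulator;
  }
}
{code}
Translating GBK as Combine.perKey(new Concatenate<>()) lets the runner build the per-key list incrementally instead of handing ReduceFnRunner a fully materialized Iterable.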
[jira] [Commented] (BEAM-12153) OOM on GBK in SparkRunner and SparkStructuredStreamingRunner
[ https://issues.apache.org/jira/browse/BEAM-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415412#comment-17415412 ] Etienne Chauchot commented on BEAM-12153: - Also, the materialization happens inside a [UDF|https://github.com/apache/beam/blob/b362a53c3515e0caf6c12ba93508a2645947cc29/runners/spark/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L77], so if this UDF consumes too much memory, Spark will spill to disk. The OOM observed was due to the disk being full. So I think we can safely close this ticket. > OOM on GBK in SparkRunner and SparkStructuredStreamingRunner > > > Key: BEAM-12153 > URL: https://issues.apache.org/jira/browse/BEAM-12153 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P3 > Time Spent: 9h 50m > Remaining Estimate: 0h > > We have been experiencing OOM errors on GroupByKey in batch mode in both > Spark runners, even though behind the scenes Spark spills data to disk in such > cases: taking a look at the translation in the two runners, it might be due > to using ReduceFnRunner for merging windows in the GBK translation. > ReduceFnRunner.processElements expects to have all elements available to merge the > windows between each other: > RDD Spark runner: > https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkGroupAlsoByWindowViaOutputBufferFn.java#L99 > Spark Structured Streaming runner: > [https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L74|https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L106] > Even replacing the Iterable with an Iterator in ReduceFnRunner to avoid > materialization does not work because deep in > ReduceFnRunner.processElements(), the collection is iterated twice. > It could be better to do what the Flink runner does and translate GBK as > CombinePerKey with a Concatenate combine fn and thus avoid element > materialization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-12153) OOM on GBK in SparkRunner and SparkStructuredStreamingRunner
[ https://issues.apache.org/jira/browse/BEAM-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414962#comment-17414962 ] Etienne Chauchot commented on BEAM-12153: - Ran the GroupByKeyLoadTests test with {code:java} --sourceOptions={\"numRecords\":20,\"keySizeBytes\":10,\"valueSizeBytes\":1,\"hotKeyFraction\":1.0,\"numHotKeys\":1}"{code} The benchmark finished successfully > OOM on GBK in SparkRunner and SparkStructuredStreamingRunner > > > Key: BEAM-12153 > URL: https://issues.apache.org/jira/browse/BEAM-12153 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P3 > Time Spent: 9h 20m > Remaining Estimate: 0h > > We have been experiencing OOM errors on GroupByKey in batch mode in both > Spark runners, even though behind the scenes Spark spills data to disk in such > cases: taking a look at the translation in the two runners, it might be due > to using ReduceFnRunner for merging windows in the GBK translation. > ReduceFnRunner.processElements expects to have all elements available to merge the > windows between each other: > RDD Spark runner: > https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkGroupAlsoByWindowViaOutputBufferFn.java#L99 > Spark Structured Streaming runner: > [https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L74|https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L106] > Even replacing the Iterable with an Iterator in ReduceFnRunner to avoid > materialization does not work because deep in > ReduceFnRunner.processElements(), the collection is iterated twice. > It could be better to do what the Flink runner does and translate GBK as > CombinePerKey with a Concatenate combine fn and thus avoid element > materialization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-12153) OOM on GBK in SparkRunner and SparkStructuredStreamingRunner
[ https://issues.apache.org/jira/browse/BEAM-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414936#comment-17414936 ] Etienne Chauchot commented on BEAM-12153: - Ran the GroupByKeyLoadTests test with {code:java} --sourceOptions={\"numRecords\":20,\"keySizeBytes\":10,\"valueSizeBytes\":10}" {code} and got a SparkOutOfMemoryError because Spark saturates the filesystem when spilling to disk: {code:java} 21/09/14 15:22:08 ERROR org.apache.spark.memory.TaskMemoryManager: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@57a81b49 java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:223) at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:176) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:252) at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:133) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:544) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:226) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:206) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:285) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:117) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:400) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:419) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:454) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:140) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.GroupedIterator$.apply(GroupedIterator.scala:29) at org.apache.spark.sql.execution.MapGroupsExec$$anonfun$10.apply(objects.scala:331) at org.apache.spark.sql.execution.MapGroupsExec$$anonfun$10.apply(objects.scala:330) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.sql.execution.SQLExecutionRDD.compute(SQLExecutionRDD.scala:55) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run
[jira] [Commented] (BEAM-7093) Support Spark 3 in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413939#comment-17413939 ] Etienne Chauchot commented on BEAM-7093: yes of course, I forgot to move it to 2.34.0. Thanks! > Support Spark 3 in Spark runner > --- > > Key: BEAM-7093 > URL: https://issues.apache.org/jira/browse/BEAM-7093 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Etienne Chauchot >Priority: P3 > Fix For: 2.34.0 > > Time Spent: 28h 50m > Remaining Estimate: 0h > > Spark 3 is the next release of Spark, we need to fix some issues before we > are ready to support it. See subtasks for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12727) deduplication: Concatenate CombineFn
[ https://issues.apache.org/jira/browse/BEAM-12727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12727: Fix Version/s: 2.34.0 Resolution: Fixed Status: Resolved (was: In Progress) > deduplication: Concatenate CombineFn > > > Key: BEAM-12727 > URL: https://issues.apache.org/jira/browse/BEAM-12727 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > Fix For: 2.34.0 > > > This class is present in the flink, dataflow and samza runners, duplicated > inside each of them. We need to extract the code to a runner-core util. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7093) Support Spark 3 in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411797#comment-17411797 ] Etienne Chauchot commented on BEAM-7093: [~ibzib] no it is not a regression, just something that was forgotten in the migration to spark 3, and it only affects the Structured Streaming runner. I updated the title of the task to reflect that. Also, that makes me think that the ValidatesRunner (VR) coverage is not enough on metrics > Support Spark 3 in Spark runner > --- > > Key: BEAM-7093 > URL: https://issues.apache.org/jira/browse/BEAM-7093 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Etienne Chauchot >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 28h 50m > Remaining Estimate: 0h > > Spark 3 is the next release of Spark, we need to fix some issues before we > are ready to support it. See subtasks for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12849) Structured Streaming runner: Migrate metrics sources
[ https://issues.apache.org/jira/browse/BEAM-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12849: Summary: Structured Streaming runner: Migrate metrics sources (was: Migrate metrics sources) > Structured Streaming runner: Migrate metrics sources > > > Key: BEAM-12849 > URL: https://issues.apache.org/jira/browse/BEAM-12849 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Priority: P2 > Labels: structured-streaming > > java.lang.ClassNotFoundException: org.apache.spark.metrics.source.Source no > longer exists in Spark 3. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-7093) Support Spark 3 in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411190#comment-17411190 ] Etienne Chauchot edited comment on BEAM-7093 at 9/7/21, 12:30 PM: -- reopening because beam metrics are not properly supported, see https://issues.apache.org/jira/browse/BEAM-12849 was (Author: echauchot): reopening because beam metrics are not properly supported > Support Spark 3 in Spark runner > --- > > Key: BEAM-7093 > URL: https://issues.apache.org/jira/browse/BEAM-7093 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Etienne Chauchot >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 28h 50m > Remaining Estimate: 0h > > Spark 3 is the next release of Spark, we need to fix some issues before we > are ready to support it. See subtasks for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-12849) Migrate metrics sources
Etienne Chauchot created BEAM-12849: --- Summary: Migrate metrics sources Key: BEAM-12849 URL: https://issues.apache.org/jira/browse/BEAM-12849 Project: Beam Issue Type: Sub-task Components: runner-spark Reporter: Etienne Chauchot java.lang.ClassNotFoundException: org.apache.spark.metrics.source.Source no longer exists in Spark 3. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (BEAM-7093) Support Spark 3 in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot reopened BEAM-7093: reopening because beam metrics are not properly supported > Support Spark 3 in Spark runner > --- > > Key: BEAM-7093 > URL: https://issues.apache.org/jira/browse/BEAM-7093 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Etienne Chauchot >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 28h 50m > Remaining Estimate: 0h > > Spark 3 is the next release of Spark, we need to fix some issues before we > are ready to support it. See subtasks for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409459#comment-17409459 ] Etienne Chauchot edited comment on BEAM-5172 at 9/3/21, 11:38 AM: -- actually they are flaky in all ES versions, in my [ongoing PR|https://github.com/apache/beam/pull/15267] I have sickbayed these tests as requested by the reviewer, but this ticket needs attention as it is dangerous to leave ignored tests for too long, so I raised the importance of the ticket. [~egalpin] can you take a look at the flakes please? I took the liberty to assign it to you, feel free to unassign yourself if not suited. was (Author: echauchot): actually they are flaky in all ES versions, in my ongoing PR I have sickbayed these tests as requested by the reviewer, but this ticket needs attention as it is dangerous to leave ignored tests for too long, so I raised the importance of the ticket. [~egalpin] can you take a look at the flakes please? I took the liberty to assign it to you, feel free to unassign yourself if not suited. > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Assignee: Evan Galpin >Priority: P0 > Labels: flake > Fix For: 2.9.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > In a recent PostCommit build, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.exe
[jira] [Updated] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-5172: --- Priority: P0 (was: P1) > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Assignee: Evan Galpin >Priority: P0 > Labels: flake > Fix For: 2.9.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > In a recent PostCommit build, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(Refl
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409459#comment-17409459 ] Etienne Chauchot commented on BEAM-5172: actually they are flaky in all ES versions, in my ongoing PR I have sickbayed these tests as requested by the reviewer, but this ticket needs attention as it is dangerous to leave ignored tests for too long, so I raised the importance of the ticket. [~egalpin] can you take a look at the flakes please? I took the liberty to assign it to you, feel free to unassign yourself if not suited. > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake > Fix For: 2.9.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > In a recent PostCommit build, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAcces
[jira] [Assigned] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot reassigned BEAM-5172: -- Assignee: Evan Galpin > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Assignee: Evan Galpin >Priority: P1 > Labels: flake > Fix For: 2.9.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > In a recent PostCommit build, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch
[jira] [Commented] (BEAM-5172) org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409348#comment-17409348 ] Etienne Chauchot commented on BEAM-5172: testSizes and testSplit seem to be flaky in ES5. I propose to sickbay them pending a resolution. [~egalpin] WDYT? > org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky > - > > Key: BEAM-5172 > URL: https://issues.apache.org/jira/browse/BEAM-5172 > Project: Beam > Issue Type: Bug > Components: io-java-elasticsearch, test-failures >Affects Versions: 2.31.0 >Reporter: Valentyn Tymofieiev >Priority: P1 > Labels: flake > Fix For: 2.9.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > In a recent PostCommit build, > https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1290/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testRead/ > failed with: > Error Message > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > Stacktrace > java.lang.AssertionError: Count/Flatten.PCollections.out: > Expected: <400L> > but: was <470L> > at > org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:168) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:413) > at org.apache.beam.sdk.testing.PAssert.thatSingleton(PAssert.java:404) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTestCommon.testRead(ElasticsearchIOTestCommon.java:124) > at > org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testRead(ElasticsearchIOTest.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.gradle.internal.dispatch.ReflectionDispat
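For context, "sickbaying" a test means temporarily disabling it with a pointer back to the tracking ticket so it is not forgotten; roughly (test name taken from the comment above, body elided):
{code:java}
import org.junit.Ignore;
import org.junit.Test;

@Ignore("https://issues.apache.org/jira/browse/BEAM-5172 - flaky, pending resolution")
@Test
public void testSizes() throws Exception {
  // ...
}
{code}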
[jira] [Work started] (BEAM-12727) deduplication: Concatenate CombineFn
[ https://issues.apache.org/jira/browse/BEAM-12727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-12727 started by Etienne Chauchot. --- > deduplication: Concatenate CombineFn > > > Key: BEAM-12727 > URL: https://issues.apache.org/jira/browse/BEAM-12727 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > > This class is present in the flink, dataflow and samza runners, duplicated > inside each of them. We need to extract the code to a runner-core util. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12601) Support append-only indices in ES output
[ https://issues.apache.org/jira/browse/BEAM-12601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12601: Fix Version/s: 2.33.0 Resolution: Done Status: Resolved (was: Triage Needed) > Support append-only indices in ES output > - > > Key: BEAM-12601 > URL: https://issues.apache.org/jira/browse/BEAM-12601 > Project: Beam > Issue Type: Improvement > Components: io-java-elasticsearch >Reporter: Andres Rodriguez >Priority: P2 > Fix For: 2.33.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Currently, the Apache Beam Elasticsearch sink is > [using|https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L1532] > the > [index|https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-api-request-body] > bulk API operation to add data to the target index. When using append-only > indices it is better to use the > [create|https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-api-request-body] > operation. This also applies to new append-only indices, like [data > streams|https://www.elastic.co/guide/en/elasticsearch/reference/7.x/use-a-data-stream.html#add-documents-to-a-data-stream]. > The scope of this improvement is to add a new boolean configuration option, > {{append-only}}, to the Elasticsearch sink, with a default value of {{false}} > (to keep the current behaviour) that, when enabled, makes it use the {{create}} > operation instead of the {{index}} one when sending data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
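A usage sketch of the new option, assuming it surfaced on the write transform as withAppendOnly(...) (host, index name and type are hypothetical):
{code:java}
ElasticsearchIO.write()
    .withConnectionConfiguration(
        ElasticsearchIO.ConnectionConfiguration.create(
            new String[] {"http://localhost:9200"}, "my-data-stream", "_doc"))
    .withAppendOnly(true) // emit bulk `create` operations instead of `index`
{code}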
[jira] [Created] (BEAM-12727) deduplication: Concatenate CombineFn
Etienne Chauchot created BEAM-12727: --- Summary: deduplication: Concatenate CombineFn Key: BEAM-12727 URL: https://issues.apache.org/jira/browse/BEAM-12727 Project: Beam Issue Type: Improvement Components: runner-core Reporter: Etienne Chauchot Assignee: Etienne Chauchot This class is present in the flink, dataflow and samza runners, duplicated inside each of them. We need to extract the code to a runner-core util. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12630) Spark Structured Streaming runner: deal with change in dataStreamWriter wait for termination
[ https://issues.apache.org/jira/browse/BEAM-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12630: Resolution: Done Status: Resolved (was: In Progress) > Spark Structured Streaming runner: deal with change in dataStreamWriter wait > for termination > > > Key: BEAM-12630 > URL: https://issues.apache.org/jira/browse/BEAM-12630 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > > Termination of the pipeline has changed in Spark 3; we need to update the way we run > the pipelines. -- This message was sent by Atlassian Jira (v8.3.4#803005)
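For orientation, waiting on a streaming write in Spark looks roughly like the fragment below: start() hands back a StreamingQuery that the runner must wait on explicitly (the dataset and console sink are illustrative, and the exact change the ticket addresses is not shown here):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<Row> dataset = ...; // the translated pipeline output
StreamingQuery query = dataset.writeStream().format("console").start();
query.awaitTermination(); // blocks until the query stops or fails (throws StreamingQueryException)
{code}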
[jira] [Updated] (BEAM-7093) Support Spark 3 in Spark runner
[ https://issues.apache.org/jira/browse/BEAM-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-7093: --- Resolution: Done Status: Resolved (was: Triage Needed) > Support Spark 3 in Spark runner > --- > > Key: BEAM-7093 > URL: https://issues.apache.org/jira/browse/BEAM-7093 > Project: Beam > Issue Type: New Feature > Components: runner-spark >Reporter: Ismaël Mejía >Assignee: Etienne Chauchot >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 28h 20m > Remaining Estimate: 0h > > Spark 3 is the next release of Spark, we need to fix some issues before we > are ready to support it. See subtasks for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-12629) Spark Structured Streaming runner: migrate Beam sources wrappers
[ https://issues.apache.org/jira/browse/BEAM-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-12629: Resolution: Done Status: Resolved (was: In Progress) > Spark Structured Streaming runner: migrate Beam sources wrappers > > > Key: BEAM-12629 > URL: https://issues.apache.org/jira/browse/BEAM-12629 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: P2 > Labels: structured-streaming > > DataSourceV2 API no longer exists in spark 3. Need to migrate to new API -- This message was sent by Atlassian Jira (v8.3.4#803005)