Re: S3 parquet sink - failed with S3 connection exception

2019-03-14 Thread Averell
Hi Kostas and everyone,

I tried changing setFailOnCheckpointingErrors from true to false, and got
the following trace in the Flink GUI when the checkpoint/upload failed. I am
not sure whether it is of any help in identifying the issue.
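
For reference, the change was along these lines (a minimal sketch against the
Flink 1.7 DataStream API, not my actual job code; the interval shown is just
illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 15 minutes (illustrative interval).
        env.enableCheckpointing(15 * 60 * 1000L);

        // When set to false, a failed checkpoint is only declined/logged
        // instead of failing the task (Flink 1.7 API; later releases
        // deprecate this in favour of a tolerable-failure count).
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);
    }
}
```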

BTW, could you please tell me where to find the log file that the Flink GUI's
Exceptions tab reads from?

Thanks and regards,
Averell

java.lang.ArrayIndexOutOfBoundsException: 122626
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainLongDictionaryValuesWriter.fallBackDictionaryEncodedData(DictionaryValuesWriter.java:397)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.fallBackAllValuesTo(DictionaryValuesWriter.java:130)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.fallBack(FallbackValuesWriter.java:153)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.checkFallback(FallbackValuesWriter.java:147)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeLong(FallbackValuesWriter.java:181)
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:228)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addLong(MessageColumnIO.java:449)
at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:327)
at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
at org.apache.flink.formats.parquet.ParquetBulkWriter.addElement(ParquetBulkWriter.java:52)
at org.apache.flink.streaming.api.functions.sink.filesystem.BulkPartWriter.write(BulkPartWriter.java:50)
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.write(Bucket.java:214)
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.onElement(Buckets.java:268)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.invoke(StreamingFileSink.java:370)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)






--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: S3 parquet sink - failed with S3 connection exception

2019-03-10 Thread Averell
Hi Kostas, and everyone,

Just an update on my issue. I have tried the following:
 * Changed the S3-related configuration in Hadoop as suggested by the Hadoop
documentation [1]: increased /fs.s3a.threads.max/ from 10 to 100, and
/fs.s3a.connection.maximum/ from 15 to 120. For reference, I have only
3 S3 sinks, with parallelisms of 4, 4, and 1.
 * Followed AWS's documentation [2] to increase the EMRFS maxConnections to
200. However, I doubt this makes any difference, because when creating the
S3 parquet bucket sink I had to use an "s3a://..." path; "s3://..." does
not seem to be supported by Flink yet.
 * Reduced the parallelism of my S3 continuous file reader.
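
Concretely, the two Hadoop settings were raised roughly like this in
core-site.xml (the values are the ones I tried, not recommendations):

```xml
<!-- core-site.xml (Hadoop/EMR): S3A connection tuning -->
<property>
  <name>fs.s3a.threads.max</name>
  <value>100</value> <!-- was 10: max threads for S3A transfers -->
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>120</value> <!-- was 15: max simultaneous S3 connections -->
</property>
```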

However, the problem still occurs randomly (random across job executions;
when it occurs, the only fix is to cancel the job and restart from the last
successful checkpoint).

Thanks and regards,
Averell

[1] Hadoop-AWS module: Integration with Amazon Web Services
[2] emr-timeout-connection-wait





Re: S3 parquet sink - failed with S3 connection exception

2019-03-05 Thread Averell
Hello Kostas,

Thanks for your time.

I started that job fresh, with the checkpoint interval set to 15 minutes. It
completed the first 13 checkpoints successfully and only started failing from
the 14th. I waited through about 20 more checkpoints, but all of them failed.
Then I cancelled the job and restored from the last successful checkpoint,
and there were no more issues.

Today I had another try, restoring from yesterday's last successful
checkpoint. The result: I got the same error from the first checkpoint after
the restore.
I cancelled and restored once more, and there has been no issue since (35
more checkpoints already).
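
For reference, the cancel-and-restore workaround maps onto the Flink CLI
roughly as follows (the job ID, checkpoint path, and jar name below are
placeholders, not my actual values):

```shell
# Cancel the job that keeps failing its checkpoints
flink cancel <jobId>

# Resume from the _metadata of the last successful retained checkpoint
# (the -s flag accepts a checkpoint path as well as a savepoint path)
flink run -s s3a://<bucket>/checkpoints/<jobId>/chk-<n>/_metadata my-job.jar
```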

Regarding my job: I have 6 different S3-file-source streams
connected/unioned together, and then connected to a 7th S3-file-source
broadcast stream. The sinks are S3 parquet files and Elasticsearch.
Checkpointing is incremental and uses RocksDB.
This broadcast stream is one of the new changes in my job; the previous
version, with 4 of those 6 sources, had been running well for more than a
month without any issue.
The TM/JM logs for the first run today (the failing one) are attached.
The Yarn/EMR cluster is dedicated to this job.

I have a feeling that the issue comes from the broadcast stream (as
mentioned in the documentation, broadcast state doesn't use RocksDB), but I
am not quite sure.
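
For context, the S3 parquet sinks are built roughly along these lines (a
simplified sketch against Flink 1.7's StreamingFileSink and flink-parquet,
not the actual job code; MyRecord and the bucket path are placeholders):

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkSketch {
    // Placeholder POJO standing in for the real event type.
    public static class MyRecord {
        public long timestamp;
        public String value;
    }

    public static void attachSink(DataStream<MyRecord> stream) {
        StreamingFileSink<MyRecord> sink = StreamingFileSink
                // Bulk formats such as Parquet roll a new part file on
                // every checkpoint, which ties the sink to checkpointing.
                .forBulkFormat(
                        new Path("s3a://my-bucket/output"),   // placeholder
                        ParquetAvroWriters.forReflectRecord(MyRecord.class))
                .build();

        stream.addSink(sink).setParallelism(4);
    }
}
```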

Thanks and regards,
Averell

Attachment: logs.gz





Re: S3 parquet sink - failed with S3 connection exception

2019-03-05 Thread Kostas Kloudas
Hi Averell,

Did you have other failures before (from which you managed to resume
successfully)?
Can you share a bit more detail about your job, and potentially the TM/JM
logs?

The only thing I found about this is here:
https://forums.aws.amazon.com/thread.jspa?threadID=130172
but Flink does not call getObject() directly; we rely on Hadoop code to
read objects. In addition, we do not interact with the connection pool
directly.

Cheers,
Kostas

On Mon, Mar 4, 2019 at 1:53 PM Averell  wrote:

> Hello everyone,
>
> I have a job which is writing some streams into parquet files in S3. I use
> Flink 1.7.2 on EMR 5.21.
> My job had been running well, but suddenly it failed to make a checkpoint
> with the full stack trace mentioned below. After that failure, the job
> restarted from the last successful checkpoint, but it could not make any
> further checkpoint - all subsequent checkpoints failed with the same
> reason.
> Searching on Internet I could only find one explanation: S3Object has not
> been closed properly.
>
> Could someone please help?
>
> Thanks and regards,
> Averell
>
>
> [stack trace snipped; identical to the one in the original message below]

S3 parquet sink - failed with S3 connection exception

2019-03-04 Thread Averell
Hello everyone,

I have a job that writes some streams into parquet files on S3. I use
Flink 1.7.2 on EMR 5.21.
My job had been running well, but it suddenly failed to make a checkpoint,
with the full stack trace below. After that failure, the job restarted from
the last successful checkpoint, but it could not make any further
checkpoints - all subsequent ones failed for the same reason.
Searching on the Internet, I could only find one explanation: an S3Object
that has not been closed properly.

Could someone please help?

Thanks and regards,
Averell


The program finished with the following exception:

org.apache.flink.util.FlinkException: Triggering a savepoint for the job 8cf68432acb3b4ced2cbc97bc23b4af5 failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:723)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:701)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:698)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1065)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.Exception: Checkpoint failed: Could not perform checkpoint 33 for operator Sink: S3 - Instant (1/4).
at org.apache.flink.runtime.jobmaster.JobMaster.lambda$triggerSavepoint$13(JobMaster.java:972)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException: java.lang.Exception: Checkpoint failed: Could not perform checkpoint 33 for operator Sink: S3 - Instant (1/4).
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abortWithCause(PendingCheckpoint.java:452)
at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abortError(PendingCheckpoint.java:447)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.discardCheckpoint(CheckpointCoordinator.java:1258)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failUnacknowledgedPendingCheckpointsFor(CheckpointCoordinator.java:918)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyExecutionChange(ExecutionGraph.java:1779)
at