[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231737#comment-14231737 ]
Ezequiel Bella commented on SPARK-3553:
---------------------------------------

Please see if this post works for you:
http://stackoverflow.com/questions/25894405/spark-streaming-app-streams-files-that-have-already-been-streamed
Good luck.
easy

> Spark Streaming app streams files that have already been streamed in an
> endless loop
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-3553
>                 URL: https://issues.apache.org/jira/browse/SPARK-3553
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.1
>        Environment: EC2 cluster - YARN
>            Reporter: Ezequiel Bella
>              Labels: S3, Streaming, YARN
>
> We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node
> and 2 data nodes. We submit the app with 11 executors, each with 1 core and
> 588 MB of RAM.
> The app streams from a directory in S3 that is constantly being written to;
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize how
> Spark handles files that already exist when the process starts: we want to
> process only the files added after the process launches and skip the existing
> ones. We configured a batch duration of 10 seconds.
> The process runs fine while we add a small number of files to S3, say 4
> or 5. We can see in the streaming UI how the stages execute successfully on
> the executors, one for each file processed. But when we add a larger number
> of files, we see strange behavior: the application starts streaming files
> that have already been streamed.
> For example, I add 20 files to S3. The files are processed in 3 batches: the
> first batch processes 7 files, the second 8, and the third 5.
> No more files are added to S3 at this point, but Spark starts repeating these
> batches endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
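The `newFilesOnly = true` flag in the quoted `fileStream` call boils down to selecting, on each batch, only files whose modification time is at or past a remembered threshold and which have not been picked up before. Below is a minimal, self-contained Scala sketch of that selection idea, to make the expected (non-repeating) behavior concrete. All names here (`NewFilesOnly`, `FileInfo`, `selectNewFiles`) are hypothetical illustrations, not Spark's actual `FileInputDStream` internals, and the S3 paths and timestamps are made up:

```scala
// Illustrative sketch (not Spark code): pick only files at or past a
// remembered modification-time threshold that have not been seen before,
// then advance the threshold and the seen set for the next batch.
object NewFilesOnly {
  final case class FileInfo(path: String, modTime: Long)

  def selectNewFiles(candidates: Seq[FileInfo],
                     seen: Set[String],
                     minModTime: Long): (Seq[FileInfo], Set[String], Long) = {
    // Keep files new enough and not already processed.
    val fresh = candidates.filter(f => f.modTime >= minModTime && !seen.contains(f.path))
    // Remember what we just picked so a later listing cannot re-select it.
    val newSeen = seen ++ fresh.map(_.path)
    // Move the threshold forward to the newest file we accepted.
    val newThreshold = if (fresh.isEmpty) minModTime else fresh.map(_.modTime).max
    (fresh, newSeen, newThreshold)
  }

  def main(args: Array[String]): Unit = {
    val listing = Seq(FileInfo("s3://bucket/a", 100L), FileInfo("s3://bucket/b", 110L))

    // First batch: both files are new and get selected.
    val (picked1, seen1, t1) = selectNewFiles(listing, Set.empty, 0L)
    assert(picked1.size == 2)

    // Next batch lists the same files: none should be re-selected.
    val (picked2, _, _) = selectNewFiles(listing, seen1, t1)
    assert(picked2.isEmpty)
    println("ok")
  }
}
```

If the behavior in the report were working as intended, the second listing would select nothing, as in the sketch; re-selection of an identical listing is exactly the endless-loop symptom described above.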