[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231737#comment-14231737 ]

Ezequiel Bella commented on SPARK-3553:
---------------------------------------

Please see if this post works for you:
http://stackoverflow.com/questions/25894405/spark-streaming-app-streams-files-that-have-already-been-streamed
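
For reference, a minimal sketch of the setup described in the quoted report, assuming Spark Streaming 1.x. The bucket path, app name, and output action are placeholders (the report uses Settings.S3RequestsHost); newFilesOnly = true is the flag meant to skip files that already existed when the context started:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3StreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("s3-file-stream")
    // 10-second batch duration, as in the report
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder path; the report streams from an S3 directory that is
    // written to continuously. newFilesOnly = true asks the file stream
    // to ignore files present at startup and pick up only new arrivals.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "s3n://my-bucket/requests/",
      (path: Path) => true, // accept every newly found file
      newFilesOnly = true
    ).map(_._2.toString)

    // Placeholder output action so each batch is materialized
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that fileStream decides which files are "new" from their modification times, so if the S3 object timestamps drift relative to the driver's clock, old files can be picked up again; that is worth checking alongside the post above.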

Good luck.
Easyb

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-3553
>                 URL: https://issues.apache.org/jira/browse/SPARK-3553
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.1
>         Environment: Ec2 cluster - YARN
>            Reporter: Ezequiel Bella
>              Labels: S3, Streaming, YARN
>
> We have a Spark Streaming app deployed on a YARN EC2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors, each with 1 core and 
> 588 MB of RAM.
> The app streams from a directory in S3 that is constantly being written to; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
>   Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize the 
> way Spark handles existing files when the process starts: we want to process 
> only the files added after the process launches and skip the ones that 
> already exist. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to S3, say 4 or 5. 
> In the streaming UI we can see the stages executed successfully on the 
> executors, one for each file processed. But when we add a larger number of 
> files, we see strange behavior: the application starts streaming files that 
> have already been streamed.
> For example, I add 20 files to S3. The files are processed in 3 batches: the 
> first batch processes 7 files, the second 8, and the third 5. No more files 
> are added to S3 at this point, but Spark starts repeating these batches 
> endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
