[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232801#comment-14232801 ]

Micael Capitão commented on SPARK-3553:
---------------------------------------

I can confirm the weird behaviour running on HDFS too.
I have a Spark Streaming app with a fileStream on the directory
"hdfs:///user/altaia/cdrs/stream".

Having initially these files:
[1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz
[2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

When I start the application, they are processed. When I add a new file [7] by
renaming it to end with .gz, it is processed too.
[7] hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz

But right after [7], Spark Streaming reprocesses some of the initially
present files:
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

It does not repeat anything else in the next batches. When I then add yet
another file, it is not detected, and the stream stays stuck like that.
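For what it's worth, the duplicates are easy to spot with a per-batch record
count (a sketch, not the exact code from my app):

// Log the record count of every batch. Right after file [7] is picked up,
// one more non-empty batch appears carrying the old files [3]-[6]; after
// that every batch is empty.
stream.count().foreachRDD { (rdd, time) =>
  println(s"batch at $time: ${rdd.first()} records")
}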

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-3553
>                 URL: https://issues.apache.org/jira/browse/SPARK-3553
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.1
>         Environment: Ec2 cluster - YARN
>            Reporter: Ezequiel Bella
>              Labels: S3, Streaming, YARN
>
> We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors, each with 1 core and 
> 588 MB of RAM.
> The app streams from a directory in S3 which is constantly being written 
> to; this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
>   Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize 
> the way that Spark handles existing files when the process starts. We want 
> to process just the new files that are added after the process launches and 
> omit the existing ones. We configured a batch duration of 10 seconds.
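> For reference, textFileStream is essentially the same call with the types 
> fixed and no filter or newFilesOnly parameters exposed; roughly, it expands 
> to (a sketch reusing the same names):
> val text = ssc.fileStream[LongWritable, Text, TextInputFormat](
>   Settings.S3RequestsHost).map(_._2.toString)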
> The process goes fine while we add a small number of files to S3, let's say 
> 4 or 5. We can see in the streaming UI how the stages are executed 
> successfully in the executors, one for each file that is processed. But when 
> we try to add a larger number of files, we face a strange behavior: the 
> application starts streaming files that have already been streamed. 
> For example, I add 20 files to S3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but Spark starts repeating these batches 
> endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb


