[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Micael Capitão updated SPARK-3553:
----------------------------------
    Comment: was deleted

(was: I can confirm the same odd behaviour when running against HDFS. My Spark Streaming app uses a fileStream on the directory "hdfs:///user/altaia/cdrs/stream", which initially contains these files:

[1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz
[2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

When I start the application, all six are processed. When I add a new file [7] by renaming it so that it ends in .gz, it is processed too:

[7] hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz

But right after [7], Spark Streaming reprocesses some of the files that were present initially:

[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

Nothing else is repeated in the following batches. When yet another file is added, it is not detected at all, and the stream stays in that state.)
> Spark Streaming app streams files that have already been streamed in an
> endless loop
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-3553
>                 URL: https://issues.apache.org/jira/browse/SPARK-3553
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.1
>         Environment: EC2 cluster - YARN
>            Reporter: Ezequiel Bella
>              Labels: S3, Streaming, YARN
>
> We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node
> and 2 data nodes. We submit the app with 11 executors, each with 1 core and
> 588 MB of RAM.
> The app streams from a directory in S3 that is constantly being written to;
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize how
> Spark handles files that already exist when the process starts: we want to
> process only the files added after the process launches and skip the existing
> ones. We configured a batch duration of 10 seconds.
> The process runs fine while we add a small number of files to S3, say 4 or 5.
> We can see in the streaming UI how the stages are executed successfully on
> the executors, one for each file processed. But when we add a larger number
> of files, we see strange behavior: the application starts streaming files
> that have already been streamed.
> For example, if I add 20 files to S3, they are processed in 3 batches: the
> first batch processes 7 files, the second 8, and the third 5. No more files
> are added to S3 at this point, but Spark starts repeating these batches
> endlessly with the same files.
> Any thoughts on what could be causing this?
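For context, the single quoted line above expands to roughly the following minimal Spark Streaming application. This is a sketch, not the reporter's actual code: `Settings.S3RequestsHost` is the directory identifier named in the report, the 10-second batch duration is taken from the report, and everything else (app name, the `print()` action, the surrounding object) is assumed boilerplate. Note that `fileStream` requires an `InputFormat` from the new Hadoop API, so `TextInputFormat` here must be the `org.apache.hadoop.mapreduce.lib.input` variant.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("s3-file-stream")
    // 10-second batch duration, as described in the report
    val ssc = new StreamingContext(conf, Seconds(10))

    // Arguments: directory to monitor, a filter that accepts every path,
    // and newFilesOnly = true so that files already present in the
    // directory at startup are skipped and only newly added files are read.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      Settings.S3RequestsHost, (f: Path) => true, true)

    // The values of the (offset, line) pairs are the file contents.
    lines.map(_._2.toString).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The third argument (`newFilesOnly`) is the reason `fileStream` is used instead of `textFileStream`, which exposes no such option.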
> Regards,
> Easyb

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)