[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2015-09-11 Thread Tathagata Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741817#comment-14741817 ]

Tathagata Das commented on SPARK-3553:
--

Because of the absence of any activity, I am closing this issue. Please reopen 
it if this is still a problem in newer versions of Spark.

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but Spark starts repeating these batches 
> endlessly with the same files.
> Any thoughts what can be causing this?
> Regards,
> Easyb
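
For reference, a minimal, self-contained sketch of the setup described in the 
report above (Spark 1.x streaming API). The S3 path stands in for 
Settings.S3RequestsHost, which is not shown in the report, and the app name is 
made up; adjust both to your environment.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("file-stream-sketch") // hypothetical app name
    val ssc  = new StreamingContext(conf, Seconds(10))          // 10-second batches, as in the report

    // newFilesOnly = true: ignore files already present in the directory at startup
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "s3n://my-bucket/requests/",   // placeholder for Settings.S3RequestsHost
      (path: Path) => true,          // accept every file name
      true)

    // Count the lines seen in each batch, just to exercise the stream.
    lines.map(_._2.toString).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}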






[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2015-04-04 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395793#comment-14395793 ]

Sean Owen commented on SPARK-3553:
--

Checking through old issues -- I know this logic has been updated since 1.0 
and fixed in changes like SPARK-4518 and SPARK-2362. Any chance you know 
whether it is still an issue? It would not surprise me if it's fixed.

Otherwise, do you know if the file modification times were changed by your 
process?
Debug log output would help too, since the file stream explains in its debug 
messages which files it decides to keep.
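
One way to capture that output, assuming the log4j 1.x backend that Spark 1.x 
ships with by default, is to raise the log level of the file stream's class 
before starting the context; the exact logging setup depends on your deployment.

import org.apache.log4j.{Level, Logger}

// Surface FileInputDStream's file-selection decisions in the driver log.
Logger.getLogger("org.apache.spark.streaming.dstream.FileInputDStream")
  .setLevel(Level.DEBUG)

The equivalent log4j.properties entry is 
log4j.logger.org.apache.spark.streaming.dstream.FileInputDStream=DEBUG.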




[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-03 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232802#comment-14232802 ]

Micael Capitão commented on SPARK-3553:
---

I confirm the weird behaviour running on HDFS too.
I have the Spark Streaming app with a fileStream on the dir 
hdfs:///user/altaia/cdrs/stream. It is running on YARN and uses 
checkpointing. For now the application only reads the files and prints the 
number of lines read.

Having initially these files:
[1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz
[2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

When I start the application, they are processed. When I add a new file [7] by 
renaming it to end with .gz it is processed too.
[7] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz

But right after the [7], Spark Streaming reprocesses some of the initially 
present files:
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

It does not repeat anything else in the next batches. When I add yet another 
file, it is not detected, and things stay like that.







[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231697#comment-14231697 ]

Micael Capitão commented on SPARK-3553:
---

I'm having that same issue running Spark Streaming locally on my Windows 
machine.
I have something like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, 
fileFilter(_), newFilesOnly = false)

The cdrsDir initially has 2 files in it.

On startup, Spark processes the existing files in cdrsDir and keeps quiet 
after that. When I move another file to that dir it detects and processes 
it, but after that it processes the first two files again and then the third 
one, in an endless loop.
If I add a fourth one it keeps processing the first two files in the same batch 
and then processes the 3rd and the 4th files in another batch.

If I add more files it keeps repeating, but the behaviour gets weirder. It 
mixes, for example, the 3rd with the 5th in the same batch and the 4th with 
the 6th in another batch, and stops repeating the first two files.
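
The fileFilter referenced above is not shown in the comment; a hypothetical 
stand-in, based on the .gz filter mentioned later in the thread, could look 
like this:

import org.apache.hadoop.fs.Path

// Accept only completed .gz files; anything still being written lands under a
// temporary suffix (e.g. .gz.tmp) and is renamed once complete, so it never matches.
def fileFilter(path: Path): Boolean =
  path.getName.endsWith(".gz")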




[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Ezequiel Bella (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231737#comment-14231737 ]

Ezequiel Bella commented on SPARK-3553:
---

Please see if this post works for you:
http://stackoverflow.com/questions/25894405/spark-streaming-app-streams-files-that-have-already-been-streamed

Good luck.
easy




[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Micael Capitão (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231815#comment-14231815 ]

Micael Capitão commented on SPARK-3553:
---

I've already seen that post. It didn't work for me...
I have a filter for .gz files and, like you, I've tried putting the files in the 
dir as .gz.tmp and then renaming them to remove the .tmp. The behaviour, in my 
case, is like the one I've described before.

I'm going to check if it behaves decently using HDFS for the stream data dir...
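
A sketch of the write-elsewhere-then-rename pattern being discussed, using the 
Hadoop FileSystem API; the staging and target paths are illustrative. On HDFS a 
rename within the same file system is atomic, so the stream never observes a 
half-written file, whereas S3 has no true rename (objects are copied), which 
may be part of why the workaround is less reliable there.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
val staging    = new Path("hdfs:///user/altaia/cdrs/incoming/part-0001.txt.gz") // hypothetical staging area
val target     = new Path("hdfs:///user/altaia/cdrs/stream/part-0001.txt.gz")   // monitored directory

val fs = staging.getFileSystem(hadoopConf)
// Move the finished file into the monitored directory in one atomic step.
fs.rename(staging, target)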
