[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Ezequiel Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231737#comment-14231737
 ] 

Ezequiel Bella commented on SPARK-3553:
---

Please see if this post works for you:
http://stackoverflow.com/questions/25894405/spark-streaming-app-streams-files-that-have-already-been-streamed

Good luck.
easy
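
One defensive pattern for this kind of problem (not necessarily what the post
itself suggests) is to make the fileStream filter reject any file whose
modification time predates the stream's start, so that old files the stream
re-discovers are dropped instead of re-ingested. A minimal sketch, assuming the
Hadoop S3 filesystem reports usable modification times; the helper name is
hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val startTime = System.currentTimeMillis()

// Reject any file last modified before the stream started. In a real job,
// reuse the job's Hadoop configuration so S3 credentials are picked up.
def newerThanStart(f: Path): Boolean = {
  val fs = f.getFileSystem(new Configuration())
  fs.getFileStatus(f).getModificationTime >= startTime
}

// val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
//   Settings.S3RequestsHost, newerThanStart _, true)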




[jira] [Updated] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-09-17 Thread Ezequiel Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ezequiel Bella updated SPARK-3553:
--
Description: 
We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node 
and 2 data nodes. We submit the app with 11 executors, each with 1 core and 
588 MB of RAM.
The app streams from a directory in S3 that is constantly being written to; 
this is the line of code that achieves that:

val lines = ssc.fileStream[LongWritable, Text, 
TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)

The purpose of using fileStream instead of textFileStream is to customize the 
way Spark handles existing files when the process starts. We want to process 
only the new files that are added after the process launches and skip the 
existing ones. We configured a batch duration of 10 seconds.
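
A self-contained sketch of this setup (assumes Spark 1.0.x Streaming with the
new Hadoop API input format; the app name, S3 URI, and output action are
placeholders standing in for our actual values):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3StreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3StreamingApp") // placeholder name
    val ssc = new StreamingContext(conf, Seconds(10))       // 10-second batches

    // newFilesOnly = true: pick up only files added after the stream starts;
    // the filter accepts every path it is offered.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "s3n://my-bucket/requests/", // placeholder for Settings.S3RequestsHost
      (f: Path) => true,
      true)

    lines.map(_._2.toString).print() // placeholder output action

    ssc.start()
    ssc.awaitTermination()
  }
}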
The process goes fine while we add a small number of files to S3, say 4 or 5. 
We can see in the streaming UI how the stages are executed successfully on the 
executors, one for each file that is processed. But when we try to add a larger 
number of files, we see a strange behavior: the application starts streaming 
files that have already been streamed.
For example, I add 20 files to S3. The files are processed in 3 batches: the 
first batch processes 7 files, the second 8, and the third 5. No more files are 
added to S3 at this point, but Spark starts repeating these batches endlessly 
with the same files.
Any thoughts on what could be causing this?
Regards,
Easyb


  was:
We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node 
and 2 data nodes. We submit the app with 11 executors, each with 1 core and 
588 MB of RAM.
The app streams from a directory in S3 that is constantly being written to; 
this is the line of code that achieves that:

val lines = ssc.fileStream[LongWritable, Text, 
TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)

The purpose of using fileStream instead of textFileStream is to customize the 
way Spark handles existing files when the process starts. We want to process 
only the new files that are added after the process launches and skip the 
existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to S3, say 4 or 5. 
We can see in the streaming UI how the stages are executed successfully on the 
executors, one for each file that is processed. But when we try to add a larger 
number of files, we see a strange behavior: the application starts streaming 
files that have already been streamed.
For example, I add 20 files to S3. The files are processed in 3 batches: the 
first batch processes 7 files, the second 8, and the third 5. No more files are 
added to S3 at this point, but Spark starts repeating these batches endlessly.
Any thoughts on what could be causing this?
Regards,
Easyb








[jira] [Created] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-09-16 Thread Ezequiel Bella (JIRA)
Ezequiel Bella created SPARK-3553:
-

 Summary: Spark Streaming app streams files that have already been 
streamed in an endless loop
 Key: SPARK-3553
 URL: https://issues.apache.org/jira/browse/SPARK-3553
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.1
 Environment: Ec2 cluster - YARN
Reporter: Ezequiel Bella


We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node 
and 2 data nodes. We submit the app with 11 executors, each with 1 core and 
588 MB of RAM.
The app streams from a directory in S3 that is constantly being written to; 
this is the line of code that achieves that:

val lines = ssc.fileStream[LongWritable, Text, 
TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)

The purpose of using fileStream instead of textFileStream is to customize the 
way Spark handles existing files when the process starts. We want to process 
only the new files that are added after the process launches and skip the 
existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to S3, say 4 or 5. 
We can see in the streaming UI how the stages are executed successfully on the 
executors, one for each file that is processed. But when we try to add a larger 
number of files, we see a strange behavior: the application starts streaming 
files that have already been streamed.
For example, I add 20 files to S3. The files are processed in 3 batches: the 
first batch processes 7 files, the second 8, and the third 5. No more files are 
added to S3 at this point, but Spark starts repeating these batches endlessly.
Any thoughts on what could be causing this?
Regards,
Easyb




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org