Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-23 Thread Tathagata Das
In case this mystery has not been solved, DStream.print() essentially does a RDD.take(10) on each RDD, which computes only a subset of the partitions in the RDD. But collects forces the evaluation of all the RDDs. Since you are writing to json in the mapI() function, this could be the reason. TD

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-18 Thread Emre Sevinc
Hello Imran, (a) I know that all 20 files are processed when I use foreachRDD, because I can see the processed files in the output directory. (My application logic writes them to an output directory after they are processed, *but* that writing operation does not happen in foreachRDD, below you

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-18 Thread Imran Rashid
so if you only change this line: https://gist.github.com/emres/0fb6de128baea099e741#file-mymoduledriver-java-L137 to json.print() it processes 16 files instead? I am totally perplexed. My only suggestions to help debug are (1) see what happens when you get rid of MyModuleWorker completely --

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-17 Thread Imran Rashid
Hi Emre, there shouldn't be any difference in which files get processed w/ print() vs. foreachRDD(). In fact, if you look at the definition of print(), it is just calling foreachRDD() underneath. So there is something else going on here. We need a little more information to figure out exactly

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
I've managed to solve this, but I still don't know exactly why my solution works: In my code I was trying to force the Spark to output via: jsonIn.print(); jsonIn being a JavaDStreamString. When removed the code above, and added the code below to force the output operation, hence the

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
Sean, In this case, I've been testing the code on my local machine and using Spark locally, so I all the log output was available on my terminal. And I've used the .print() method to have an output operation, just to force Spark execute. And I was not using foreachRDD, I was only using print()

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Sean Owen
How are you deciding whether files are processed or not? It doesn't seem possible from this code. Maybe it just seems so. On Feb 16, 2015 12:51 PM, Emre Sevinc emre.sev...@gmail.com wrote: I've managed to solve this, but I still don't know exactly why my solution works: In my code I was

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
Hello Sean, I did not understand your question very well, but what I do is checking the output directory (and I have various logger outputs at various stages showing the contents of an input file being processed, the response from the web service, etc.). By the way, I've already solved my

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Sean Owen
Materialization shouldn't be relevant. The collect by itself doesn't let you detect whether it happened. Print should print some results to the console but on different machines, so may not be a reliable way to see what happened. Yes I understand your real process uses foreachRDD and that's what

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Akhil Das
Instead of print you should do jsonIn.count().print(). Straight forward approach is to use foreachRDD :) Thanks Best Regards On Mon, Feb 16, 2015 at 6:48 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Sean, I did not understand your question very well, but what I do is checking the

Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
Hello, I have an application in Java that uses Spark Streaming 1.2.1 in the following manner: - Listen to the input directory. - If a new file is copied to that input directory process it. - Process: contact a RESTful web service (running also locally and responsive), send the contents of the