In case this mystery has not been solved, DStream.print() essentially does
an RDD.take(10) on each RDD, which computes only a subset of the partitions
in the RDD. But collect() forces the evaluation of the whole RDD. Since you
are writing to JSON in the mapI() function, this could be the reason.
TD
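TD's point above, that take() evaluates only what it needs while collect() evaluates everything, can be illustrated without Spark using the laziness of plain java.util.stream pipelines. This is only an analogy, not Spark code; the class name, the counters, and the pipelines are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LazyEvalDemo {

    // Analogous to RDD.take(10): limit() short-circuits, so only the
    // first 10 elements ever pass through the (side-effecting) peek().
    static int evaluatedByLimit() {
        AtomicInteger evaluated = new AtomicInteger();
        IntStream.range(0, 100)
                 .peek(i -> evaluated.incrementAndGet()) // count each element computed
                 .limit(10)
                 .boxed()
                 .collect(Collectors.toList());
        return evaluated.get();
    }

    // Analogous to RDD.collect(): the terminal collect pulls every
    // element through the pipeline, so all 100 are evaluated.
    static int evaluatedByCollect() {
        AtomicInteger evaluated = new AtomicInteger();
        IntStream.range(0, 100)
                 .peek(i -> evaluated.incrementAndGet())
                 .boxed()
                 .collect(Collectors.toList());
        return evaluated.get();
    }

    public static void main(String[] args) {
        System.out.println(evaluatedByLimit());   // 10
        System.out.println(evaluatedByCollect()); // 100
    }
}
```

If the work per element is a side effect (like Emre's writes to an output directory), the take-style pipeline performs only a fraction of it, which is exactly the behavior being debated in this thread.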
Hello Imran,
(a) I know that all 20 files are processed when I use foreachRDD, because I
can see the processed files in the output directory. (My application logic
writes them to an output directory after they are processed, *but* that
writing operation does not happen in foreachRDD, below you
so if you only change this line:
https://gist.github.com/emres/0fb6de128baea099e741#file-mymoduledriver-java-L137
to
json.print()
it processes 16 files instead? I am totally perplexed. My only
suggestions to help debug are
(1) see what happens when you get rid of MyModuleWorker completely --
Hi Emre,
there shouldn't be any difference in which files get processed w/ print()
vs. foreachRDD(). In fact, if you look at the definition of print(), it is
just calling foreachRDD() underneath. So there is something else going on
here.
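Imran's observation, that print() is just a thin wrapper over foreachRDD(), can be sketched with a toy model. MiniDStream and everything in it are invented for illustration (real Spark's DStream.print() is implemented over actual RDDs and registers an output operation rather than iterating eagerly), but the structural point is the same:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;

// A toy stand-in for a DStream, only to show that print() can be
// expressed as a special case of foreachRDD().
public class MiniDStream {
    private final List<List<String>> batches; // each inner list stands in for one RDD

    public MiniDStream(List<List<String>> batches) {
        this.batches = batches;
    }

    // The general-purpose output operation: run an action on every "RDD".
    public void foreachRDD(Consumer<List<String>> action) {
        for (List<String> rdd : batches) {
            action.accept(rdd);
        }
    }

    // Helper so the "take the first 10" step is visible on its own.
    static List<String> takeTen(List<String> rdd) {
        return rdd.stream().limit(10).collect(Collectors.toList());
    }

    // print() expressed in terms of foreachRDD(): take up to 10 elements
    // of each batch and print them. In real Spark the rest of the batch
    // may never be computed, since take() only evaluates the partitions
    // it needs -- which is why print() and foreachRDD() can appear to
    // "process" different numbers of files.
    public void print() {
        foreachRDD(rdd -> takeTen(rdd).forEach(System.out::println));
    }

    public static void main(String[] args) {
        MiniDStream stream = new MiniDStream(Arrays.asList(
                Arrays.asList("file1.json", "file2.json"),
                Arrays.asList("file3.json")));
        stream.print(); // prints every element here, since each batch is small
    }
}
```

So the difference Emre sees cannot come from print() bypassing foreachRDD(); it can only come from how much of each batch the inner action forces.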
We need a little more information to figure out exactly
I've managed to solve this, but I still don't know exactly why my solution
works:
In my code I was trying to force Spark to output via:
jsonIn.print();
jsonIn being a JavaDStream<String>.
When I removed the code above and added the code below to force the output
operation, hence the
Sean,
In this case, I've been testing the code on my local machine and running
Spark locally, so all the log output was available in my terminal. And
I've used the .print() method to have an output operation, just to force
Spark to execute.
And I was not using foreachRDD, I was only using print()
How are you deciding whether files are processed or not? It doesn't seem
possible from this code. Maybe it just seems so.
On Feb 16, 2015 12:51 PM, Emre Sevinc emre.sev...@gmail.com wrote:
I've managed to solve this, but I still don't know exactly why my solution
works:
In my code I was
Hello Sean,
I did not understand your question very well, but what I do is check the
output directory (and I have various logger outputs at various stages
showing the contents of an input file being processed, the response from
the web service, etc.).
By the way, I've already solved my
Materialization shouldn't be relevant. The collect by itself doesn't let
you detect whether it happened. print() should print some results to the
console, but possibly on different machines, so it may not be a reliable
way to see what happened.
Yes I understand your real process uses foreachRDD and that's what
Instead of print() you should do jsonIn.count().print(). A straightforward
approach is to use foreachRDD :)
Thanks
Best Regards
On Mon, Feb 16, 2015 at 6:48 PM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello Sean,
I did not understand your question very well, but what I do is check the
Hello,
I have an application in Java that uses Spark Streaming 1.2.1 in the
following manner:
- Listen to the input directory.
- If a new file is copied to that input directory process it.
- Process: contact a RESTful web service (also running locally and
responsive), send the contents of the