The streaming program contains the following main stages:

1. Receive data from Kafka.
2. Preprocess the data. These are all map and filter stages.
3. Group by a field.
4. Process the groupBy results using map. Inside this processing, I use collect and count.
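For illustration, the four stages above might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual program: `zkQuorum`, `parse`, and the `key` field are made-up placeholders, and it assumes the Spark Streaming Kafka receiver API of that era (`KafkaUtils.createStream`). The point is that step 3 (`groupByKey`) is what produces the `combineByKey at ShuffledDStream.scala:42` shuffle stage discussed below.

```scala
// Hedged sketch of the pipeline described above (Spark Streaming ~1.0.x).
// zkQuorum, parse, and r.key are illustrative placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

case class Record(key: String, value: String)
def parse(line: String): Record = { val p = line.split(","); Record(p(0), p(1)) }

val conf = new SparkConf().setAppName("KafkaGroupBy")
val ssc  = new StreamingContext(conf, Seconds(10))

// 1. Receive data from Kafka (values are the message bodies).
val zkQuorum = "zk-host:2181" // placeholder
val raw = KafkaUtils.createStream(ssc, zkQuorum, "my-group", Map("my-topic" -> 1))
           .map(_._2)

// 2. Preprocessing: map and filter stages.
val records = raw.filter(_.nonEmpty).map(parse)

// 3. Group by a field -- this triggers the shuffle that shows up as
//    "combineByKey at ShuffledDStream.scala:42" in the UI.
val grouped = records.map(r => (r.key, r)).groupByKey()

// 4. Process the grouped results with map (e.g. count per key).
val results = grouped.map { case (k, vs) => (k, vs.size) }
results.print()
```

One thing visible in a sketch like this: `groupByKey` takes an optional `numPartitions` argument, so the parallelism of the shuffle stage (the "two executors" observed below) depends on the partitioning chosen there.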
Thanks!

Bill


On Tue, Jul 22, 2014 at 10:05 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Can you give an idea of the streaming program? What are the rest of the
> transformations you are doing on the input streams?
>
>
> On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am currently running a Spark Streaming program, which consumes data
>> from Kafka and does a group-by operation on the data. I am trying to
>> optimize the running time of the program because it looks slow to me.
>> The stage named:
>>
>> *combineByKey at ShuffledDStream.scala:42*
>>
>> always takes the longest running time. If I open this stage, I only see
>> two executors on it. Does anyone have an idea what this stage does and
>> how to increase its speed? Thanks!
>>
>> Bill