Hi dachuan,

Getting top-k up and running using spark streaming is actually very easy
using Twitter's Algebird project. I gave a presentation recently at a spark
user meetup that wen through an example of using algebird in a spark
streaming job. You can find the video and slides here -
http://isurfsoftware.com/blog/2014/01/20/spark-meetup-monoids/

Once you get the general idea of using monoids for aggregation it will be
easy to drop in the
TopKMonoid<https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/TopKMonoid.scala>
from
Algebird to solve your problem.

As far as cluster configuration goes, at my old company Sharethrough, we
set Spark to course grained mode on apache mesos with spark config to limit
the number of CPUs per job. We also made some minor tweaks to JVM settings
for bigger heap size and reduced RDD cache time.

Cheers

Ryan Weald


On Fri, Jan 24, 2014 at 7:28 PM, dachuan <hdc1...@gmail.com> wrote:

> Hello, community,
>
> I have three questions about spark streaming.
>
> 1,
> I noticed that one streaming example (StatefulNetworkWordCount) has one
> interesting phenomenon:
> since this workload only prints the first 10 rows of the final RDD, this
> means if the data influx rate is fast enough (much faster than hand typing
> in keyboard), then the final RDD would have more than one partition, assume
> it's 2 partitions, but the second partition won't be computed at all
> because the first partition suffice to serve the first 10 rows. However,
> these two workloads must make checkpoint to that RDD. This would lead to a
> very time consuming checkpoint process because the checkpoint to the second
> partition can only start before it is computed. So, is this workload only
> designed for demonstration purpose, for example, only designed for one
> partition RDD?
>
> (I have attached a figure to illustrate what I've said, please tell me if
> mailing list doesn't welcome attachment.
> A short description about the experiment
> Hardware specs: 4 cores
> Software specs: spark local cluster, 5 executors (workers), each one has
> one core, each executor has 1G memory
> Data influx speed: 3MB/s
> Data source: one ServerSocket in local file
> Streaming App's name: StatefulNetworkWordCount
> Job generation frequency: one job per second
> Checkpoint time: once per 10s
> JobManager.numThreads = 2)
>
>
>
> (And another workload might have the same problem:
> PageViewStream's slidingPageCounts)
>
> 2,
> Does anybody have a Top-K wordcount streaming source code?
>
> 3,
> Can anybody share your real world streaming example? for example,
> including source code, and cluster configuration details?
>
> thanks,
> dachuan.
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>

Reply via email to