When using Samza to process streaming data (Kafka/Databus), we deploy to YARN clusters dedicated to Samza workloads. The configurations of machines in this cluster are roughly similar to what I provided.
When using Samza to process batch data (files on Hadoop <https://reviews.apache.org/r/52570/>), we deploy to our Hadoop clusters that are shared with other M-R workloads. I believe these clusters use spinning disks.

In the future, we plan to explore trade-offs in storage costs versus performance and will continue to share what we learn with the community.

Thanks,
Jagadish

On Tue, Jan 31, 2017 at 1:38 PM, Ankit Malhotra <[email protected]> wrote:

> Hi Jagadish,
>
> Thanks for your reply. Is it safe to assume that you are running similar
> machines in production YARN clusters where only SAMZA workloads run?
>
> Ankit
>
>
> On Jan 31, 2017, at 3:49 PM, Jagadish Venkatraman <[email protected]> wrote:
> >
> > Hi Ankit,
> >
> > We have benchmarked Samza on the following hardware configuration:
> >
> > - Processor: Intel Xeon 2.67 GHz processor (with 24 cores)
> > - 48GB of RAM
> > - 1Gbps Ethernet
> > - SSD: 1.65TB Fusion-IO SSD
> >
> > Please check out the perf numbers and the methodology here:
> > https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
> >
> > Thanks,
> > -- Jagadish V, Graduate Student, Department of Computer Science, Stanford University
