Re: Review Request 53282: SAMZA-1043: Samza performance improvements

Chris Pettitt Wed, 02 Nov 2016 12:01:49 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53282/#review154611
-----------------------------------------------------------



Fix it, then Ship it!




Can you get the numbers with all fixes but with instrumentation turned on? I 
don't think we'd ever run without the instrumentation on at LI. If 
instrumentation is a significant problem we should look at alternatives to 
tracking it or consider changing the granularity.


samza-core/src/main/java/org/apache/samza/util/TimerClock.java (line 25)
<https://reviews.apache.org/r/53282/#comment224228>

    We have a HighResolutionClock that does this. I think you can use it here.


- Chris Pettitt


On Nov. 2, 2016, 5:56 p.m., Xinyu Liu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53282/
> -----------------------------------------------------------
> 
> (Updated Nov. 2, 2016, 5:56 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Jake Maes, and Navina Ramesh.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> In the recent experiments of samza batch job (consuming hdfs data on hadoop), 
> the results are subpar to map/reduce and spark. By looking at the metrics 
> closely, we found two basic problems:
> 
> 1) Not enough data to process. This is spotted as the unprocessed message 
> queue length was zero for quite a lot of times.
> 
> 2) Not process fast enough. We found samza performed closely in both median 
> size records (100B) and small record (10B), while spark can scale very well 
> in the small record (over 1M/s).
> 
> The first problem is solved by increasing the buffer size. This ticket is to 
> address the second problem, which contains three major improvements:
> 
> - Option to turn off timer metrics calculation: one of the main time spent in 
> samza processing turns out to be just keeping the timer metrics. While it is 
> useful in debugging, it becomes a bottleneck when running a stable job with 
> high performance. In my testing job which consumes 8M mock data, it took 30 
> secs with timer metrics on. After turning it off, it only took 14 secs.
> 
> - Java coding improvements: The AsyncRunLoop code can be further optimized 
> for efficiency. Some of the thread-safe data structure I am using is not for 
> optimal performance (Collections.synchronizedSet). I switched to use 
> CopyOnWriteArraySet, which has far better performance due to more reads and 
> small set size.
> 
> - Specific handling for in-order processing improvements: AsyncRunLoop 
> handles the callbacks regardless of whether it's in-order or out-of-order 
> (max concurrency > 1), which incurs quite some cost. By simplying the logic 
> for in-order handling, the performance gains.
> 
> After all three improvements, my test job with mock input (8M messages) can 
> be processed within 8 sec (down from org 30 secs), so it's 1M/s for one cpu 
> core.
> 
> For the performance benchmark jobs running in Hadoop, we also see a 4 times 
> improvement with all the fixes above. Please take a look at the attached 
> spreedsheet (see the numbers with fix(turn off the timing metrics) and 
> fix2(all three together).
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/java/org/apache/samza/container/RunLoopFactory.java 
> 609a956a1f2fa97419c2f66fe2fb6876aaaeecd0 
>   samza-core/src/main/java/org/apache/samza/task/AsyncRunLoop.java 
> 8fac8155c7f64e67d4a39ec6943f98da1e1d63d9 
>   samza-core/src/main/java/org/apache/samza/task/CoordinatorRequests.java 
> 052b3b91ec609ca6288662cfa2d3e71b0273d020 
>   samza-core/src/main/java/org/apache/samza/task/TaskCallbackImpl.java 
> 9b700998d2af040c6734289f7f28bbd78c36bd2c 
>   samza-core/src/main/java/org/apache/samza/task/TaskCallbackManager.java 
> 132cf59eb593524a4cac134aeceeeb37a4c74b1f 
>   samza-core/src/main/java/org/apache/samza/util/TimerClock.java PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/util/Utils.java 
> 472e0a59d5aa992b136292c8a3347c311e2cd606 
>   samza-core/src/main/scala/org/apache/samza/config/MetricsConfig.scala 
> c3fd8bfb2e16a4c5146d34682d04cb1d4e9bbe72 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> e0468ee89c89fd720834461771ebb36475475bcb 
>   samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala 
> e2aed5b1c2e77a914268963b21809380972037b6 
>   samza-core/src/main/scala/org/apache/samza/util/Util.scala 
> c4836f202f7eda1d4e71eac94fd48e46207b0316 
>   samza-core/src/test/java/org/apache/samza/task/TestAsyncRunLoop.java 
> 6000ffaf2b8723d48a72e58b571f242a42dc8128 
>   samza-core/src/test/java/org/apache/samza/task/TestAsyncStreamAdapter.java 
> 99e1e18bcfa6bca1e275d8ae030a77ff8d70a4eb 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskCallbackImpl.java 
> f1dbf35165e6ddfc02e3522887c25d78a4bbfcd7 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskCallbackManager.java 
> d7110f34a9eae6e9ffc15b4982bfbb180da88b2d 
>   
> samza-kv/src/main/scala/org/apache/samza/storage/kv/BaseKeyValueStorageEngineFactory.scala
>  c975893a42689732c39c39600fecacee843bf9d6 
> 
> Diff: https://reviews.apache.org/r/53282/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> 
> Tested in the yarn hadoop cluster with different kinds of jobs.
> 
> 
> File Attachments
> ----------------
> 
> hdfs performance
>   
> https://reviews.apache.org/media/uploaded/files/2016/11/02/c05007fe-2fdd-4c8c-b5ef-b7862dea13b2__hdfs_perf.png
> 
> 
> Thanks,
> 
> Xinyu Liu
> 
>

Re: Review Request 53282: SAMZA-1043: Samza performance improvements

Reply via email to