-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53282/#review155478
-----------------------------------------------------------


Ship it!




Looks pretty good, thanks!


samza-core/src/main/java/org/apache/samza/util/ThrottlingExecutor.java (line 
134)
<https://reviews.apache.org/r/53282/#comment225446>

    Maybe make private if not being tested.



samza-core/src/main/scala/org/apache/samza/config/MetricsConfig.scala (line 23)
<https://reviews.apache.org/r/53282/#comment225447>

    Unused import?



samza-core/src/main/scala/org/apache/samza/config/MetricsConfig.scala (line 23)
<https://reviews.apache.org/r/53282/#comment225448>

    



samza-core/src/main/scala/org/apache/samza/config/MetricsConfig.scala (line 23)
<https://reviews.apache.org/r/53282/#comment225449>

    



samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala (line 
155)
<https://reviews.apache.org/r/53282/#comment225452>

    



samza-core/src/main/scala/org/apache/samza/util/Util.scala (line 393)
<https://reviews.apache.org/r/53282/#comment225450>

    



samza-core/src/main/scala/org/apache/samza/util/Util.scala (line 393)
<https://reviews.apache.org/r/53282/#comment225451>

    s/TimerClock/HighResolutionClock


- Prateek Maheshwari


On Nov. 9, 2016, 11:10 a.m., Xinyu Liu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53282/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2016, 11:10 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Jake Maes, and Navina Ramesh.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> In the recent experiments of samza batch job (consuming hdfs data on hadoop), 
> the results are subpar to map/reduce and spark. By looking at the metrics 
> closely, we found two basic problems:
> 
> 1) Not enough data to process. This is spotted as the unprocessed message 
> queue length was zero for quite a lot of times.
> 
> 2) Not process fast enough. We found samza performed closely in both median 
> size records (100B) and small record (10B), while spark can scale very well 
> in the small record (over 1M/s).
> 
> The first problem is solved by increasing the buffer size. This ticket is to 
> address the second problem, which contains three major improvements:
> 
> - Option to turn off timer metrics calculation: one of the main time spent in 
> samza processing turns out to be just keeping the timer metrics. While it is 
> useful in debugging, it becomes a bottleneck when running a stable job with 
> high performance. In my testing job which consumes 8M mock data, it took 30 
> secs with timer metrics on. After turning it off, it only took 14 secs.
> 
> - Java coding improvements: The AsyncRunLoop code can be further optimized 
> for efficiency. Some of the thread-safe data structure I am using is not for 
> optimal performance (Collections.synchronizedSet). I switched to use 
> CopyOnWriteArraySet, which has far better performance due to more reads and 
> small set size.
> 
> - Specific handling for in-order processing improvements: AsyncRunLoop 
> handles the callbacks regardless of whether it's in-order or out-of-order 
> (max concurrency > 1), which incurs quite some cost. By simplying the logic 
> for in-order handling, the performance gains.
> 
> After all three improvements, my test job with mock input (8M messages) can 
> be processed within 8 sec (down from org 30 secs), so it's 1M/s for one cpu 
> core.
> 
> For the performance benchmark jobs running in Hadoop, we also see a 4 times 
> improvement with all the fixes above. Please take a look at the attached 
> spreedsheet (see the numbers with fix(turn off the timing metrics) and 
> fix2(all three together).
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/java/org/apache/samza/container/RunLoopFactory.java 
> 609a956a1f2fa97419c2f66fe2fb6876aaaeecd0 
>   samza-core/src/main/java/org/apache/samza/task/AsyncRunLoop.java 
> 8fac8155c7f64e67d4a39ec6943f98da1e1d63d9 
>   samza-core/src/main/java/org/apache/samza/task/CoordinatorRequests.java 
> 052b3b91ec609ca6288662cfa2d3e71b0273d020 
>   samza-core/src/main/java/org/apache/samza/task/TaskCallbackImpl.java 
> 9b700998d2af040c6734289f7f28bbd78c36bd2c 
>   samza-core/src/main/java/org/apache/samza/task/TaskCallbackManager.java 
> 132cf59eb593524a4cac134aeceeeb37a4c74b1f 
>   samza-core/src/main/java/org/apache/samza/util/HighResolutionClock.java 
> 69ba441ed087305dfe4e1272b00fad67b644e13f 
>   
> samza-core/src/main/java/org/apache/samza/util/SystemHighResolutionClock.java 
> 2e65b603bc959273e679eba3ea9f89c7c0c4664d 
>   samza-core/src/main/java/org/apache/samza/util/ThrottlingExecutor.java 
> d1298fc40680e5ad4db7067c9ef02f0266dffc1d 
>   samza-core/src/main/java/org/apache/samza/util/Utils.java 
> 472e0a59d5aa992b136292c8a3347c311e2cd606 
>   samza-core/src/main/scala/org/apache/samza/config/MetricsConfig.scala 
> c3fd8bfb2e16a4c5146d34682d04cb1d4e9bbe72 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> e0468ee89c89fd720834461771ebb36475475bcb 
>   samza-core/src/main/scala/org/apache/samza/system/SystemConsumers.scala 
> e2aed5b1c2e77a914268963b21809380972037b6 
>   samza-core/src/main/scala/org/apache/samza/util/Util.scala 
> c4836f202f7eda1d4e71eac94fd48e46207b0316 
>   samza-core/src/test/java/org/apache/samza/task/TestAsyncRunLoop.java 
> 6000ffaf2b8723d48a72e58b571f242a42dc8128 
>   samza-core/src/test/java/org/apache/samza/task/TestAsyncStreamAdapter.java 
> 99e1e18bcfa6bca1e275d8ae030a77ff8d70a4eb 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskCallbackImpl.java 
> f1dbf35165e6ddfc02e3522887c25d78a4bbfcd7 
>   samza-core/src/test/java/org/apache/samza/task/TestTaskCallbackManager.java 
> d7110f34a9eae6e9ffc15b4982bfbb180da88b2d 
>   samza-core/src/test/java/org/apache/samza/util/TestThrottlingExecutor.java 
> 0276e6b17bc9ad9413611189b2e9ff2b8793694c 
>   
> samza-kv/src/main/scala/org/apache/samza/storage/kv/BaseKeyValueStorageEngineFactory.scala
>  c975893a42689732c39c39600fecacee843bf9d6 
> 
> Diff: https://reviews.apache.org/r/53282/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> 
> Tested in the yarn hadoop cluster with different kinds of jobs.
> 
> 
> File Attachments
> ----------------
> 
> hdfs performance
>   
> https://reviews.apache.org/media/uploaded/files/2016/11/02/c05007fe-2fdd-4c8c-b5ef-b7862dea13b2__hdfs_perf.png
> 
> 
> Thanks,
> 
> Xinyu Liu
> 
>

Reply via email to