[GitHub] [incubator-uniffle] jerqi commented on issue #79: [Improvement] Process don't exit if exec start script using ansible

2022-07-27 Thread GitBox
jerqi commented on issue #79: URL: https://github.com/apache/incubator-uniffle/issues/79#issuecomment-1197624531 Do you want to raise a pr? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[GitHub] [incubator-uniffle] jerqi commented on issue #76: [Improvement] Disallow sendShuffleData if requireBufferId expired

2022-07-27 Thread GitBox
jerqi commented on issue #76: URL: https://github.com/apache/incubator-uniffle/issues/76#issuecomment-1197623471 Could you share your solution? We can discuss first.

[GitHub] [incubator-uniffle] colinmjj commented on issue #95: [Performance Optimization] Multiple channels when getting shuffle data in client side

2022-07-27 Thread GitBox
colinmjj commented on issue #95: URL: https://github.com/apache/incubator-uniffle/issues/95#issuecomment-1197620579 > > @zuston The current implementation limits the number of connections because we don't want too many connections established between the client and shuffle server. We also plan to i

[GitHub] [incubator-uniffle] jerqi commented on issue #95: [Performance Optimization] Multiple channels when getting shuffle data in client side

2022-07-27 Thread GitBox
jerqi commented on issue #95: URL: https://github.com/apache/incubator-uniffle/issues/95#issuecomment-1197619293 > @zuston The current implementation limits the number of connections because we don't want too many connections established between the client and shuffle server. We also plan to improv

[GitHub] [incubator-uniffle] zuston commented on issue #95: [Performance Optimization] Multiple channels when getting shuffle data in client side

2022-07-27 Thread GitBox
zuston commented on issue #95: URL: https://github.com/apache/incubator-uniffle/issues/95#issuecomment-1197618109 Glad to hear this. The flame graph shows that the extra memory copy costs too much time on the shuffle server side. If using Netty to directly manipulate shuffle data by

[GitHub] [incubator-uniffle] colinmjj commented on issue #95: [Performance Optimization] Multiple channels when getting shuffle data in client side

2022-07-27 Thread GitBox
colinmjj commented on issue #95: URL: https://github.com/apache/incubator-uniffle/issues/95#issuecomment-1197616175 @zuston The current implementation limits the number of connections because we don't want too many connections established between the client and shuffle server. We also plan to imp

[GitHub] [incubator-uniffle] colinmjj commented on issue #81: Benchmark: ESS and Uniffle

2022-07-27 Thread GitBox
colinmjj commented on issue #81: URL: https://github.com/apache/incubator-uniffle/issues/81#issuecomment-1197613441 @zuston The benchmark in the blog is based on Spark 2.4.6. If ESS has no random disk IO problem, Uniffle is expected to have **poorer performance** than ESS

[GitHub] [incubator-uniffle] zuston opened a new issue, #95: [Performance Optimization] Multiple channels when getting shuffle data in client side

2022-07-27 Thread GitBox
zuston opened a new issue, #95: URL: https://github.com/apache/incubator-uniffle/issues/95 ### Motivation Currently the executor uses only a single TCP connection to a given shuffle server, so when multiple tasks run concurrently they share this channel. Maybe it will

[GitHub] [incubator-uniffle] zuston commented on issue #92: [Performance Optimization] The huge performance drop due to the method of getBlockIdsByPartitionId

2022-07-27 Thread GitBox
zuston commented on issue #92: URL: https://github.com/apache/incubator-uniffle/issues/92#issuecomment-1197607466 Got it. If we have a better design for this, I think it will achieve better performance.

[GitHub] [incubator-uniffle] zuston commented on issue #81: Benchmark: ESS and Uniffle

2022-07-27 Thread GitBox
zuston commented on issue #81: URL: https://github.com/apache/incubator-uniffle/issues/81#issuecomment-1197606473 Attached is the Google doc with the test results from our internal cluster: https://docs.google.com/document/d/1nmHMBEaa4lHfgQkdlYokTtXt12F5vRJiQ2MCmTQbW6k/edit#heading=h.b1udpb9l28w7

[GitHub] [incubator-uniffle] jerqi commented on issue #77: [Feature Request] Support deploy multiple shuffle servers in a single node

2022-07-27 Thread GitBox
jerqi commented on issue #77: URL: https://github.com/apache/incubator-uniffle/issues/77#issuecomment-1197605715 > K8S Deployment can't solve this issue. @xianjingfeng I think you can continue this.

[GitHub] [incubator-uniffle] colinmjj commented on issue #77: [Feature Request] Support deploy multiple shuffle servers in a single node

2022-07-27 Thread GitBox
colinmjj commented on issue #77: URL: https://github.com/apache/incubator-uniffle/issues/77#issuecomment-1197603444 > > I'm also curious why we need to modify start script? > > start script will process existence Yes, this is a limitation of the current implementation

[GitHub] [incubator-uniffle] colinmjj commented on issue #81: Benchmark: ESS and Uniffle

2022-07-27 Thread GitBox
colinmjj commented on issue #81: URL: https://github.com/apache/incubator-uniffle/issues/81#issuecomment-1197597392 @zuston You can refer to this [blog](https://cloud.tencent.com/developer/article/1943179) for the benchmark details.

[GitHub] [incubator-uniffle] colinmjj commented on issue #92: [Performance Optimization] The huge performance drop due to the method of getBlockIdsByPartitionId

2022-07-27 Thread GitBox
colinmjj commented on issue #92: URL: https://github.com/apache/incubator-uniffle/issues/92#issuecomment-1197595382 The performance problem of `getBlockIdsByPartitionId` is a known issue. With the current design, blockId should be stored in the shuffle server to support features like block filte

[GitHub] [incubator-uniffle] jerqi closed issue #90: [Performance Optimization] Improve the speed of writing index file in shuffle server

2022-07-27 Thread GitBox
jerqi closed issue #90: [Performance Optimization] Improve the speed of writing index file in shuffle server URL: https://github.com/apache/incubator-uniffle/issues/90

[GitHub] [incubator-uniffle] jerqi commented on issue #90: [Performance Optimization] Improve the speed of writing index file in shuffle server

2022-07-27 Thread GitBox
jerqi commented on issue #90: URL: https://github.com/apache/incubator-uniffle/issues/90#issuecomment-1197592448 solved by #91

[GitHub] [incubator-uniffle] zuston commented on issue #92: [Performance Optimization] The huge performance drop due to the method of getBlockIdsByPartitionId

2022-07-27 Thread GitBox
zuston commented on issue #92: URL: https://github.com/apache/incubator-uniffle/issues/92#issuecomment-1197577668 @colinmjj

[GitHub] [incubator-uniffle] jerqi commented on issue #78: [Bug] The metric `grpc_open` sometime incorrect

2022-07-27 Thread GitBox
jerqi commented on issue #78: URL: https://github.com/apache/incubator-uniffle/issues/78#issuecomment-1196899262 cc @colinmjj, do you remember our flaky metric test? I guess it's caused by this issue.

[GitHub] [incubator-uniffle] jerqi commented on issue #80: [Feature Request] Support shuffle server decommissioned

2022-07-27 Thread GitBox
jerqi commented on issue #80: URL: https://github.com/apache/incubator-uniffle/issues/80#issuecomment-119698 If we want to add an interface to control the shuffle server's behavior, we should have a complete design, and we think we need detailed discussions. We once had a similar idea in

[GitHub] [incubator-uniffle] jerqi commented on issue #80: [Feature Request] Support shuffle server decommissioned

2022-07-27 Thread GitBox
jerqi commented on issue #80: URL: https://github.com/apache/incubator-uniffle/issues/80#issuecomment-1196808782 Could you write a design doc (using Google Docs)? This issue is a little complex.

[GitHub] [incubator-uniffle] jerqi commented on issue #78: [Bug] The metric `grpc_open` sometime incorrect

2022-07-27 Thread GitBox
jerqi commented on issue #78: URL: https://github.com/apache/incubator-uniffle/issues/78#issuecomment-1196688153 > No logs, we just found this phenomenon. Maybe `org.apache.uniffle.common.rpc.MonitoringServerCall#close` is sometimes not called. I tried to call `decCounter` in `MonitoringServer

[GitHub] [incubator-uniffle] jerqi commented on issue #77: [Feature Request] Support deploy multiple shuffle servers in a single node

2022-07-27 Thread GitBox
jerqi commented on issue #77: URL: https://github.com/apache/incubator-uniffle/issues/77#issuecomment-1196682268 K8S Deployment can't solve this issue.

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #79: [Improvement] Process don't exit if exec start script using ansible

2022-07-27 Thread GitBox
xianjingfeng commented on issue #79: URL: https://github.com/apache/incubator-uniffle/issues/79#issuecomment-1196681674 Yes, by using nohup

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #78: [Bug] The metric `grpc_open` sometime incorrect

2022-07-27 Thread GitBox
xianjingfeng commented on issue #78: URL: https://github.com/apache/incubator-uniffle/issues/78#issuecomment-1196679903 No logs, we just found this phenomenon. Maybe `org.apache.uniffle.common.rpc.MonitoringServerCall#close` is sometimes not called. I tried to call `decCounter` in `Monitorin

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #77: [Feature Request] Support deploy multiple shuffle servers in a single node

2022-07-27 Thread GitBox
xianjingfeng commented on issue #77: URL: https://github.com/apache/incubator-uniffle/issues/77#issuecomment-119659 If our plan is to deploy on k8s, should this issue be closed?

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #77: [Feature Request] Support deploy multiple shuffle servers in a single node

2022-07-27 Thread GitBox
xianjingfeng commented on issue #77: URL: https://github.com/apache/incubator-uniffle/issues/77#issuecomment-1196664044 > I'm also curious why we need to modify start script? start script will process existence

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #76: [Improvement] Disallow sendShuffleData if requireBufferId expired

2022-07-27 Thread GitBox
xianjingfeng commented on issue #76: URL: https://github.com/apache/incubator-uniffle/issues/76#issuecomment-1196662502 Yes, it is being tested in our production environment. I will watch it for a while. If it's OK, I will create a PR.

[GitHub] [incubator-uniffle] xianjingfeng commented on issue #80: [Feature Request] Support shuffle server decommissioned

2022-07-27 Thread GitBox
xianjingfeng commented on issue #80: URL: https://github.com/apache/incubator-uniffle/issues/80#issuecomment-1196656326 > I understand that you need a `rolling upgrade` feature. In our plan, we want to accomplish this feature by k8s operator. For the standalone mode, we don't have the plan

[GitHub] [incubator-uniffle] zuston opened a new issue, #92: [Performance Optimization] The huge performance drop due to the method of getBlockIdsByPartitionId

2022-07-27 Thread GitBox
zuston opened a new issue, #92: URL: https://github.com/apache/incubator-uniffle/issues/92 ### Background I found that when getting the shuffle result, the flame graph shows that the method `getBlockIdsByPartitionId` takes too much time. ![reliao_img_1658922962790](https://user-images.githu

[GitHub] [incubator-uniffle] colinmjj commented on issue #90: [Performance Optimization] Improve the speed of writing index file in shuffle server

2022-07-27 Thread GitBox
colinmjj commented on issue #90: URL: https://github.com/apache/incubator-uniffle/issues/90#issuecomment-1196587758 @zuston good catch

[GitHub] [incubator-uniffle] zuston opened a new issue, #90: [Performance Optimization] Improve the speed of writing index file in shuffle server

2022-07-27 Thread GitBox
zuston opened a new issue, #90: URL: https://github.com/apache/incubator-uniffle/issues/90 ### Motivation When I tested Uniffle performance, I found a huge performance drop due to the low speed of writing the index file. Flame graph attached: ![reliao_img_1658917352873](https://user-ima

[GitHub] [incubator-uniffle] smallzhongfeng closed issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng closed issue #89: [Improvement] Add a load policy based on disk performance URL: https://github.com/apache/incubator-uniffle/issues/89

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196532415 OK.

[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
jerqi commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196522252 > Maybe you are right, but I think we should open it up so that we can verify this situation in more production environments. You can turn it on when you deploy the shuffl

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196481405 Maybe you are right, but I think we should open it up so that we can verify this situation in more production environments.

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196480313 But I think we should open it up so that we can verify this situation in more production environments.

[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
jerqi commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196467989 I think we should verify the `HealthCheck` function in a production environment first before we turn it on. But there are few broken disks in our production environment. I prefer u

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196452995 Do we need to turn this parameter HealthCheck on by default? This allows for better screening of healthy machines. @colinmjj @jerqi

[GitHub] [incubator-uniffle] colinmjj commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
colinmjj commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196418301 > @colinmjj I think you are right, but is it possible that memory is allocated normally, but disk IO has problems? I think you're worried about the shuffle server with ab

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196406719 @colinmjj I think you are right, but is it possible that memory is allocated normally, but disk IO has problems?

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196405275 I mean the shuffleServer's property isHealthy returns true by default, but not the HealthCheck's default value.

[GitHub] [incubator-uniffle] colinmjj commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
colinmjj commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196400738 @smallzhongfeng The workload of the Shuffle Server depends on a lot of things, e.g., memory, disk IO, network IO, etc. To simplify the assignment strategy, memory is chosen as the m

[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
jerqi commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196378640 > 1. I think this scheme can only support MEMORY LOCALFILE. > 2. Since this HealthCheck collects the information of the local disk, we can use this feature. This health check

[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

2022-07-27 Thread GitBox
smallzhongfeng commented on issue #89: URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196369819 1. I think this scheme can only support MEMORY LOCALFILE. 2. Since this HealthCheck collects the information of the local disk, we can use this feature. This health c