Very cool to hear of this progress on Samza! Python protocol buffers are extraordinarily slow (lots of reflection, attribute lookups, and bit fiddling for serialization/deserialization, which is certainly not Python's strong point). Each bundle processed involves multiple protos being constructed and sent/received (notably the particularly nested and branchy monitoring-info one). While there are still some improvements that could be made to make bundles lighter-weight, amortizing this cost over many elements is essential for performance. (Note that elements within a bundle are packed into a single byte buffer, so they avoid this overhead.)
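To make the amortization concrete, here is a minimal sketch of the idea, not Beam's actual coders or data-plane framing: a bundle's elements are packed into one length-prefixed byte buffer, so the per-send overhead is paid once per bundle instead of once per element.

```python
import struct

def pack_bundle(elements):
    """Pack many byte-string elements into one length-prefixed buffer.

    A sketch only: Beam's data plane uses its own coders and framing,
    but the principle is the same -- one buffer and one send per bundle
    rather than one RPC per element.
    """
    parts = []
    for elem in elements:
        parts.append(struct.pack(">I", len(elem)))  # 4-byte big-endian length prefix
        parts.append(elem)
    return b"".join(parts)

def unpack_bundle(buf):
    """Recover the individual elements from a packed bundle."""
    elements = []
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        elements.append(buf[offset:offset + length])
        offset += length
    return elements

bundle = [b"element-%d" % i for i in range(1000)]
packed = pack_bundle(bundle)
assert unpack_bundle(packed) == bundle
```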
Also, it may be good to make sure you're at least using the C++ bindings: https://developers.google.com/protocol-buffers/docs/reference/python-generated (still slow, but not as slow). And, of course, due to the GIL one may want many Python workers for multi-core machines.

On Thu, Nov 8, 2018 at 9:18 PM Thomas Weise <[email protected]> wrote:
>
> We have been doing some end-to-end testing with Python and Flink (streaming).
> You could take a look at the following and possibly replicate it for your
> work:
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py
>
> We found that in order to get acceptable performance, we need larger bundles
> (we started with single-element bundles). The default in the Flink runner now
> is to cap bundles at 1000 elements or 1 second, whichever comes first. With
> that, I have seen decent throughput for the pipeline above (~ 5000k elements
> per second with a single SDK worker).
>
> The Flink runner also has support to run multiple SDK workers per Flink task
> manager.
>
> Thomas
>
>
> On Thu, Nov 8, 2018 at 11:13 AM Xinyu Liu <[email protected]> wrote:
>>
>> 19mb/s throughput is enough for us. It seems the bottleneck is the rate of
>> RPC calls. Our message size is usually 1k ~ 10k, so if we can reach 19mb/s,
>> we will be able to process ~4k qps, which meets our requirements. I guess
>> increasing the size of the bundles will help. Do you guys have any results
>> from running Python with Flink? We are curious about the performance there.
>>
>> Thanks,
>> Xinyu
>>
>> On Thu, Nov 8, 2018 at 10:13 AM Lukasz Cwik <[email protected]> wrote:
>>>
>>> This benchmark[1] shows that Python is getting about 19mb/s.
>>>
>>> Yes, running more Python sdk_worker processes will improve performance,
>>> since Python is limited to a single CPU core.
>>>
>>> [1]
>>> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=490377658&container=1286539696
>>>
>>>
>>> On Wed, Nov 7, 2018 at 5:24 PM Xinyu Liu <[email protected]> wrote:
>>>>
>>>> Looking at the gRPC dashboard published by the benchmark[1], it seems
>>>> the streaming ping-pong operations per second for gRPC in Python is
>>>> around 2k ~ 3k qps. This seems quite low compared to gRPC performance in
>>>> other languages, e.g. 600k qps for Java and Go. Is it expected to run
>>>> multiple sdk_worker processes to improve performance?
>>>>
>>>> [1]
>>>> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=713624174&container=1012810333&maximized
>>>>
>>>> On Wed, Nov 7, 2018 at 11:14 AM Lukasz Cwik <[email protected]> wrote:
>>>>>
>>>>> The gRPC folks provide a bunch of benchmarks for different scenarios:
>>>>> https://grpc.io/docs/guides/benchmarking.html
>>>>> You would be most interested in the streaming throughput benchmarks,
>>>>> since the Data API is written on top of the gRPC streaming APIs.
>>>>>
>>>>> 200KB/s does seem pretty small. Have you captured any Python profiles[1]
>>>>> and looked at them?
>>>>>
>>>>> 1:
>>>>> https://lists.apache.org/thread.html/f8488faede96c65906216c6b4bc521385abeddc1578c99b85937d2f2@%3Cdev.beam.apache.org%3E
>>>>>
>>>>>
>>>>> On Wed, Nov 7, 2018 at 10:18 AM Hai Lu <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is Hai from LinkedIn. I'm currently working on the Portable API
>>>>>> for the Samza Runner. I was able to make Python work with a Samza
>>>>>> container reading from Kafka. However, I'm seeing a severe performance
>>>>>> issue with my setup, achieving only ~200KB/s throughput between the
>>>>>> Samza runner on the Java side and the sdk_worker on the Python side.
>>>>>>
>>>>>> While I'm digging into this, I wonder whether someone has benchmarked
>>>>>> the data channel between Java and Python and has results on how much
>>>>>> throughput can be reached? Assuming a single worker thread and a
>>>>>> single JobBundleFactory.
>>>>>>
>>>>>> I might be missing some very basic and naive gRPC setting which leads
>>>>>> to these unsatisfactory results. So another question is whether there
>>>>>> are any good articles or documentation about gRPC tuning dedicated to
>>>>>> IPC.
>>>>>>
>>>>>> Thanks,
>>>>>> Hai
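The bundle-capping policy Thomas describes above (flush at 1000 elements or 1 second, whichever comes first) can be sketched as follows. The names here are hypothetical and the Flink runner's actual implementation lives in Java; this only illustrates the policy.

```python
import time

class BundleCutter:
    """Flush a bundle at max_elements or max_seconds, whichever comes
    first -- a sketch of the policy described in the thread, not the
    Flink runner's actual implementation."""

    def __init__(self, flush, max_elements=1000, max_seconds=1.0):
        self.flush = flush  # callback that ships a complete bundle
        self.max_elements = max_elements
        self.max_seconds = max_seconds
        self.buffer = []
        self.started = None

    def add(self, element):
        if not self.buffer:
            self.started = time.monotonic()  # bundle starts at first element
        self.buffer.append(element)
        if (len(self.buffer) >= self.max_elements
                or time.monotonic() - self.started >= self.max_seconds):
            self.flush(self.buffer)
            self.buffer = []

bundles = []
cutter = BundleCutter(bundles.append, max_elements=3, max_seconds=60.0)
for x in range(7):
    cutter.add(x)
assert bundles == [[0, 1, 2], [3, 4, 5]]  # element 6 is still buffered
```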

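Picking up Lukasz's profiling suggestion from the thread: the standard library's cProfile is enough to see whether time goes into gRPC, proto (de)serialization, or user code. `process_bundle` below is a hypothetical stand-in for the SDK worker's real hot path.

```python
import cProfile
import io
import pstats

def process_bundle(elements):
    # Hypothetical stand-in for the SDK worker's per-bundle hot path.
    return [e * 2 for e in elements]

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    process_bundle(list(range(1000)))
profiler.disable()

# Print the 10 functions with the highest cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```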