Very cool to hear of this progress on Samza!

Python protocol buffers are extraordinarily slow (lots of reflection,
attribute lookups, and bit fiddling for serialization/deserialization,
which is certainly not Python's strong point). Each bundle processed
involves multiple protos being constructed and sent/received (notably
the particularly nested and branchy monitoring-info one). While there
are still some improvements that could be made to make bundles
lighter-weight, amortizing this cost over many elements is essential
for performance. (Note that elements within a bundle are packed into a
single byte buffer, so they avoid this overhead.)
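To make the amortization point concrete, here is a hypothetical micro-sketch (not Beam's actual coder code; the element sizes and the length-prefixed framing are illustrative assumptions) contrasting per-element framing with packing a whole bundle into one buffer:

```python
import struct

# Illustrative bundle: 1000 elements of 1 KB each (made-up sizes).
elements = [b"x" * 1024 for _ in range(1000)]

def one_buffer_per_element():
    # One length-prefixed buffer per element -- in the real data plane this
    # would also mean per-element proto/RPC overhead, which dominates.
    return [struct.pack(">I", len(e)) + e for e in elements]

def single_packed_buffer():
    # All elements length-prefixed into one contiguous buffer, so any
    # per-message proto/gRPC cost is paid once per bundle, not per element.
    out = bytearray()
    for e in elements:
        out += struct.pack(">I", len(e))
        out += e
    return bytes(out)
```

Both produce the same bytes; the difference is how many times the surrounding per-message machinery runs.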

Also, it may be good to ensure you're at least using the C++
bindings:
https://developers.google.com/protocol-buffers/docs/reference/python-generated
(still slow, but not as slow).
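A quick way to check which protobuf backend is active (a sketch; the env var must be set before `google.protobuf` is first imported, and the helper returns None if protobuf isn't installed at all):

```python
import os

# Request the C++-backed implementation; this must happen before the first
# import of google.protobuf anywhere in the process.
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "cpp")

def protobuf_backend():
    """Return the active protobuf backend name (e.g. 'cpp' or 'python'),
    or None if the protobuf package is not installed."""
    try:
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()
```

If this reports 'python', serialization is going through the pure-Python code paths described above.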

And, of course, due to the GIL one may want many Python workers on
multi-core machines.
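A minimal sketch of the multi-process idea (`handle_bundle` is a hypothetical stand-in for SDK-harness work, not the actual sdk_worker entry point): each process gets its own interpreter and GIL, so CPU-bound bundle processing can use all cores.

```python
import os
from multiprocessing import Pool

def handle_bundle(bundle):
    # Stand-in for CPU-bound per-bundle work; in separate processes this
    # is not serialized by a single GIL the way threads would be.
    return sum(bundle)

if __name__ == "__main__":
    bundles = [list(range(1000)) for _ in range(8)]
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(handle_bundle, bundles)
    print(results)
```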

On Thu, Nov 8, 2018 at 9:18 PM Thomas Weise <[email protected]> wrote:
>
> We have been doing some end to end testing with Python and Flink (streaming). 
> You could take a look at the following and possibly replicate it for your 
> work:
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py
>
> We found that in order to get acceptable performance, we need larger bundles 
> (we started with single-element bundles). The default in the Flink runner now 
> is to cap bundles at 1000 elements or 1 second, whichever comes first. With 
> that, I have seen decent throughput for the pipeline above (~ 5000k elements 
> per second with a single SDK worker).
>
> The Flink runner also has support to run multiple SDK workers per Flink task 
> manager.
>
> Thomas
>
>
> On Thu, Nov 8, 2018 at 11:13 AM Xinyu Liu <[email protected]> wrote:
>>
>> 19mb/s throughput is enough for us. It seems the bottleneck is the rate of 
>> RPC calls. Our message size is usually 1k ~ 10k. So if we can reach 19mb/s, 
>> we will be able to process ~4k qps, which meets our requirements. I guess 
>> increasing the size of the bundles will help. Do you guys have any results 
>> from running Python with Flink? We are curious about the performance there.
>>
>> Thanks,
>> Xinyu
>>
>> On Thu, Nov 8, 2018 at 10:13 AM Lukasz Cwik <[email protected]> wrote:
>>>
>>> This benchmark[1] shows that Python is getting about 19mb/s.
>>>
>>> Yes, running more python sdk_worker processes will improve performance 
>>> since Python is limited to a single CPU core.
>>>
>>> [1] 
>>> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=490377658&container=1286539696
>>>
>>>
>>>
>>> On Wed, Nov 7, 2018 at 5:24 PM Xinyu Liu <[email protected]> wrote:
>>>>
>>>> By looking at the gRPC dashboard published by the benchmark[1], it seems 
>>>> the streaming ping-pong operations per second for gRPC in python is around 
>>>> 2k ~ 3k qps. This seems quite low compared to gRPC performance in other 
>>>> languages, e.g. 600k qps for Java and Go. Is it expected to run multiple 
>>>> sdk_worker processes to improve performance?
>>>>
>>>> [1] 
>>>> https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=713624174&container=1012810333&maximized
>>>>
>>>> On Wed, Nov 7, 2018 at 11:14 AM Lukasz Cwik <[email protected]> wrote:
>>>>>
>>>>> gRPC folks provide a bunch of benchmarks for different scenarios: 
>>>>> https://grpc.io/docs/guides/benchmarking.html
>>>>> You would be most interested in the streaming throughput benchmarks since 
>>>>> the Data API is written on top of the gRPC streaming APIs.
>>>>>
>>>>> 200KB/s does seem pretty small. Have you captured any Python profiles[1] 
>>>>> and looked at them?
>>>>>
>>>>> 1: 
>>>>> https://lists.apache.org/thread.html/f8488faede96c65906216c6b4bc521385abeddc1578c99b85937d2f2@%3Cdev.beam.apache.org%3E
>>>>>
>>>>>
>>>>> On Wed, Nov 7, 2018 at 10:18 AM Hai Lu <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is Hai from LinkedIn. I'm currently working on the Portable API for 
>>>>>> the Samza Runner. I was able to make Python work with a Samza container 
>>>>>> reading from Kafka. However, I'm seeing a severe performance issue with 
>>>>>> my setup, achieving only ~200KB/s throughput between the Samza runner on 
>>>>>> the Java side and the sdk_worker on the Python side.
>>>>>>
>>>>>> While I'm digging into this, I wonder whether someone has benchmarked 
>>>>>> the data channel between Java and Python and has results on how much 
>>>>>> throughput can be reached, assuming a single worker thread and a single 
>>>>>> JobBundleFactory.
>>>>>>
>>>>>> I might be missing some very basic gRPC setting that leads to these 
>>>>>> unsatisfactory results. So another question is whether there are any 
>>>>>> good articles or documentation about gRPC tuning dedicated to IPC.
>>>>>>
>>>>>> Thanks,
>>>>>> Hai
>>>>>>
>>>>>>
