Hi Xingbo, thanks for sharing. This is very interesting.
Regards, Niklas > On 27. Nov 2020, at 03:05, Xingbo Huang <hxbks...@gmail.com> wrote: > > Hi Niklas, > > Thanks a lot for supporting PyFlink. In fact, your requirement for multiple > input and multiple output is essentially a Table Aggregation Function [1]. > Although PyFlink does not support it yet, we have listed it in the release > 1.13 plan. In addition, row-based operations [2], which are very user-friendly > for machine learning users, are also included in the 1.13 plan. > > [1] > https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/functions/udfs.html#table-aggregation-functions > [2] > https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html#row-based-operations > > Best, > Xingbo > > Niklas Wilcke <niklas.wil...@uniberg.com> wrote on Thu, 26 Nov 2020, at 17:11: > Hi Xingbo, > > thanks for taking care and letting me know. I was about to share an example > of how to reproduce this. > Now I will wait for the next release candidate and give it a try. > > Regards, > Niklas > > > -- > niklas.wil...@uniberg.com > Mobile: +49 160 9793 2593 > Office: +49 40 2380 6523 > > Simon-von-Utrecht-Straße 85a > 20359 Hamburg > > UNIBERG GmbH > Commercial register: Amtsgericht Kiel HRB SE-1507 > Managing directors: Andreas Möller, Martin Ulbricht > > Information according to Art. 13 GDPR (B2B): > To communicate with you, we may process your personal data. > All information on how we handle your data can be found at > https://www.uniberg.com/impressum.html. > >> On 26. Nov 2020, at 02:59, Xingbo Huang <hxbks...@gmail.com> wrote: >> >> Hi Niklas, >> >> Regarding `Exception in thread "grpc-nio-worker-ELG-3-2" >> java.lang.NoClassDefFoundError: >> org/apache/beam/vendor/grpc/v1p26p0/io/netty/buffer/PoolArena$1`, >> it does not affect the correctness of the result. The reason is that some >> resources are released asynchronously when the gRPC server is shut down [1]. >> After the UserClassLoader unloads the class, the asynchronous thread tries >> to release the resources and throws a ClassNotFoundException, but the data >> result has already been sent downstream, so the correctness of the >> result is not affected. >> >> Regarding the details of the specific cause, I have explained it in the Flink >> community [2] and the Beam community [3], and fixed it on the Flink side. >> The problem will no longer occur in the upcoming releases >> 1.11.3 and 1.12.0. >> >> [1] >> https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/internal/SharedResourceHolder.java#L150 >> [2] https://issues.apache.org/jira/browse/FLINK-20284 >> [3] https://issues.apache.org/jira/browse/BEAM-5397 >> >> Best, >> Xingbo >> >> >> Dian Fu <dian0511...@gmail.com> wrote on Mon, 16 Nov 2020, at 21:10: >> Hi Niklas, >> >>> How can I ingest data in a batch table from Kafka or even better >>> Elasticsearch.
Kafka is only offering a Streaming source and Elasticsearch >>> isn't offering a source at all. >>> The only workaround which comes to my mind is to use the Kafka streaming >>> source and to apply a single very large window to create a bounded table. >>> Do you think that would work? >>> Are there other options available? Maybe converting a Stream to a bounded >>> table is somehow possible? Thank you! >> >> >> I think you are right that Kafka still doesn't support batch and there is no >> ES source for now. Another option is that you could load the data into a >> connector which supports batch. Not sure if anybody else has a better idea >> about this. >> >>> I found one cause of this problem and it was mixing a Scala 2.12 Flink >>> installation with PyFlink, which has some 2.11 jars in its opt folder. I >>> think the JVM just skipped the class definitions, because they weren't >>> compatible. I actually wasn't aware of the fact that PyFlink comes with >>> prebuilt jar dependencies. If PyFlink is only compatible with Scala 2.11 it >>> would make sense to point that out in the documentation. I think I never >>> read that and it might be missing. Unfortunately there is still one >>> Exception showing up at the very end of the job in the taskmanager log. I >>> did the verification you asked for and the class is present in both jar >>> files. The one which comes with Flink in the opt folder and the one of >>> PyFlink. You can find the log attached. >>> I think the main question is which jar file has to be loaded in the three >>> environments (executor, jobmanager, taskmanager). Is it fine to not put the >>> flink-python_2.11-1.12.0.jar into the lib folder in the jobmanager and >>> taskmanager? Will it somehow be transferred by PyFlink to the jobmanager >>> and taskmanager? >> >> PyFlink comes with built-in jars such as flink-python_2.11-1.12.0.jar, >> flink-dist_2.11-1.12.0.jar, etc., so you don't need to manually add >> them (and also shouldn't do that). Could you remove the duplicate jars and try it >> again? >> >>> No, I don't think that there are additional exceptions besides >>> "org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException", but >>> maybe take a look in the attached log files. This problem could be related >>> to 2., maybe the root cause is a class loading issue as well. What do you >>> think? You can find attached three log files. One for the executor, the >>> jobmanager and the taskmanager. Maybe you can find something useful. >> >> >> I found one similar issue on the Beam side: >> https://issues.apache.org/jira/browse/BEAM-6258 which has been resolved >> a long time ago. I'm still trying to reproduce this issue and will let you >> know if there is any progress. (It would be great if you could help to provide >> an example which could easily reproduce this issue.) >> >>> This was very helpful. I was able to implement it. There is only one detail >>> missing. Is it possible to UNNEST an array of Rows or tuples? It would be >>> really great if I were able to return a list with multiple fields. >>> Currently I'm just putting multiple values into a single VARCHAR, but that >>> means the information needs to be extracted later on. Maybe you have an >>> idea how to avoid that. >> >> Currently, Pandas UDAF still doesn't support complex types and so I'm afraid >> that you have to put multiple values into a single VARCHAR for now.
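To illustrate the VARCHAR workaround Dian describes above, here is a minimal, untested sketch. The function names, the "|" delimiter and the placeholder "model output" are assumptions made up for the example and are not part of the original job:

################################################################
from pyflink.table import DataTypes
from pyflink.table.udf import udaf, udf

# Pandas UDAF: pack several result fields into one delimited VARCHAR,
# since complex result types are not yet supported for Pandas UDAFs.
@udaf(input_types=[DataTypes.FLOAT(), DataTypes.FLOAT()],
      result_type=DataTypes.STRING(), func_type='pandas')
def forecast_packed(ds, y):
    # train the per-device model here; these three values are placeholders
    yhat, yhat_lower, yhat_upper = float(y.mean()), float(y.min()), float(y.max())
    return '%f|%f|%f' % (yhat, yhat_lower, yhat_upper)

# General Python UDF: pull a single field back out of the packed VARCHAR.
@udf(input_types=[DataTypes.STRING(), DataTypes.INT()],
     result_type=DataTypes.FLOAT())
def unpack(packed, index):
    return float(packed.split('|')[index])

# t_env.register_function("forecast_packed", forecast_packed)
# t_env.register_function("unpack", unpack)
# "SELECT riid, unpack(forecast_packed(ds, y), 0) AS yhat
#  FROM mySource GROUP BY riid"
################################################################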
>> >> Regards, >> Dian >> >> >>> On 16 Nov 2020, at 02:46, Niklas Wilcke <niklas.wil...@uniberg.com> wrote: >>> >>> Hi Dian, >>> >>> this was very helpful again. I will answer the old questions inline as >>> well. Unfortunately, one new question also popped up. >>> >>> How can I ingest data in a batch table from Kafka or even better >>> Elasticsearch? Kafka is only offering a Streaming source and Elasticsearch >>> isn't offering a source at all. >>> The only workaround which comes to my mind is to use the Kafka streaming >>> source and to apply a single very large window to create a bounded table. >>> Do you think that would work? >>> Are there other options available? Maybe converting a Stream to a bounded >>> table is somehow possible? Thank you! >>> >>> Kind Regards, >>> Niklas >>> >>> >>> >>>> On 13. Nov 2020, at 16:07, Dian Fu <dian0511...@gmail.com> wrote: >>>> >>>> Hi Niklas, >>>> >>>> Good to know that this solution may work for you. Regarding the >>>> questions you raised, please find my reply inline. >>>> >>>> Regards, >>>> Dian >>>> >>>>> On 13 Nov 2020, at 20:48, Niklas Wilcke <niklas.wil...@uniberg.com> wrote: >>>>> >>>>> Hi Dian, >>>>> >>>>> thanks again for your response. In the meantime I tried out your proposal >>>>> using the UDAF feature of PyFlink 1.12.0-rc1 and it is roughly working, >>>>> but I am facing some issues, which I would like to address. If this goes >>>>> too far, please let me know and I will open a new thread for each of the >>>>> questions. Let me share some more information about my current >>>>> environment, which will maybe help to answer the questions. I'm currently >>>>> using my dev machine with Docker and one jobmanager container and one >>>>> taskmanager container. If needed I can share the whole Docker >>>>> environment, but this would involve some more effort on my side. Here are >>>>> my five questions. >>>>> >>>>> 1. Where can I find connector libraries for 1.12.0-rc1 or some kind of >>>>> instructions on how to build them? I can't find them in the 1.12.0-rc1 >>>>> release and when I build Flink from source, I can't find the connector >>>>> libraries in the build target. I need flink-sql-connector-elasticsearch7 >>>>> and flink-sql-connector-kafka. >>>> >>>> You could download the connector jars of 1.12.0-rc1 from here: >>>> https://repository.apache.org/content/repositories/orgapacheflink-1402/org/apache/flink/ >>> Thanks, that worked like a charm! >>> >>>> >>>>> 2. Which steps are needed to properly set up PyFlink? I followed the >>>>> instructions, but I always get some ClassNotFoundExceptions for some Beam >>>>> related classes in the taskmanager. The job still works fine, but this >>>>> doesn't look good to me. It happens in 1.11.2 and in 1.12.0-rc1. I tried >>>>> to resolve this by adding certain jars, but I wasn't able to fix it. >>>>> Maybe you have an idea. You can find the Dockerfile attached, which lays >>>>> out the steps I'm currently using. The exception's signature looks like >>>>> this: >>>>> Exception in thread "grpc-nio-worker-ELG-3-2" >>>>> java.lang.NoClassDefFoundError: >>>>> org/apache/beam/vendor/grpc/v1p26p0/io/netty/buffer/PoolArena$1 >>>> >>>> Usually there is nothing special you need to do to set up PyFlink.
I have >>>> manually checked that this class should be there (inside >>>> flink-python_2.11-1.12.0.jar), and so I guess your environment >>>> isn't clean enough? >>>> >>>> I guess you could check the following things: >>>> 1) Is it because you have installed 1.11.2 before and so the environment >>>> is not clean enough? Could you uninstall PyFlink 1.11.2 manually and >>>> reinstall PyFlink 1.12.0-rc1 again? You could also manually check that >>>> there is only one flink-python*.jar under the directory >>>> xxx/site-packages/pyflink/opt/ >>>> 2) Verify that the class is actually there with the following command >>>> (flink-python_2.11-1.12.0.jar is under the directory >>>> xxx/site-packages/pyflink/opt/): >>>> jar tf flink-python_2.11-1.12.0.jar | grep >>>> "org/apache/beam/vendor/grpc/v1p26p0/io/netty/buffer/PoolArena" >>>> 3) If this exception still happens, could you share the exception stack? >>> >>> I found one cause of this problem and it was mixing a Scala 2.12 Flink >>> installation with PyFlink, which has some 2.11 jars in its opt folder. I >>> think the JVM just skipped the class definitions, because they weren't >>> compatible. I actually wasn't aware of the fact that PyFlink comes with >>> prebuilt jar dependencies. If PyFlink is only compatible with Scala 2.11 it >>> would make sense to point that out in the documentation. I think I never >>> read that and it might be missing. Unfortunately there is still one >>> Exception showing up at the very end of the job in the taskmanager log. I >>> did the verification you asked for and the class is present in both jar >>> files. The one which comes with Flink in the opt folder and the one of >>> PyFlink. You can find the log attached. >>> I think the main question is which jar file has to be loaded in the three >>> environments (executor, jobmanager, taskmanager). Is it fine to not put the >>> flink-python_2.11-1.12.0.jar into the lib folder in the jobmanager and >>> taskmanager? Will it somehow be transferred by PyFlink to the jobmanager >>> and taskmanager? >>> >>>> >>>>> 3. When increasing the size of the input data set I get the following >>>>> Exception and the job is canceled. I tried to increase the resources >>>>> assigned to Flink, but it didn't help. Do you have an idea why this is >>>>> happening? You can find a more detailed stack trace in the appendix. >>>> >>>> Could you check if there are any other exceptions in the log when this >>>> exception happens? >>> >>> No, I don't think that there are additional exceptions besides >>> "org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException", but >>> maybe take a look in the attached log files. This problem could be related >>> to 2., maybe the root cause is a class loading issue as well. What do you >>> think? You can find attached three log files. One for the executor, the >>> jobmanager and the taskmanager. Maybe you can find something useful. >>> >>>> >>>>> 4. I can't manage to get the SQL UNNEST operation to work. It is quite >>>>> hard for me to debug it and I can't really find any valuable examples or >>>>> documentation on the internet. Currently, instead of creating an ARRAY, I'm >>>>> just returning a VARCHAR containing a string representation of the array. >>>>> You can find the relevant code in the appendix.
>>>> >>>> There are some examples here: >>>> https://github.com/apache/flink/blob/c601cfd662c2839f8ebc81b80879ecce55a8cbaf/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/planner/runtime/batch/sql/UnnestITCase.scala >>>> >>>> <https://github.com/apache/flink/blob/c601cfd662c2839f8ebc81b80879ecce55a8cbaf/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/planner/runtime/batch/sql/UnnestITCase.scala> >>> >>> This was very helpful. I was able to implement it. There is only one detail >>> missing. Is it possible to UNNEST an array of Rows or tuples? It would be >>> really great if I would be able to return a list with multiple fields. >>> Currently I'm just putting multiple value into a single VARCHAR, but that >>> means the information needs to be extracted later on. Maybe you have an >>> idea how to avoid that. >>> >>>> >>>>> 5. How can I obtain the output of the Python interpreter executing the >>>>> UDF. If I put a print statement in the UDF I can't see the output in the >>>>> log of the taskmanager. Is there a way to access it? >>>> >>>> You can use the standard logging in Python UDF instead of print. The log >>>> output could then be found in the log of the task manager. >>> >>> Thank you! That worked well. I should have checked that without asking. >>> >>>> >>>>> I hope these aren't too many questions for this thread. If this is the >>>>> case I can still split some of them out. Please let me know, if this is >>>>> the case. >>>>> Thank you very much. I really appreciate your help. >>>> >>>> It's fine to reuse this thread. :) >>>> >>>>> Kind Regards, >>>>> Niklas >>>>> >>>>> >>> >>> End of the Taskmanager Log for 2. >>> ################################################################### >>> taskmanager_1 | 2020-11-15 17:46:53,438 INFO >>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot >>> TaskSlot(index:5, state:ACTIVE, resource profile: >>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=219.333mb >>> (229987662 bytes), taskOffHeapMemory=166.667mb (174762666 bytes), >>> managedMemory=342.933mb (359591667 bytes), networkMemory=85.733mb (89897916 >>> bytes)}, allocationId: e5137050c0f1ef5e660311ddf1f3429f, jobId: >>> ba4e3974860af7dc00a28fdfbb44fe06). >>> taskmanager_1 | 2020-11-15 17:46:53,440 INFO >>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot >>> TaskSlot(index:1, state:ACTIVE, resource profile: >>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=219.333mb >>> (229987662 bytes), taskOffHeapMemory=166.667mb (174762666 bytes), >>> managedMemory=342.933mb (359591667 bytes), networkMemory=85.733mb (89897916 >>> bytes)}, allocationId: 541ad3e383fb9c024141f2bab5e8b7fd, jobId: >>> ba4e3974860af7dc00a28fdfbb44fe06). >>> taskmanager_1 | 2020-11-15 17:46:53,442 INFO >>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot >>> TaskSlot(index:2, state:ACTIVE, resource profile: >>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=219.333mb >>> (229987662 bytes), taskOffHeapMemory=166.667mb (174762666 bytes), >>> managedMemory=342.933mb (359591667 bytes), networkMemory=85.733mb (89897916 >>> bytes)}, allocationId: ef8adb7d879f4072123fe4bc12054c0c, jobId: >>> ba4e3974860af7dc00a28fdfbb44fe06). 
>>> taskmanager_1 | 2020-11-15 17:46:53,444 INFO >>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot >>> TaskSlot(index:4, state:ACTIVE, resource profile: >>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=219.333mb >>> (229987662 bytes), taskOffHeapMemory=166.667mb (174762666 bytes), >>> managedMemory=342.933mb (359591667 bytes), networkMemory=85.733mb (89897916 >>> bytes)}, allocationId: db5d62b8c9fe8172fc1883c148b150e8, jobId: >>> ba4e3974860af7dc00a28fdfbb44fe06). >>> taskmanager_1 | 2020-11-15 17:46:53,846 INFO >>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot >>> TaskSlot(index:0, state:ACTIVE, resource profile: >>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=219.333mb >>> (229987662 bytes), taskOffHeapMemory=166.667mb (174762666 bytes), >>> managedMemory=342.933mb (359591667 bytes), networkMemory=85.733mb (89897916 >>> bytes)}, allocationId: 637d053a0726548c2bc9261fc0e55414, jobId: >>> ba4e3974860af7dc00a28fdfbb44fe06). >>> taskmanager_1 | 2020-11-15 17:46:53,849 INFO >>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot >>> TaskSlot(index:3, state:ACTIVE, resource profile: >>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=219.333mb >>> (229987662 bytes), taskOffHeapMemory=166.667mb (174762666 bytes), >>> managedMemory=342.933mb (359591667 bytes), networkMemory=85.733mb (89897916 >>> bytes)}, allocationId: cfaa8633b9102e3a509cfc94dd97d38f, jobId: >>> ba4e3974860af7dc00a28fdfbb44fe06). >>> taskmanager_1 | 2020-11-15 17:46:53,851 INFO >>> org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove >>> job ba4e3974860af7dc00a28fdfbb44fe06 from job leader monitoring. >>> taskmanager_1 | 2020-11-15 17:46:53,851 INFO >>> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Close >>> JobManager connection for job ba4e3974860af7dc00a28fdfbb44fe06. >>> taskmanager_1 | 2020-11-15 17:46:54,371 ERROR >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.rejectedExecution >>> [] - Failed to submit a listener notification task. Event loop shut down? 
>>> taskmanager_1 | java.lang.NoClassDefFoundError: >>> org/apache/beam/vendor/grpc/v1p26p0/io/netty/util/concurrent/GlobalEventExecutor$2 >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.GlobalEventExecutor.startThread(GlobalEventExecutor.java:227) >>> >>> ~[blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.GlobalEventExecutor.execute(GlobalEventExecutor.java:215) >>> >>> ~[blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:841) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:498) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.DefaultPromise.setSuccess(DefaultPromise.java:96) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1089) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) >>> >>> [blob_p-2cc5d5ac59c7842f512002d81251a3cbfed058cc-e14fe009bc07ddff407ea4c5d74bd4be:1.12.0] >>> taskmanager_1 | at java.lang.Thread.run(Thread.java:748) >>> [?:1.8.0_275] >>> taskmanager_1 | Caused by: java.lang.ClassNotFoundException: >>> org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.GlobalEventExecutor$2 >>> taskmanager_1 | at >>> java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_275] >>> taskmanager_1 | at >>> java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_275] >>> taskmanager_1 | at >>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:63) >>> ~[flink-dist_2.11-1.12.0.jar:1.12.0] >>> taskmanager_1 | at >>> org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:72) >>> ~[flink-dist_2.11-1.12.0.jar:1.12.0] >>> taskmanager_1 | at >>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:49) >>> ~[flink-dist_2.11-1.12.0.jar:1.12.0] >>> taskmanager_1 | at >>> 
java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_275] >>> taskmanager_1 | ... 11 more >>> ################################################################### >>> Logfiles for 3. >>> >>> <large-data-set-taskmanager.log> >>> <large-data-set-jobmanager.log> >>> <large-data-set-executor.log> >>> >>>>> Dockerfile for question 2. >>>>> #################################################################### >>>>> # This image has been build based on the Dockerfile used for the flink >>>>> image on docker hub. >>>>> # The only change I applied is that I switched to flink 1.12.0-rc1. >>>>> FROM flink:1.12.0-rc1-scala_2.12 >>>>> >>>>> # Install python >>>>> # TODO: Minimize dependencies >>>>> RUN apt-get update && apt-get install -y \ >>>>> python3 \ >>>>> python3-pip \ >>>>> python3-dev \ >>>>> zip \ >>>>> && rm -rf /var/lib/apt/lists/* \ >>>>> && ln -s /usr/bin/python3 /usr/bin/python \ >>>>> && ln -s /usr/bin/pip3 /usr/bin/pip >>>>> >>>>> # Install pyflink >>>>> RUN wget --no-verbose >>>>> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.0-rc1/python/apache_flink-1.12.0-cp37-cp37m-manylinux1_x86_64.whl >>>>> >>>>> <https://dist.apache.org/repos/dist/dev/flink/flink-1.12.0-rc1/python/apache_flink-1.12.0-cp37-cp37m-manylinux1_x86_64.whl> >>>>> \ >>>>> && pip install apache_flink-1.12.0-cp37-cp37m-manylinux1_x86_64.whl \ >>>>> && rm apache_flink-1.12.0-cp37-cp37m-manylinux1_x86_64.whl >>>>> #################################################################### >>>>> Stack Trace for question 3. >>>>> #################################################################### >>>>> Caused by: java.lang.RuntimeException: Failed to close remote bundle >>>>> at >>>>> org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.finishBundle(BeamPythonFunctionRunner.java:368) >>>>> at >>>>> org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.flush(BeamPythonFunctionRunner.java:322) >>>>> at >>>>> org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.invokeFinishBundle(AbstractPythonFunctionOperator.java:283) >>>>> at >>>>> org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.checkInvokeFinishBundleByCount(AbstractPythonFunctionOperator.java:267) >>>>> at >>>>> org.apache.flink.table.runtime.operators.python.aggregate.arrow.batch.BatchArrowPythonGroupAggregateFunctionOperator.invokeCurrentBatch(BatchArrowPythonGroupAggregateFunctionOperator.java:64) >>>>> at >>>>> org.apache.flink.table.runtime.operators.python.aggregate.arrow.batch.AbstractBatchArrowPythonAggregateFunctionOperator.endInput(AbstractBatchArrowPythonAggregateFunctionOperator.java:94) >>>>> at >>>>> org.apache.flink.table.runtime.operators.python.aggregate.arrow.batch.BatchArrowPythonGroupAggregateFunctionOperator.endInput(BatchArrowPythonGroupAggregateFunctionOperator.java:33) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.endOperatorInput(StreamOperatorWrapper.java:91) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.lambda$close$0(StreamOperatorWrapper.java:127) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:127) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:134) >>>>> at >>>>> 
org.apache.flink.streaming.runtime.tasks.OperatorChain.closeOperators(OperatorChain.java:412) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:587) >>>>> at >>>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:549) >>>>> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) >>>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) >>>>> at java.lang.Thread.run(Thread.java:748) >>>>> Caused by: java.util.concurrent.ExecutionException: >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException: >>>>> CANCELLED: cancelled before receiving half close >>>>> at >>>>> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) >>>>> at >>>>> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) >>>>> at org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57) >>>>> at >>>>> org.apache.beam.runners.fnexecution.control.SdkHarnessClient$BundleProcessor$ActiveBundle.close(SdkHarnessClient.java:458) >>>>> at >>>>> org.apache.beam.runners.fnexecution.control.DefaultJobBundleFactory$SimpleStageBundleFactory$1.close(DefaultJobBundleFactory.java:547) >>>>> at >>>>> org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.finishBundle(BeamPythonFunctionRunner.java:366) >>>>> ... 17 more >>>>> Caused by: >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException: >>>>> CANCELLED: cancelled before receiving half close >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Status.asRuntimeException(Status.java:524) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onCancel(ServerCalls.java:275) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onCancel(Contexts.java:96) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closedInternal(ServerCallImpl.java:353) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closed(ServerCallImpl.java:341) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1Closed.runInContext(ServerImpl.java:867) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) >>>>> at >>>>> org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >>>>> ... 1 more >>>>> ################################################################ >>>>> Code for question 4. 
>>>>> ################################################################ >>>>> # UDAF signature >>>>> @udaf(input_types=[DataTypes.FLOAT(), DataTypes.FLOAT()], >>>>> result_type=DataTypes.VARCHAR(10000), func_type='pandas') >>>>> def forcast(ds_float_series, y): >>>>> >>>>> # SQL DDL >>>>> "create table mySource (ds FLOAT, riid VARCHAR(100), y FLOAT ) with ( ... >>>>> )" >>>>> "create table mySink (riid VARCHAR(100), yhatd VARCHAR(10000)) with ( ... >>>>> )" >>>>> >>>>> # SQL INSERT >>>>> "INSERT INTO mySink SELECT riid, forcast(ds, y) AS yhat FROM mySource >>>>> GROUP BY riid" >>>>> ################################################################ >>>>> >>>>>> On 12. Nov 2020, at 12:53, Dian Fu <dian0511...@gmail.com> wrote: >>>>>> >>>>>> Hi Niklas, >>>>>> >>>>>> The Python DataStream API will also be supported in the coming 1.12.0 release >>>>>> [1]. However, the functionalities are still limited for the time being >>>>>> compared to the Java DataStream API, e.g. it will only support >>>>>> stateless operations, such as map, flat_map, etc. >>>>>> >>>>>> [1] >>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/python/datastream_tutorial.html >>>>>> >>>>>>> On 12 Nov 2020, at 19:46, Niklas Wilcke <niklas.wil...@uniberg.com> wrote: >>>>>>> >>>>>>> Hi Dian, >>>>>>> >>>>>>> thank you very much for this valuable response. I already read about >>>>>>> the UDAF, but I wasn't aware of the fact that it is possible to return >>>>>>> and UNNEST an array. I will definitely give it a try and hopefully this >>>>>>> will solve my issue. >>>>>>> >>>>>>> Another question that came to my mind is whether PyFlink supports >>>>>>> any other API except Table and SQL, like the Streaming and Batch API. >>>>>>> The documentation is only covering the Table API, but I'm not sure >>>>>>> about that. Can you maybe tell me whether the Table and SQL API is the >>>>>>> only one supported by PyFlink? >>>>>>> >>>>>>> Kind Regards, >>>>>>> Niklas >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On 11. Nov 2020, at 15:32, Dian Fu <dian0511...@gmail.com> wrote: >>>>>>>> >>>>>>>> Hi Niklas, >>>>>>>> >>>>>>>> You are correct that the input/output length of a Pandas UDF must be of >>>>>>>> the same size, and that Flink will split the input data into multiple >>>>>>>> bundles for the Pandas UDF and the bundle size is non-deterministic. Both >>>>>>>> of the above two limitations are by design, so I guess a Pandas UDF >>>>>>>> could not meet your requirements. >>>>>>>> >>>>>>>> However, you could take a look at whether the Pandas UDAF [1], which is >>>>>>>> supported in 1.12, could meet your requirements: >>>>>>>> - As group_by only generates one record per group key, just as you said, >>>>>>>> you could declare the output type of the Pandas UDAF as an array type >>>>>>>> - You then need to flatten the aggregation results, e.g. using UNNEST >>>>>>>> >>>>>>>> NOTE: Flink 1.12 is still not released. You could try the PyFlink >>>>>>>> package of RC1 [2] for 1.12.0 or build it yourself according to [3].
>>>>>>>> >>>>>>>> [1] >>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html#vectorized-aggregate-functions >>>>>>>> >>>>>>>> <https://ci.apache.org/projects/flink/flink-docs-master/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html#vectorized-aggregate-functions> >>>>>>>> [2] >>>>>>>> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.0-rc1/python/ >>>>>>>> <https://dist.apache.org/repos/dist/dev/flink/flink-1.12.0-rc1/python/> >>>>>>>> [3] >>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/flinkDev/building.html#build-pyflink >>>>>>>> >>>>>>>> <https://ci.apache.org/projects/flink/flink-docs-master/flinkDev/building.html#build-pyflink> >>>>>>>> >>>>>>>> Regards, >>>>>>>> Dian >>>>>>>> >>>>>>>>> 在 2020年11月11日,下午9:03,Niklas Wilcke <niklas.wil...@uniberg.com >>>>>>>>> <mailto:niklas.wil...@uniberg.com>> 写道: >>>>>>>>> >>>>>>>>> Hi Flink Community, >>>>>>>>> >>>>>>>>> I'm currently trying to implement a parallel machine learning job >>>>>>>>> with Flink. The goal is to train models in parallel for independent >>>>>>>>> time series in the same data stream. For that purpose I'm using a >>>>>>>>> Python library, which lead me to PyFlink. Let me explain the use case >>>>>>>>> a bit more. >>>>>>>>> I want to implement a batch job, which partitions/groups the data by >>>>>>>>> a device identifier. After that I need to process the data for each >>>>>>>>> device all at once. There is no way to iteratively train the model >>>>>>>>> unfortunately. The challenge I'm facing is to guarantee that all data >>>>>>>>> belonging to a certain device is processed in one single step. I'm >>>>>>>>> aware of the fact that this does not scale well, but for a reasonable >>>>>>>>> amount of input data per device it should be fine from my perspective. >>>>>>>>> I investigated a lot and I ended up using the Table API and Pandas >>>>>>>>> UDF, which roughly fulfil my requirements, but there are the >>>>>>>>> following limitations left, which I wanted to talk about. >>>>>>>>> >>>>>>>>> 1. Pandas UDF takes multiple Series as input parameters, which is >>>>>>>>> fine for my purpose, but as far as I can see there is no way to >>>>>>>>> guarantee that the chunk of data in the Series is "complete". Flink >>>>>>>>> will slice the Series and maybe call the UDF multiple times for each >>>>>>>>> device. As far as I can see there are some config options like >>>>>>>>> "python.fn-execution.arrow.batch.size" and >>>>>>>>> "python.fn-execution.bundle.time", which might help, but I'm not >>>>>>>>> sure, whether this is the right path to take. >>>>>>>>> 2. The length of the input Series needs to be of the same size as the >>>>>>>>> output Series, which isn't nice for my use case. What I would like to >>>>>>>>> do is to process n rows and emit m rows. There shouldn't be any >>>>>>>>> dependency between the number of input rows and the number of output >>>>>>>>> rows. >>>>>>>>> >>>>>>>>> 3. How do I partition the data stream. The Table API offers a >>>>>>>>> groupby, but this doesn't serve my purpose, because I don't want to >>>>>>>>> aggregate all the grouped lines. Instead as stated above I want to >>>>>>>>> emit m result lines per group. Are there other options using the >>>>>>>>> Table API or any other API to do this kind of grouping. I would need >>>>>>>>> something like a "keyBy()" from the streaming API. Maybe this can be >>>>>>>>> combined? Can I create a separate table for each key? 
>>>>>>>>> >>>>>>>>> I'm also open to ideas for a completely different approach not using >>>>>>>>> the Table API or Pandas UDF. Any idea is welcome. >>>>>>>>> >>>>>>>>> You can find a condensed version of the source code attached. >>>>>>>>> >>>>>>>>> Kind Regards, >>>>>>>>> Niklas >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> ############################################################# >>>>>>>>> >>>>>>>>> from pyflink.datastream import StreamExecutionEnvironment >>>>>>>>> from pyflink.table import StreamTableEnvironment, DataTypes >>>>>>>>> from pyflink.table.udf import udf >>>>>>>>> >>>>>>>>> env = StreamExecutionEnvironment.get_execution_environment() >>>>>>>>> env.set_parallelism(1) >>>>>>>>> t_env = StreamTableEnvironment.create(env) >>>>>>>>> t_env.get_config().get_configuration().set_boolean("python.fn-execution.memory.managed", >>>>>>>>> True) >>>>>>>>> >>>>>>>>> @udf(input_types=[DataTypes.FLOAT(), DataTypes.FLOAT()], >>>>>>>>> result_type=DataTypes.FLOAT(), udf_type='pandas') >>>>>>>>> def forcast(ds_float_series, y): >>>>>>>>> >>>>>>>>> # Train the model and create the forcast >>>>>>>>> >>>>>>>>> yhat_ts = forcast['yhat'].tail(input_size) >>>>>>>>> return yhat_ts >>>>>>>>> >>>>>>>>> t_env.register_function("forcast", forcast) >>>>>>>>> >>>>>>>>> # Define sink and source here >>>>>>>>> >>>>>>>>> t_env.execute_sql(my_source_ddl) >>>>>>>>> t_env.execute_sql(my_sink_ddl) >>>>>>>>> >>>>>>>>> # TODO: key_by instead of filter >>>>>>>>> t_env.from_path('mySource') \ >>>>>>>>> .where("riid === 'r1i1'") \ >>>>>>>>> .select("ds, riid, y, forcast(ds, y) as yhat_90d") \ >>>>>>>>> .insert_into('mySink') >>>>>>>>> >>>>>>>>> t_env.execute("pandas_udf_demo") >>>>>>>>> >>>>>>>>> ############################################################# >> >
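For reference, here is a condensed, untested sketch of the array-plus-UNNEST variant that Dian suggests above, which avoids packing everything into a single VARCHAR. It assumes mySource and mySink are defined as in the original job, except that the sink's yhat column is declared as FLOAT, and it replaces the actual model training with a placeholder:

################################################################
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, DataTypes
from pyflink.table.udf import udaf

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
t_env.get_config().get_configuration().set_boolean(
    "python.fn-execution.memory.managed", True)

# Pandas UDAF: called once per riid group, returns the whole forecast
# for that device as an array instead of a single scalar.
@udaf(input_types=[DataTypes.FLOAT(), DataTypes.FLOAT()],
      result_type=DataTypes.ARRAY(DataTypes.FLOAT()), func_type='pandas')
def forecast(ds, y):
    # train the per-device model here; the placeholder just echoes the input
    return [float(v) for v in y]

t_env.register_function("forecast", forecast)

# define mySource and mySink via t_env.execute_sql(...) as in the original
# job, then flatten the array result with UNNEST:
t_env.execute_sql("""
    INSERT INTO mySink
    SELECT t.riid, A.yhat
    FROM (
        SELECT riid, forecast(ds, y) AS yhats
        FROM mySource
        GROUP BY riid
    ) t
    CROSS JOIN UNNEST(t.yhats) AS A (yhat)
""")
################################################################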