Re: How does PySpark send "import" to the worker when executing Python UDFs?
Aha I see. Thanks Hyukjin!

On Tue, Jul 19, 2022 at 9:09 PM Hyukjin Kwon wrote:
> This is done by cloudpickle. It pickles the global variables referred to
> within the function together with it, and registers the imported modules
> in the function's globals.
>
> On Wed, 20 Jul 2022 at 00:55, Li Jin wrote:
>> Hi,
>>
>> I have a question about how "imports" get sent to the Python worker.
>>
>> For example, I have
>>
>> def foo(x):
>>     return np.abs(x)
>>
>> If I run this code directly, it obviously fails (because np is undefined
>> on the driver process):
>>
>> sc.parallelize([1, 2, 3]).map(foo).collect()
>>
>> However, if I add the import statement "import numpy as np" on the
>> driver, it works. So somehow the driver is sending those "imports" to the
>> worker when executing foo there, but I cannot seem to find the code that
>> does this - can someone please send me a pointer?
>>
>> Thanks,
>> Li
Re: How does PySpark send "import" to the worker when executing Python UDFs?
This is done by cloudpickle. It pickles the global variables referred to within the function together with it, and registers the imported modules in the function's globals.

On Wed, 20 Jul 2022 at 00:55, Li Jin wrote:
> Hi,
>
> I have a question about how "imports" get sent to the Python worker.
>
> For example, I have
>
> def foo(x):
>     return np.abs(x)
>
> If I run this code directly, it obviously fails (because np is undefined
> on the driver process):
>
> sc.parallelize([1, 2, 3]).map(foo).collect()
>
> However, if I add the import statement "import numpy as np" on the
> driver, it works. So somehow the driver is sending those "imports" to the
> worker when executing foo there, but I cannot seem to find the code that
> does this - can someone please send me a pointer?
>
> Thanks,
> Li
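A minimal sketch of that mechanism, assuming cloudpickle and numpy are installed. It runs in a single interpreter, so it only approximates the driver/worker split; on a real Spark worker the unpickling happens in a separate Python process:

    import pickle
    import cloudpickle
    import numpy as np  # the "driver-side" import

    def foo(x):
        return np.abs(x)

    # cloudpickle serializes foo by value, capturing the globals the
    # function body uses; the global name "np" is recorded as a
    # reference to the numpy module rather than as data.
    payload = cloudpickle.dumps(foo)

    # Deserializing with plain pickle (as a worker process would)
    # re-imports numpy and binds it as "np" in the restored function's
    # __globals__, so foo runs without an explicit import on this side.
    restored = pickle.loads(payload)
    print(restored(-3))                  # 3
    print("np" in restored.__globals__)  # True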
How does PySpark send "import" to the worker when executing Python UDFs?
Hi,

I have a question about how "imports" get sent to the Python worker.

For example, I have

def foo(x):
    return np.abs(x)

If I run this code directly, it obviously fails (because np is undefined on the driver process):

sc.parallelize([1, 2, 3]).map(foo).collect()

However, if I add the import statement "import numpy as np" on the driver, it works. So somehow the driver is sending those "imports" to the worker when executing foo there, but I cannot seem to find the code that does this - can someone please send me a pointer?

Thanks,
Li