I see. Then you should use `mapPartitions` rather than using ThreadLocal. E.g.,
dstream.mapPartitions( iter -> val d = new SomeClass(); return iter.map { p => somefunc(p, d.get()) }; }; ); On Fri, Jan 29, 2016 at 5:29 PM, N B <nb.nos...@gmail.com> wrote: > Well won't the code in lambda execute inside multiple threads in the > worker because it has to process many records? I would just want to have a > single copy of SomeClass instantiated per thread rather than once per each > record being processed. That was what triggered this thought anyways. > > Thanks > NB > > > On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"? >> You don't need to use ThreadLocal if there are no multiple threads in your >> codes. >> >> On Fri, Jan 29, 2016 at 4:39 PM, N B <nb.nos...@gmail.com> wrote: >> >>> Fixed a typo in the code to avoid any confusion.... Please comment on >>> the code below... >>> >>> dstream.map( p -> { ThreadLocal<SomeClass> d = new ThreadLocal<>() { >>> public SomeClass initialValue() { return new SomeClass(); } >>> }; >>> somefunc(p, d.get()); >>> d.remove(); >>> return p; >>> }; ); >>> >>> On Fri, Jan 29, 2016 at 4:32 PM, N B <nb.nos...@gmail.com> wrote: >>> >>>> So this use of ThreadLocal will be inside the code of a function >>>> executing on the workers i.e. within a call from one of the lambdas. Would >>>> it just look like this then: >>>> >>>> dstream.map( p -> { ThreadLocal<Data> d = new ThreadLocal<>() { >>>> public SomeClass initialValue() { return new SomeClass(); } >>>> }; >>>> somefunc(p, d.get()); >>>> d.remove(); >>>> return p; >>>> }; ); >>>> >>>> Will this make sure that all threads inside the worker clean up the >>>> ThreadLocal once they are done with processing this task? >>>> >>>> Thanks >>>> NB >>>> >>>> >>>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu < >>>> shixi...@databricks.com> wrote: >>>> >>>>> Spark Streaming uses threadpools so you need to remove ThreadLocal >>>>> when it's not used. >>>>> >>>>> On Fri, Jan 29, 2016 at 12:55 PM, N B <nb.nos...@gmail.com> wrote: >>>>> >>>>>> Thanks for the response Ryan. So I would say that it is in fact the >>>>>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as >>>>>> the >>>>>> thread lives. I guess my concern is around usage of threadpools and >>>>>> whether >>>>>> Spark streaming will internally create many threads that rotate between >>>>>> tasks on purpose thereby holding onto ThreadLocals that may actually >>>>>> never >>>>>> be used again. >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu < >>>>>> shixi...@databricks.com> wrote: >>>>>> >>>>>>> Of cause. If you use a ThreadLocal in a long living thread and >>>>>>> forget to remove it, it's definitely a memory leak. >>>>>>> >>>>>>> On Thu, Jan 28, 2016 at 9:31 PM, N B <nb.nos...@gmail.com> wrote: >>>>>>> >>>>>>>> Hello, >>>>>>>> >>>>>>>> Does anyone know if there are any potential pitfalls associated >>>>>>>> with using ThreadLocal variables in a Spark streaming application? One >>>>>>>> things I have seen mentioned in the context of app servers that use >>>>>>>> thread >>>>>>>> pools is that ThreadLocals can leak memory. Could this happen in Spark >>>>>>>> streaming also? >>>>>>>> >>>>>>>> Thanks >>>>>>>> Nikunj >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >