I see. Then you should use `mapPartitions` rather than using ThreadLocal.
E.g.,

dstream.mapPartitions( iter ->
    val d = new SomeClass();
    return iter.map { p =>
       somefunc(p, d.get())
    };
}; );


On Fri, Jan 29, 2016 at 5:29 PM, N B <nb.nos...@gmail.com> wrote:

> Well won't the code in lambda execute inside multiple threads in the
> worker because it has to process many records? I would just want to have a
> single copy of SomeClass instantiated per thread rather than once per each
> record being processed. That was what triggered this thought anyways.
>
> Thanks
> NB
>
>
> On Fri, Jan 29, 2016 at 5:09 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> It looks weird. Why don't you just pass "new SomeClass()" to "somefunc"?
>> You don't need to use ThreadLocal if there are no multiple threads in your
>> codes.
>>
>> On Fri, Jan 29, 2016 at 4:39 PM, N B <nb.nos...@gmail.com> wrote:
>>
>>> Fixed a typo in the code to avoid any confusion.... Please comment on
>>> the code below...
>>>
>>> dstream.map( p -> { ThreadLocal<SomeClass> d = new ThreadLocal<>() {
>>>          public SomeClass initialValue() { return new SomeClass(); }
>>>     };
>>>     somefunc(p, d.get());
>>>     d.remove();
>>>     return p;
>>> }; );
>>>
>>> On Fri, Jan 29, 2016 at 4:32 PM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> So this use of ThreadLocal will be inside the code of a function
>>>> executing on the workers i.e. within a call from one of the lambdas. Would
>>>> it just look like this then:
>>>>
>>>> dstream.map( p -> { ThreadLocal<Data> d = new ThreadLocal<>() {
>>>>          public SomeClass initialValue() { return new SomeClass(); }
>>>>     };
>>>>     somefunc(p, d.get());
>>>>     d.remove();
>>>>     return p;
>>>> }; );
>>>>
>>>> Will this make sure that all threads inside the worker clean up the
>>>> ThreadLocal once they are done with processing this task?
>>>>
>>>> Thanks
>>>> NB
>>>>
>>>>
>>>> On Fri, Jan 29, 2016 at 1:00 PM, Shixiong(Ryan) Zhu <
>>>> shixi...@databricks.com> wrote:
>>>>
>>>>> Spark Streaming uses threadpools so you need to remove ThreadLocal
>>>>> when it's not used.
>>>>>
>>>>> On Fri, Jan 29, 2016 at 12:55 PM, N B <nb.nos...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the response Ryan. So I would say that it is in fact the
>>>>>> purpose of a ThreadLocal i.e. to have a copy of the variable as long as 
>>>>>> the
>>>>>> thread lives. I guess my concern is around usage of threadpools and 
>>>>>> whether
>>>>>> Spark streaming will internally create many threads that rotate between
>>>>>> tasks on purpose thereby holding onto ThreadLocals that may actually 
>>>>>> never
>>>>>> be used again.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Jan 29, 2016 at 12:12 PM, Shixiong(Ryan) Zhu <
>>>>>> shixi...@databricks.com> wrote:
>>>>>>
>>>>>>> Of cause. If you use a ThreadLocal in a long living thread and
>>>>>>> forget to remove it, it's definitely a memory leak.
>>>>>>>
>>>>>>> On Thu, Jan 28, 2016 at 9:31 PM, N B <nb.nos...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Does anyone know if there are any potential pitfalls associated
>>>>>>>> with using ThreadLocal variables in a Spark streaming application? One
>>>>>>>> things I have seen mentioned in the context of app servers that use 
>>>>>>>> thread
>>>>>>>> pools is that ThreadLocals can leak memory. Could this happen in Spark
>>>>>>>> streaming also?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Nikunj
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to