Thanks for your quick response.

For some reason I can't use spark-xml (a schema-related issue).
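
For reference, my attempt looked roughly like this (a sketch; "record" is
just a placeholder rowTag and the path is illustrative):

// sketch: reading XML rows directly with spark-xml
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("/path/to/xml")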

I've tried reducing the number of tasks per executor by increasing the
number of executors, but it still throws the same error.

I can't understand why even 15 GB of executor memory isn't sufficient to
parse just a 2 GB XML file.
How can I check the maximum amount of JVM memory utilised by each task?

Do I need to tweak configurations other than spark.executor.memory to
increase the JVM memory available to tasks?
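
For context, here's roughly how I'm setting memory at the moment (a sketch;
these are normally passed via spark-submit, and the overhead value here is
just an illustration):

import org.apache.spark.sql.SparkSession

// illustrative settings -- the app name and 2g overhead are placeholders
val spark = SparkSession.builder()
  .appName("xml-parser")
  .config("spark.executor.memory", "15g")          // JVM heap per executor
  .config("spark.executor.memoryOverhead", "2g")   // native / off-heap allocations
  .getOrCreate()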

On Wed, Jan 26, 2022, 9:23 PM Sean Owen <sro...@gmail.com> wrote:

> Executor memory used shows data that is cached, not the VM usage. You're
> running out of memory somewhere, likely in your UDF, which probably parses
> massive XML docs as a DOM first or something. Use more memory, fewer tasks
> per executor, or consider using spark-xml if you are really just parsing
> pieces of it. It'll be more efficient.
>
> On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <
> abhimanyu.kr.sing...@gmail.com> wrote:
>
>> I'm doing some complex operations inside a Spark UDF (parsing huge XML files).
>>
>> Dataframe:
>>
>> +-----------------------+
>> | value                 |
>> +-----------------------+
>> | Content of XML File 1 |
>> | Content of XML File 2 |
>> | Content of XML File N |
>> +-----------------------+
>>
>> val df = dataframe.select(UDF_to_parse_xml(col("value")))
>>
>> The UDF looks something like this:
>>
>> val XMLelements : Array[MyClass1] = getXMLelements(xmlContent)
>> val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
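>>
>> For completeness, the wrapper is roughly this (a sketch; import assumed):
>>
>> import org.apache.spark.sql.functions.udf
>>
>> val UDF_to_parse_xml = udf { xmlContent: String =>
>>   getXMLelements(xmlContent).map(myfunction).distinct  // Array[MyClass2]
>> }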
>>
>> Parsing requires creating and de-duplicating arrays of around 100,000
>> elements from the XML (each element a MyClass holding Strings, Maps,
>> Integers, ...).
>>
>> In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark
>> processing still fails with an *ExecutorLostFailure* error for XML files
>> around 2 GB in size.
>> When I increase the executor memory (say, from 15 GB to 25 GB) it works
>> fine. One partition contains only one XML file (max size 2 GB), and one
>> task per executor runs in parallel.
>>
>> *My question is: which memory does the UDF use for storing arrays, maps,
>> or sets while parsing?*
>> *And how can I configure it?*
>>
>> Should I increase *spark.memory.offHeap.size*,
>> *spark.yarn.executor.memoryOverhead*, or *spark.executor.memoryOverhead*?
>>
>> Thanks a lot,
>> Abhimanyu
>>
>> PS: I know I shouldn't use a UDF this way, but I don't have any
>> alternative here.
