The "executor memory used" metric shows storage memory for cached data, not
JVM heap usage. You're running out of memory somewhere, most likely in your
UDF, which probably parses each huge XML doc into a full DOM first. Use more
memory, run fewer tasks per executor, or consider spark-xml if you really
only need pieces of each document; it will be more efficient.
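
Roughly, something like the sketch below (the spark-xml package version,
rowTag, input path, and overhead values are made-up placeholders; adjust
them for your cluster):

    // Submit with the spark-xml package on the classpath, fewer concurrent
    // tasks per executor, and some extra memory headroom, e.g.:
    //   spark-submit \
    //     --packages com.databricks:spark-xml_2.12:0.14.0 \
    //     --conf spark.executor.cores=1 \
    //     --conf spark.executor.memoryOverhead=4g \
    //     ...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("xml-parse").getOrCreate()

    // spark-xml extracts just the repeated elements you need, row by row,
    // instead of a UDF materializing each whole document as a DOM.
    val elements = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")        // placeholder: your repeating XML element
      .load("/path/to/xml/files")        // placeholder input path

    val deduped = elements.distinct()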

On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <
abhimanyu.kr.sing...@gmail.com> wrote:

> I'm doing some complex operations inside a Spark UDF (parsing huge XMLs).
>
> Dataframe:
> | value |
> | Content of XML File 1 |
> | Content of XML File 2 |
> | Content of XML File N |
>
> val df = Dataframe.select(UDF_to_parse_xml(col("value")))
>
> The UDF looks something like:
>
> val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
> val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
>
> Parsing requires creating and de-duplicating arrays from the XML, which
> contains around 100,000 elements (each consisting of MyClass(Strings, Maps,
> Integers, ...)).
>
> In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark
> processing still fails with an *ExecutorLostFailure* error for XMLs of
> around 2 GB in size.
> When I increase the executor memory (say, from 15 GB to 25 GB) it works
> fine. A partition can contain only one XML file (max size 2 GB), and one
> task per executor runs in parallel.
>
> *My question is: which memory does the UDF use for storing arrays, maps,
> or sets while parsing?*
> *And how can I configure it?*
>
> Should I increase *spark.memory.offHeap.size*,
> spark.yarn.executor.memoryOverhead, or spark.executor.memoryOverhead?
>
> Thanks a lot,
> Abhimanyu
>
> PS: I know I shouldn't use UDF this way, but I don't have any other
> alternative here.
>
