I'm doing some complex operations inside a Spark UDF (parsing huge XML files).

DataFrame:
| value                 |
| Content of XML File 1 |
| Content of XML File 2 |
| Content of XML File N |

val df = inputDF.select(UDF_to_parse_xml(col("value")))

UDF looks something like:

val XMLelements : Array[MyClass1] = getXMLelements(xmlContent)
val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
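
For completeness, this is roughly how the whole thing is wired up (a minimal, self-contained sketch; MyClass1, MyClass2, getXMLelements, myfunction and inputDF here are simplified stand-ins for my actual parsing code and data):

import org.apache.spark.sql.functions.{col, udf}

// Simplified stand-ins for the parsed element types (the real ones hold Strings, Maps, Integers, ...)
case class MyClass1(raw: String)
case class MyClass2(key: String, attrs: Map[String, Int])

// Placeholder parsing logic; the real getXMLelements builds roughly 100,000 elements per document
def getXMLelements(xmlContent: String): Array[MyClass1] =
  xmlContent.split("\n").map(MyClass1(_))

def myfunction(e: MyClass1): MyClass2 =
  MyClass2(e.raw, Map("length" -> e.raw.length))

// The UDF materialises the full element array and the de-duplicated result in executor memory
val UDF_to_parse_xml = udf { (xmlContent: String) =>
  val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
  val myResult: Array[MyClass2]    = XMLelements.map(myfunction).distinct
  myResult
}

val df = inputDF.select(UDF_to_parse_xml(col("value")))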

Parsing requires building and de-duplicating arrays from the XML containing
around 100,000 elements (instances of MyClass1/MyClass2 holding Strings, Maps,
Integers, ...).

In the Spark UI, "executor memory used" is barely 60-70 MB, yet the job still
fails with an ExecutorLostFailure error for XMLs around 2 GB in size.
When I increase the executor memory (say, from 15 GB to 25 GB) it works fine.
One partition contains only one XML file (max size 2 GB), and one task runs
per executor in parallel.

My question is: which memory does the UDF use for storing the arrays, maps,
and sets it builds while parsing? And how can I configure it?

Should I increase spark.memory.offHeap.size,
spark.yarn.executor.memoryOverhead, or spark.executor.memoryOverhead?
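
Just to make the property names concrete, this is roughly how I would set them (values are only illustrative; normally I'd pass them via spark-submit --conf or spark-defaults.conf rather than in code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("xml-parsing")
  .config("spark.executor.memory", "15g")
  .config("spark.executor.memoryOverhead", "4g")   // spark.yarn.executor.memoryOverhead on older Spark-on-YARN versions
  .config("spark.memory.offHeap.enabled", "true")  // required for spark.memory.offHeap.size to take effect
  .config("spark.memory.offHeap.size", "4g")
  .getOrCreate()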

Thanks a lot,
Abhimanyu

PS: I know I shouldn't be using a UDF this way, but I don't have any
alternative here.
