Thanks for your quick response. For some reason I can't use spark-xml (a schema-related issue).
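To get a rough sense of how much heap the parsing itself needs, one idea is to log used heap around the parse inside the UDF body. This is only a sketch: `parseXml` is a hypothetical stand-in for the real parsing logic, the `println` ends up in the executor's stderr log, and the numbers are JVM-wide rather than strictly per-task (all tasks on an executor share one JVM), so treat them as an upper bound.

```scala
// Rough heap probe around a parse call. `parseXml` is a placeholder for
// whatever the real UDF body does with the XML string.
def usedHeapMb(): Long = {
  val rt = Runtime.getRuntime
  // Used heap = total heap currently committed minus the free part of it.
  (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L)
}

def parseWithHeapLog[T](xml: String)(parseXml: String => T): T = {
  val before = usedHeapMb()
  val result = parseXml(xml)
  // Appears in the executor's stderr log, viewable from the Spark UI.
  println(s"heap before/after parse: $before MB / ${usedHeapMb()} MB")
  result
}
```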
I've tried reducing the number of tasks per executor by increasing the number of executors, but it still throws the same error. I can't understand why even 15 GB of executor memory is not sufficient to parse just a 2 GB XML file. How can I check the maximum amount of JVM memory used by each task? Do I need to tweak some configuration other than spark.executor.memory to increase JVM memory?

On Wed, Jan 26, 2022, 9:23 PM Sean Owen <sro...@gmail.com> wrote:

> Executor memory used shows data that is cached, not the VM usage. You're
> running out of memory somewhere, likely in your UDF, which probably parses
> massive XML docs as a DOM first or something. Use more memory, fewer tasks
> per executor, or consider using spark-xml if you are really just parsing
> pieces of it. It'll be more efficient.
>
> On Wed, Jan 26, 2022 at 9:47 AM Abhimanyu Kumar Singh <
> abhimanyu.kr.sing...@gmail.com> wrote:
>
>> I'm doing some complex operations inside a Spark UDF (parsing huge XML).
>>
>> Dataframe:
>> | value                 |
>> | Content of XML File 1 |
>> | Content of XML File 2 |
>> | Content of XML File N |
>>
>> val df = Dataframe.select(UDF_to_parse_xml(value))
>>
>> The UDF looks something like:
>>
>> val XMLelements: Array[MyClass1] = getXMLelements(xmlContent)
>> val myResult: Array[MyClass2] = XMLelements.map(myfunction).distinct
>>
>> Parsing requires creating and de-duplicating arrays from an XML file
>> containing around 0.1 million elements (each a MyClass of Strings, Maps,
>> Integers, ...).
>>
>> In the Spark UI, "executor memory used" is barely 60-70 MB, but Spark
>> processing still fails with an *ExecutorLostFailure* error for XMLs of
>> around 2 GB. When I increase the executor memory (say, from 15 GB to
>> 25 GB) it works fine. One partition contains only one XML file (max size
>> 2 GB), and one task per executor runs in parallel.
>>
>> *My question is: which memory is used by the UDF for storing arrays,
>> maps, or sets while parsing? And how can I configure it?*
>>
>> Should I increase spark.memory.offHeap.size,
>> spark.yarn.executor.memoryOverhead, or spark.executor.memoryOverhead?
>>
>> Thanks a lot,
>> Abhimanyu
>>
>> PS: I know I shouldn't use a UDF this way, but I don't have any other
>> alternative here.
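Since the suggestion above is that the UDF likely builds a full DOM before de-duplicating, a streaming (StAX) pass can cut the peak heap dramatically: only the current element is held in memory, and a Set de-duplicates on the fly instead of materialising an Array first. This is only a sketch under assumptions about the document shape: the element name "record" and the use of its text as the de-duplication key are hypothetical placeholders for the real structure.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable

// Streaming parse: walk the XML event by event, collecting de-duplicated
// values without ever building a DOM of the whole 2 GB document.
def parseStreaming(xml: String): Set[String] = {
  val reader =
    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
  val seen = mutable.Set.empty[String]
  try {
    while (reader.hasNext) {
      // "record" stands in for whatever element actually carries the data.
      if (reader.next() == XMLStreamConstants.START_ELEMENT &&
          reader.getLocalName == "record") {
        seen += reader.getElementText // placeholder de-duplication key
      }
    }
  } finally reader.close()
  seen.toSet
}
```

Called from inside the UDF in place of the DOM-based `getXMLelements`, this keeps the working set proportional to the number of distinct keys rather than to the document size.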