Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-30 Thread Gourav Sengupta
Hi, can you please try increasing the number of cores per task, and thereby giving each task a larger share of the executor's memory? I do not understand what the XML is, what data is in it, or what problem you are trying to solve by writing UDFs to parse XML. So maybe we are not

Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-26 Thread Sean Owen
It really depends on what your UDF is doing. 2 GB of XML can expand into much more than that once it is parsed into a DOM representation in memory. Remember that the 15 GB of executor memory is shared across tasks. You need to get a handle on how much memory your code is using to begin with, to start to reason about whether that's
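
A rough back-of-the-envelope sketch of the "shared across tasks" point, in Scala. The 15 GB figure is from this thread; the core counts are hypothetical, and the object and method names are made up for illustration. It assumes an executor runs `spark.executor.cores / spark.task.cpus` tasks at once and simply divides the heap evenly, which ignores Spark's own reserved, storage, and execution memory, so treat the result as an optimistic upper bound rather than a precise number.

```scala
object TaskMemoryEstimate {
  // Tasks an executor runs at the same time:
  // spark.executor.cores divided by spark.task.cpus (integer division).
  def concurrentTasks(executorCores: Int, taskCpus: Int): Int = {
    require(taskCpus > 0 && executorCores >= taskCpus,
      "need 0 < taskCpus <= executorCores")
    executorCores / taskCpus
  }

  // Naive even split of the executor heap across concurrent tasks.
  // Ignores Spark's internal memory regions, so this is an upper bound.
  def heapPerTaskGb(executorMemoryGb: Double, executorCores: Int, taskCpus: Int): Double =
    executorMemoryGb / concurrentTasks(executorCores, taskCpus)

  def main(args: Array[String]): Unit = {
    val executorMemoryGb = 15.0 // from the thread
    val executorCores = 5       // hypothetical value
    for (taskCpus <- Seq(1, 5)) { // hypothetical values
      val share = heapPerTaskGb(executorMemoryGb, executorCores, taskCpus)
      println(s"spark.task.cpus=$taskCpus: at most $share GB per task")
    }
  }
}
```

With these assumed settings, one CPU per task leaves each task at most a fifth of the heap, while giving a task all five cores leaves it the whole heap. That is the trade-off to weigh against how large the parsed XML becomes in memory.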

Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-26 Thread Abhimanyu Kumar Singh
Thanks for your quick response. For some reason I can't use spark-xml (a schema-related issue). I've tried reducing the number of tasks per executor by increasing the number of executors, but it still throws the same error. I can't understand why even 15 GB of executor memory is not sufficient to

Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-26 Thread Sean Owen
The "executor memory used" figure shows data that is cached, not the JVM's memory usage. You're running out of memory somewhere, likely in your UDF, which probably parses massive XML docs as a DOM first or something. Use more memory, fewer tasks per executor, or consider using spark-xml if you are really just parsing
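
If the DOM guess is right, one possible alternative (not something proposed in this thread) is to stream through the XML instead of materializing the whole tree. Below is a minimal sketch using the JDK's StAX API (`javax.xml.stream`). The object name, the function name, and the tag name in the usage note are hypothetical, and it assumes the elements of interest contain only text, since `getElementText` throws on nested child elements.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ArrayBuffer

object StreamingXmlSketch {
  // Collects the text of every <targetTag> element by walking parser
  // events one at a time, so no DOM tree is ever built.
  def extractText(xml: String, targetTag: String): Seq[String] = {
    val factory = XMLInputFactory.newInstance()
    // Turn off DTDs and external entities for untrusted input.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, java.lang.Boolean.FALSE)
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, java.lang.Boolean.FALSE)

    val reader = factory.createXMLStreamReader(new StringReader(xml))
    val out = ArrayBuffer.empty[String]
    try {
      while (reader.hasNext) {
        val event = reader.next()
        if (event == XMLStreamConstants.START_ELEMENT && reader.getLocalName == targetTag) {
          // Assumes text-only content; leaves the reader on the end tag.
          out += reader.getElementText
        }
      }
    } finally {
      reader.close()
    }
    out.toSeq
  }
}
```

For example, `StreamingXmlSketch.extractText("<root><item>a</item><item>b</item></root>", "item")` yields the two strings `a` and `b`. A function like this could presumably be wrapped with `org.apache.spark.sql.functions.udf`, but note that the input string and the collected results still live on the executor heap, so this only removes the DOM overhead, not the cost of the document itself.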

[Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-26 Thread Abhimanyu Kumar Singh
I'm doing some complex operations inside spark UDF (parsing huge XML).

Dataframe:

| value                 |
| Content of XML File 1 |
| Content of XML File 2 |
| Content of XML File N |

val df = Dataframe.select(UDF_to_parse_xml(value))

UDF looks something like:

val XMLelements : Array[MyClass1] =