I'm trying to do a very large aggregation of sparse matrices, where my source data looks like this:
root
 |-- device_id: string (nullable = true)
 |-- row_id: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- column_id: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Each row is meant to represent a sparse matrix in which every (row_id, column_id) combination has a value of 1.

I have a GROUPED_MAP Pandas UDF that transforms each row into a scipy.sparse.csr_matrix and, within each group, sums the matrices before returning columns of (count, row_id, column_id). It works at small scale but becomes unstable as I scale up.

Is there a way to profile this function within a Spark session, or am I limited to profiling it on pandas DataFrames without Spark?

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
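For concreteness, here is a minimal sketch of the kind of grouped-map Pandas UDF described above, plus the local-profiling fallback mentioned in the question. The fixed matrix shape (N_ROWS, N_COLS), the grouping on device_id, and the function names are assumptions for illustration, not the original code.

    import pandas as pd
    import numpy as np
    from scipy import sparse
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Assumed fixed dimensions so matrices from different rows can be summed.
    N_ROWS, N_COLS = 10000, 10000

    out_schema = "count long, row_id int, column_id int"

    def sum_sparse_pdf(pdf):
        # Build one CSR matrix per input row (each (row_id, column_id) pair = 1)
        # and accumulate the sum for the whole group.
        total = sparse.csr_matrix((N_ROWS, N_COLS), dtype=np.int64)
        for rows, cols in zip(pdf["row_id"], pdf["column_id"]):
            ones = np.ones(len(rows), dtype=np.int64)
            total += sparse.csr_matrix((ones, (rows, cols)), shape=(N_ROWS, N_COLS))
        # Return the summed matrix in coordinate form.
        coo = total.tocoo()
        return pd.DataFrame({"count": coo.data, "row_id": coo.row, "column_id": coo.col})

    # Register the plain function as a GROUPED_MAP Pandas UDF.
    sum_sparse = pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)(sum_sparse_pdf)

    # On the cluster:
    # result = df.groupBy("device_id").apply(sum_sparse)

    # Without Spark (the fallback mentioned above): collect one group to the
    # driver and profile the plain function with cProfile, e.g.
    # pdf = df.filter(df.device_id == some_id).toPandas()   # some_id: hypothetical
    # import cProfile; cProfile.run("sum_sparse_pdf(pdf)", sort="cumtime")

Keeping the group logic in a plain Python function and wrapping it separately makes it easy to run the same code on a sampled pandas DataFrame under cProfile, even if no in-Spark profiling hook is available.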