I'm trying to do a very large aggregation of sparse matrices, where my source data looks like this:
root
 |-- device_id: string (nullable = true)
 |-- row_id: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- column_id: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Each row is meant to represent a sparse matrix in which every (row_id, column_id) combination has a value of 1.

I have a GROUPED_MAP Pandas UDF that transforms each row into a scipy.sparse.csr_matrix and, within each group, sums the matrices before returning columns of (count, row_id, column_id). It works at small scale but becomes unstable as I scale up.

Is there a way to profile this function within a Spark session, or am I limited to profiling it on pandas DataFrames without Spark?

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
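For concreteness, here is a minimal sketch of the kind of grouped-map Pandas UDF described above, plus the local-profiling fallback mentioned in the question. The fixed matrix shape (N_ROWS, N_COLS), the grouping on device_id, and the function names are assumptions for illustration, not the original code.

    import pandas as pd
    import numpy as np
    from scipy import sparse
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Assumed fixed dimensions so matrices from different rows can be summed.
    N_ROWS, N_COLS = 10000, 10000

    out_schema = "count long, row_id int, column_id int"

    def sum_sparse_pdf(pdf):
        # Build one CSR matrix per input row (each (row_id, column_id) pair = 1)
        # and accumulate the sum for the whole group.
        total = sparse.csr_matrix((N_ROWS, N_COLS), dtype=np.int64)
        for rows, cols in zip(pdf["row_id"], pdf["column_id"]):
            ones = np.ones(len(rows), dtype=np.int64)
            total += sparse.csr_matrix((ones, (rows, cols)), shape=(N_ROWS, N_COLS))
        # Return the summed matrix in coordinate form.
        coo = total.tocoo()
        return pd.DataFrame({"count": coo.data, "row_id": coo.row, "column_id": coo.col})

    # Register the plain function as a GROUPED_MAP Pandas UDF.
    sum_sparse = pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)(sum_sparse_pdf)

    # On the cluster:
    # result = df.groupBy("device_id").apply(sum_sparse)

    # Without Spark (the fallback mentioned above): collect one group to the
    # driver and profile the plain function with cProfile, e.g.
    # pdf = df.filter(df.device_id == some_id).toPandas()   # some_id: hypothetical
    # import cProfile; cProfile.run("sum_sparse_pdf(pdf)", sort="cumtime")

Keeping the group logic in a plain Python function and wrapping it separately makes it easy to run the same code on a sampled pandas DataFrame under cProfile, even if no in-Spark profiling hook is available.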