Hi,

I’d like to propose the following improvements to the sparse tensor format and implementation.

(1) Making variable bit-width indices available

The main purpose of the first part of the proposal is to make 32-bit indices available. This allows us to serialize scipy.sparse.csr_matrix objects and similar with 32-bit indices, without converting the index arrays to 64-bit values. As Jed said in the previous discussion [1] on this mailing list, 32-bit indices have the advantage of a smaller memory footprint, so I strongly believe this change is necessary for sparse tensor support in Apache Arrow. Doing this requires adding a type field to each sparse index format and a stride field to the SparseCOOIndex format.

(2) Adding a new COO format with separate row and column indices

scipy.sparse.coo_matrix keeps the row indices and the column indices in separate numpy arrays, which is enough to represent a sparse matrix. Arrow's SparseCOOIndex, on the other hand, keeps the COO indices in a single matrix so that it can support sparse tensors of arbitrary rank. Hence we currently need to copy the indices to convert a scipy.sparse.coo_matrix to Arrow’s SparseTensor. Introducing a new COO format with separate row and column indices would resolve this issue.

(3) Adding SparseCSCIndex

The CSC format of sparse matrices has the advantage of faster column-wise scans, while the CSR format is faster for row-wise scans. Because CSC suits different access patterns than CSR, I want to support CSC before releasing Arrow 1.0.

There is a work-in-progress branch [2] for (1) above.

I’d appreciate any comments or suggestions.

[1] http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3c87pnqz70rg....@jedbrown.org%3e
[2] https://github.com/mrkn/arrow/tree/sparse_tensor_index_value_type

Regards,
Kenta Murata
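
For readers less familiar with the layouts discussed in (1) to (3), here is a minimal pure-Python sketch. It is not Arrow's or SciPy's implementation: the helper names (to_index_matrix, compress) and the sample 3x4 matrix are made up for illustration, and the standard array module stands in for numpy buffers. It shows that packing separate row/column arrays into a single index matrix needs a copy, that wider indices take at least as much memory per entry, and that CSR and CSC are the same compression applied along different axes.

```python
from array import array

# Hypothetical 3x4 matrix with 4 non-zeros, held as separate row and
# column index arrays (the scipy.sparse.coo_matrix style of layout).
rows = array("i", [0, 0, 1, 2])   # "i": C int, typically 32-bit
cols = array("i", [0, 3, 1, 2])
vals = [1.0, 2.0, 3.0, 4.0]


def to_index_matrix(rows, cols, typecode="q"):
    """Pack row/col arrays into one row-major (nnz x 2) index buffer.

    This mimics a single-matrix COO index. It always allocates a new
    buffer, which is the copy that point (2) wants to avoid.
    """
    packed = array(typecode)          # "q": C long long, at least 64-bit
    for r, c in zip(rows, cols):
        packed.append(r)
        packed.append(c)
    return packed


def compress(n_major, major, minor, vals):
    """Build (indptr, indices, data) compressed along the major axis.

    major=rows gives a CSR-style layout, major=cols a CSC-style layout.
    """
    indptr = [0] * (n_major + 1)
    for m in major:                   # count entries per major index
        indptr[m + 1] += 1
    for i in range(n_major):          # prefix sum -> start offsets
        indptr[i + 1] += indptr[i]
    next_slot = indptr[:-1]           # slicing copies; indptr stays intact
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    for m, k, v in zip(major, minor, vals):
        pos = next_slot[m]
        indices[pos] = k
        data[pos] = v
        next_slot[m] += 1
    return indptr, indices, data


packed = to_index_matrix(rows, cols)
csr = compress(3, rows, cols, vals)   # row-wise scans are contiguous
csc = compress(4, cols, rows, vals)   # column-wise scans are contiguous

print("bytes per index:", rows.itemsize, "vs", packed.itemsize)
print("CSR:", csr)
print("CSC:", csc)
```

The exact byte sizes printed depend on the platform's C types, so treat them as indicative only.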