Hi,

I’d like to propose the following improvements to the sparse tensor
format and implementation.

(1) Making variable bit-width indices available

The main purpose of the first part of the proposal is to make 32-bit
indices available.  This allows us to serialize objects such as
scipy.sparse.csr_matrix with 32-bit indices without converting the
index arrays to 64-bit values.  As Jed said in the previous discussion
[1] on this ML, 32-bit indices have the advantage of a smaller memory
footprint, so I strongly believe this change is necessary for sparse
tensor support in Apache Arrow.  Doing this requires adding a type
field to each sparse index format and a stride field to the
SparseCOOIndex format.
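
As a rough illustration of the memory argument, here is a minimal
sketch using plain numpy arrays as stand-ins for the CSR index arrays.
The array contents are made up for the example, and nothing here is
Arrow API; it only shows that widening 32-bit indices to 64-bit forces
a copy and doubles the index storage.

```python
import numpy as np

# Hypothetical CSR index arrays for a 3x4 matrix with 5 nonzeros,
# held as 32-bit integers.
indptr = np.array([0, 2, 3, 5], dtype=np.int32)
indices = np.array([0, 3, 1, 0, 2], dtype=np.int32)

# Widening to 64-bit allocates new arrays with twice the item size.
indptr64 = indptr.astype(np.int64)
indices64 = indices.astype(np.int64)

bytes32 = indptr.nbytes + indices.nbytes
bytes64 = indptr64.nbytes + indices64.nbytes

# True when the widened array does not reuse the original buffer.
copied = not np.shares_memory(indices, indices64)
```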

(2) Adding a new COO format with separate row and column indices

scipy.sparse.coo_matrix manages the row and column indices in
separate numpy arrays, which is enough to represent a sparse
matrix.  Arrow's SparseCOOIndex, on the other hand, manages the COO
indices as one matrix so that it can support sparse tensors of
arbitrary rank.  Hence we need to copy the indices to convert a
scipy.sparse.coo_matrix to Arrow’s SparseTensor.  Introducing a new
COO format with separate row and column indices would resolve this
issue.
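
The copy can be seen in a small numpy sketch.  The row and col arrays
below are made-up stand-ins for the index arrays of a coo_matrix, and
the (nnz, 2) shape of the combined matrix is my assumption about the
layout for illustration only; the point is just that packing two
separate arrays into one matrix allocates new memory.

```python
import numpy as np

# Hypothetical COO coordinates of 4 nonzeros, kept in separate arrays.
row = np.array([0, 0, 1, 2], dtype=np.int64)
col = np.array([0, 3, 1, 2], dtype=np.int64)

# Packing them into a single index matrix (assumed shape: nnz x 2)
# creates a new array rather than a view of row and col.
coords = np.stack([row, col], axis=1)

shares_buffer = (np.shares_memory(coords, row)
                 or np.shares_memory(coords, col))
```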

(3) Adding SparseCSCIndex

The CSC format of sparse matrices has the advantage of faster scanning
in the column direction, while the CSR format is faster for row-wise
scans.  Because CSC suits different use cases than CSR, I want to
support CSC before releasing Arrow 1.0.
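
To make the difference concrete, here is a small sketch in plain
Python.  The helper names csc_column and csr_column are hypothetical
and the matrix is made up; it only shows that reading one column is a
contiguous slice in CSC, while in CSR every row has to be visited.

```python
# The same 3x4 matrix in both layouts:
# [[1, 0, 0, 2],
#  [0, 3, 0, 0],
#  [4, 0, 5, 0]]
csr = {"indptr": [0, 2, 3, 5], "indices": [0, 3, 1, 0, 2],
       "data": [1, 2, 3, 4, 5]}
csc = {"indptr": [0, 2, 3, 4, 5], "indices": [0, 2, 1, 2, 0],
       "data": [1, 4, 3, 5, 2]}


def csc_column(m, j):
    """Return (row indices, values) of column j: one contiguous slice."""
    start, end = m["indptr"][j], m["indptr"][j + 1]
    return m["indices"][start:end], m["data"][start:end]


def csr_column(m, j):
    """Return (row indices, values) of column j by scanning every row."""
    rows, vals = [], []
    for i in range(len(m["indptr"]) - 1):
        for k in range(m["indptr"][i], m["indptr"][i + 1]):
            if m["indices"][k] == j:
                rows.append(i)
                vals.append(m["data"][k])
    return rows, vals
```

Both helpers return the same column, but csr_column touches all stored
entries while csc_column touches only those of the requested column.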

There is a work-in-progress branch [2] for (1) above.  I’d appreciate
any comments or suggestions.

[1] http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3c87pnqz70rg....@jedbrown.org%3e

[2] https://github.com/mrkn/arrow/tree/sparse_tensor_index_value_type

Regards,
Kenta Murata
