Hi

I am trying to use the Arrow GLib API to read/write from C. Specifically, while 
Arrow is a columnar format, I'm really excited to be able to write a lot of 
rows from a C-like runtime and access them from Python for analytics as an array 
per column, and vice versa.

To get a quick example running, I created an Arrow file in Python with 100 
million rows (100 record batches of 1 million rows each) as follows:
```py
import numpy as np
import pyarrow as pa

foo = {
    "colA": np.arange(0, 1000_000),
    "colB": [np.arange(1, 5)] * 1000_000
}

table = pa.table(foo)
with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
    for _ in range(100):
        writer.write_table(table)
```

However, reading the ListArray column through the GLib API is really slow: it 
takes about 5 seconds per record batch of a million rows, whereas the integer 
column can be iterated over the entire table in under 2 seconds.

The relevant snippet is this:
```C
    guint num_batches = 100;
    for (i = 0; i < num_batches; i++) {
        GArrowRecordBatch *record_batch =
            garrow_record_batch_file_reader_read_record_batch(reader, i, &error);

        /* Column 1 is colB, the list column. */
        GArrowArray *column = garrow_record_batch_get_column_data(record_batch, 1);
        gint64 length_list = garrow_array_get_length(column);
        GArrowListArray *list_arr = GARROW_LIST_ARRAY(column);

        gint64 j;
        for (j = 0; j < length_list; j++) {
            /* Each call returns a new GArrowArray owned by the caller. */
            GArrowArray *list_elem = garrow_list_array_get_value(list_arr, j);
            g_object_unref(list_elem);
        }
        g_object_unref(column);
        g_object_unref(record_batch);
    }
```

I can't seem to find a quicker alternative in the public GLib API to read data 
out of a list array. Is there a way to speed up this loop?
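
To make the question concrete, here is a rough sketch in plain C (no Arrow 
dependency) of the kind of access I am hoping for. It assumes the usual Arrow 
list layout, i.e. an int32 offsets buffer plus one flat child values array, and 
it assumes the child values are int64. The ListInt64View struct and both 
functions are hypothetical names made up for illustration; they are not part of 
the GLib API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat view of a list<int64> column: row j owns
 * values[offsets[j]] .. values[offsets[j + 1] - 1]. */
typedef struct {
    const int32_t *offsets; /* n_rows + 1 entries */
    const int64_t *values;  /* flattened child values */
    size_t n_rows;
} ListInt64View;

/* Number of elements in row j. */
static int32_t list_view_row_length(const ListInt64View *view, size_t j)
{
    assert(view != NULL && j < view->n_rows);
    return view->offsets[j + 1] - view->offsets[j];
}

/* Sum every element of every row without creating a per-row object. */
static int64_t list_view_sum(const ListInt64View *view)
{
    assert(view != NULL);
    int64_t total = 0;
    for (size_t j = 0; j < view->n_rows; j++) {
        for (int32_t k = view->offsets[j]; k < view->offsets[j + 1]; k++) {
            total += view->values[k];
        }
    }
    return total;
}
```

With something like this, the inner loop would only touch two contiguous 
buffers instead of allocating one array object per row, which is what I suspect 
dominates the time in my snippet above.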


Thank you,
Ishan


