Thank you very much for the commit, Kouhei-san. I'd love to use it sooner, so I'll build Arrow GLib directly from source once this PR is merged.
Thank you,
Ishan

________________________________
From: Sutou Kouhei <k...@clear-code.com>
Sent: Monday, September 7, 2020 6:44 AM
To: user@arrow.apache.org <user@arrow.apache.org>
Subject: Re: [C-GLib] reading values quickly from a list array

Hi,

garrow_list_array_get_value() is a relatively expensive function
because it creates a sub list array. It doesn't copy the array data
(it shares it), but it creates a new sub array (a container for the
data) at both the C++ level and the C level.

Apache Arrow GLib 1.0.1 doesn't have low-level APIs to access list
array values. Sorry.

I've implemented them:
https://github.com/apache/arrow/pull/8119

They'll be included in Apache Arrow GLib 2.0.0, which will be released
in a few months. (Can you wait for 2.0.0?)

With these APIs, you can write code like the following:

----
#include <stdlib.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  GError *error = NULL;
  GArrowMemoryMappedInputStream *input;

  input = garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
  if (!input) {
    g_print("failed to open file: %s\n", error->message);
    g_error_free(error);
    return EXIT_FAILURE;
  }

  {
    GArrowRecordBatchFileReader *reader;
    reader =
      garrow_record_batch_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input),
                                          &error);
    if (!reader) {
      g_print("failed to open file reader: %s\n", error->message);
      g_error_free(error);
      g_object_unref(input);
      return EXIT_FAILURE;
    }

    {
      guint i;
      guint num_batches = 100;
      for (i = 0; i < num_batches; i++) {
        GArrowRecordBatch *record_batch;
        record_batch =
          garrow_record_batch_file_reader_read_record_batch(reader, i, &error);

        GArrowArray* column =
          garrow_record_batch_get_column_data(record_batch, 1);
        guint length_list = garrow_array_get_length(column);
        GArrowListArray* list_arr = (GArrowListArray*)column;
        /* All list elements live in one flat values array. */
        GArrowInt64Array *list_values =
          GARROW_INT64_ARRAY(garrow_list_array_get_values(list_arr));
        gint64 n_list_values;
        const gint64 *raw_list_values =
          garrow_int64_array_get_values(list_values, &n_list_values);
        gint64 n_value_offsets;
        const gint32 *value_offsets =
          garrow_list_array_get_value_offsets(list_arr, &n_value_offsets);
        guint j;
        /* The offsets buffer has length_list + 1 entries, so iterate over
         * the lists (not the offsets) to keep value_offsets[j + 1] in
         * bounds. */
        for (j = 0; j < length_list; ++j) {
          gint32 value_offset = value_offsets[j];
          gint32 value_length = value_offsets[j + 1] - value_offset;
          gint32 k;
          for (k = 0; k < value_length; ++k) {
            /* Read the k-th element of the j-th list. */
            raw_list_values[value_offset + k];
          }
        }
        g_object_unref(list_values);
        g_object_unref(column);
        g_object_unref(record_batch);
      }
    }
    g_object_unref(reader);
  }

  g_object_unref(input);
  return EXIT_SUCCESS;
}
----

It takes 0.5sec on my machine.

Thanks,
--
kou

In <ch2pr20mb30959cc8165932970cd856c6eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "[C-GLib] reading values quickly from a list array " on Sun, 6 Sep 2020 07:40:06 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Hi
>
> I am trying to use the Arrow Glib API to read/write from C. Specifically,
> while Arrow is a columnar format, I'm really excited to be able to write a
> lot of rows from a C-like runtime and access them from Python for analytics
> as an array per column. And vice versa.
>
> To get a quick example running, I created an Arrow table in Python with 100
> million entries as follows:
> ```py
> import numpy as np
> import pyarrow as pa
>
> foo = {
>     "colA": np.arange(0, 1000_000),
>     "colB": [np.arange(1, 5)] * 1000_000
> }
>
> table = pa.table(foo)
> with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
>     for _ in range(100):
>         writer.write_table(table)
> ```
>
> However, using the Glib API to read the ListArray column data looks really
> slow. It takes about 5 seconds per record batch with a million entries,
> while the integer column over the entire table can be iterated over in
> under 2 seconds.
>
> The relevant snippet is this:
> ```C
> guint num_batches = 100;
> for (i = 0; i < num_batches; i++) {
>   GArrowRecordBatch *record_batch;
>   record_batch =
>     garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
>
>   GArrowArray* column =
>     garrow_record_batch_get_column_data(record_batch, 1);
>   guint length_list = garrow_array_get_length(column);
>   GArrowListArray* list_arr = (GArrowListArray*)column;
>
>   guint j;
>   GArrowArray* list_elem;
>   for (j = 0; j < length_list; j++) {
>     list_elem = garrow_list_array_get_value(list_arr, j);
>   }
> }
> ```
>
> I can't seem to find a quicker alternative in the public Glib API to read
> data out of a list array. Is there a way to speed up this loop?
>
>
> Thank you,
> Ishan
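
Note on the layout discussed in this thread: a list array keeps every
element in one flat values buffer, plus an offsets buffer that has one
more entry than there are lists. List i spans offsets[i] up to (but not
including) offsets[i + 1], which is why walking the offsets avoids
creating a sub array per list. The following is only a minimal sketch in
plain C that does not use Arrow GLib at all; the helper names sum_list
and sum_all_lists are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Arrow GLib: sums the elements of
 * list `i` in a list array stored as a flat `values` buffer plus an
 * `offsets` buffer. List i occupies values[offsets[i]] up to
 * values[offsets[i + 1] - 1], so `offsets` must hold
 * (number of lists + 1) entries. */
int64_t
sum_list(const int64_t *values, const int32_t *offsets, size_t i)
{
  int32_t start = offsets[i];
  int32_t end = offsets[i + 1];
  int64_t sum = 0;
  int32_t k;

  assert(start <= end); /* offsets must be non-decreasing */
  for (k = start; k < end; ++k) {
    sum += values[k];
  }
  return sum;
}

/* Hypothetical helper: sums every element of every list without
 * creating any per-list object. `n_lists` is the number of lists
 * (the array length), not the number of offsets. */
int64_t
sum_all_lists(const int64_t *values, const int32_t *offsets, size_t n_lists)
{
  int64_t total = 0;
  size_t i;

  for (i = 0; i < n_lists; ++i) {
    total += sum_list(values, offsets, i);
  }
  return total;
}
```

For data shaped like the sample in this thread, where every list is
[1, 2, 3, 4] and there are no nulls, the offsets would be 0, 4, 8, and
so on, and sum_list would return 10 for each list. An empty list shows
up as two equal consecutive offsets.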