Thank you very much for the commit, Kouhei-san. I'd like to use it before
the release, so I'll build Arrow GLib directly from source once this PR is
merged.


Thank you,
Ishan
________________________________
From: Sutou Kouhei <k...@clear-code.com>
Sent: Monday, September 7, 2020 6:44 AM
To: user@arrow.apache.org <user@arrow.apache.org>
Subject: Re: [C-GLib] reading values quickly from a list array

Hi,

garrow_list_array_get_value() is a somewhat costly function
because it creates a sub list array. It doesn't copy the array
data (the data is shared), but it creates a new sub array
(a container for the data) at both the C++ level and the C level.
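
For reference, here is a minimal sketch in plain C (no Arrow GLib) of the
layout behind a list array, assuming 32-bit offsets and int64 values as in
the Arrow columnar format: all elements sit in one flat values buffer, and
list i is the slice between offsets[i] and offsets[i + 1]. The name
sum_list and the sample data are hypothetical, for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the elements of list `i` in an Arrow-style list layout.
 * `offsets` has one more entry than there are lists; list i occupies
 * values[offsets[i]] .. values[offsets[i + 1] - 1]. */
static int64_t
sum_list(const int64_t *values,
         const int32_t *offsets,
         size_t n_offsets,
         size_t i)
{
  int64_t sum = 0;
  int32_t k;

  /* List i needs both offsets[i] and offsets[i + 1]. */
  assert(i + 1 < n_offsets);
  for (k = offsets[i]; k < offsets[i + 1]; ++k) {
    sum += values[k];
  }
  return sum;
}

/* Hypothetical sample data: the lists [1, 2, 3, 4], [] and [5, 6]. */
static const int64_t sample_values[] = {1, 2, 3, 4, 5, 6};
static const int32_t sample_offsets[] = {0, 4, 4, 6};
```

Reading a list this way touches only the two buffers, so no per-list
object has to be created.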

Apache Arrow GLib 1.0.1 doesn't have low level APIs to access
list array values. Sorry. I've implemented them:
https://github.com/apache/arrow/pull/8119

It'll be included in Apache Arrow GLib 2.0.0 that will be
released in a few months.

(Can you wait for 2.0.0?)

With these APIs, you can write code like the following:

----
#include <stdlib.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  GError *error = NULL;

  GArrowMemoryMappedInputStream *input;
  input = garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
  if (!input) {
    g_print("failed to open file: %s\n", error->message);
    g_error_free(error);
    return EXIT_FAILURE;
  }

  {
    GArrowRecordBatchFileReader *reader;
    reader =
      garrow_record_batch_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input),
                                          &error);

    if (!reader) {
      g_print("failed to open file reader: %s\n", error->message);
      g_error_free(error);
      g_object_unref(input);
      return EXIT_FAILURE;
    }

    {
      guint i;
      guint num_batches = 100;
      for (i = 0; i < num_batches; i++) {
        GArrowRecordBatch *record_batch;
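        record_batch =
          garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
        if (!record_batch) {
          g_print("failed to read record batch: %s\n", error->message);
          g_error_free(error);
          g_object_unref(reader);
          g_object_unref(input);
          return EXIT_FAILURE;
        }

        GArrowArray* column =
          garrow_record_batch_get_column_data(record_batch, 1);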
        guint length_list = garrow_array_get_length(column);

        GArrowListArray* list_arr = (GArrowListArray*)column;

        GArrowInt64Array *list_values =
          GARROW_INT64_ARRAY(garrow_list_array_get_values(list_arr));
        gint64 n_list_values;
        const gint64 *raw_list_values =
          garrow_int64_array_get_values(list_values, &n_list_values);
        gint64 n_value_offsets;
        const gint32 *value_offsets =
          garrow_list_array_get_value_offsets(list_arr, &n_value_offsets);
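        gint64 j;
        /* List j spans value_offsets[j] to value_offsets[j + 1],
         * so stop before the last offset to stay in bounds. */
        for (j = 0; j + 1 < n_value_offsets; ++j) {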
          gint32 value_offset = value_offsets[j];
          gint32 value_length = value_offsets[j + 1] - value_offset;
          gint32 k;
          for (k = 0; k < value_length; ++k) {
            raw_list_values[value_offset + k];
          }
        }
        g_object_unref(list_values);

        g_object_unref(column);

        g_object_unref(record_batch);
      }
    }
    g_object_unref(reader);
  }

  g_object_unref(input);

  return EXIT_SUCCESS;
}
----

It takes 0.5sec on my machine.


Thanks,
--
kou

In <ch2pr20mb30959cc8165932970cd856c6eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "[C-GLib] reading values quickly from a list array " on Sun, 6 Sep 2020 07:40:06 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Hi
>
> I am trying to use the Arrow GLib API to read/write from C. Specifically,
> while Arrow is a columnar format, I'm really excited to be able to write a
> lot of rows from a C-like runtime and access them from Python for analytics
> as an array per column. And vice versa.
>
> To get a quick example running, I created an Arrow file in Python with 100
> million entries (100 record batches of 1 million rows each) as follows:
> ```py
> import numpy as np
> import pyarrow as pa
>
> foo = {
>     "colA": np.arange(0, 1000_000),
>     "colB": [np.arange(1, 5)] * 1000_000
> }
>
> table = pa.table(foo)
> with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
>     for _ in range(100):
>         writer.write_table(table)
> ```
>
> However, using the GLib API to read the ListArray column data is really
> slow: it takes about 5 seconds per record batch with a million entries,
> while the integer column over the entire table can be iterated in under
> 2 seconds.
>
> The relevant snippet is this:
> ```C
>     guint num_batches = 100;
>     for (i = 0; i < num_batches; i++) {
>         GArrowRecordBatch *record_batch;
>         record_batch =
>           garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
>
>         GArrowArray* column =
>           garrow_record_batch_get_column_data(record_batch, 1);
>         guint length_list = garrow_array_get_length(column);
>         GArrowListArray* list_arr = (GArrowListArray*)column;
>
>         guint j;
>         GArrowArray* list_elem;
>         for (j = 0; j < length_list; j++) {
>             list_elem = garrow_list_array_get_value(list_arr, j);
>         }
>     }
> ```
>
> I can't seem to find a quicker alternative in the public GLib API to read
> data out of a list array. Is there a way to speed up this loop?
>
>
> Thank you,
> Ishan
>
>
>
