Re: [PR] feat(python): Add array creation/building from buffers [arrow-nanoarrow]

via GitHub Thu, 15 Feb 2024 03:39:33 -0800


jorisvandenbossche commented on code in PR #378:
URL: https://github.com/apache/arrow-nanoarrow/pull/378#discussion_r1490814612



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -257,7 +413,60 @@ def c_array_view(obj, requested_schema=None) -> CArrayView:
     return CArrayView.from_cpu_array(c_array(obj, requested_schema))
 
 
-def allocate_c_schema():
+def c_buffer(obj, requested_schema=None) -> CBuffer:
+    """Owning, read-only ArrowBuffer wrapper
+
+    If obj implement the Python buffer protocol, ``c_buffer()`` Wraps
+    obj in nanoarrow's owning buffer structure, the ArrowBuffer,
+    such that it can be used to construct arrays. The ownership of the
+    underlying buffer is handled by the Python buffer protocol
+    (i.e., ``PyObject_GetBuffer()`` and ``PyBuffer_Release()``).
+
+    If obj is iterable, a buffer will be allocated and populated with
+    the contents of obj according to ``requested_schema``. The
+    ``requested_schema`` parameter is required to create a buffer from
+    a Python iterable. The ``struct`` module is currently used to encode
+    values from obj into binary form.
+
+    Unlike with :func:`c_array`, ``requested_schema`` is explicitly
+    honoured (or an error will be raised).

Review Comment:
   Maybe it is also worth noting that _if_ `requested_schema` is passed, the
   input will be treated as an iterable? (This also applies if you pass
   something buffer-like, such as a numpy array.)



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -305,3 +514,101 @@ def allocate_c_array_stream():
     >>> pa_reader._export_to_c(array_stream._addr())
     """
     return CArrayStream.allocate()
+
+
+# This is a heuristic for detecting a pyarrow.Array or pyarrow.RecordBatch
+# for pyarrow < 14.0.0, after which the the __arrow_c_array__ protocol
+# is sufficient to detect such an array. This check can't use isinstance()
+# to avoid importing pyarrow unnecessarily.
+def _obj_is_pyarrow_array(obj):
+    obj_type = type(obj)
+    if obj_type.__module__ != "pyarrow.lib":
+        return False
+
+    if not obj_type.__name__.endswith("Array") and obj_type.__name__ != "RecordBatch":
+        return False
+
+    return hasattr(obj, "_export_to_c")
+
+
+def _obj_is_iterable(obj):
+    return hasattr(obj, "__iter__")

Review Comment:
   We might want to check for `__len__` as well? (because in theory you can
   have an infinite iterable or generator)
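
A minimal sketch of the combined check, assuming a hypothetical helper name `obj_is_sized_iterable` (not part of the PR). Generators and `itertools.count()` define `__iter__` but not `__len__`, so they would be rejected:

```python
import itertools


def obj_is_sized_iterable(obj) -> bool:
    # Hypothetical helper: require both __iter__ and __len__ so that
    # unbounded iterables and generators are not accepted.
    return hasattr(obj, "__iter__") and hasattr(obj, "__len__")


checks = {
    "list": obj_is_sized_iterable([1, 2, 3]),
    "generator": obj_is_sized_iterable(x for x in range(3)),
    "count": obj_is_sized_iterable(itertools.count()),
}
```

Such a check would also reject finite generators, so whether that trade-off is acceptable is a design decision.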



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -257,7 +413,60 @@ def c_array_view(obj, requested_schema=None) -> CArrayView:
     return CArrayView.from_cpu_array(c_array(obj, requested_schema))
 
 
-def allocate_c_schema():
+def c_buffer(obj, requested_schema=None) -> CBuffer:

Review Comment:
   ```suggestion
   def c_buffer(obj, schema=None) -> CBuffer:
   ```
   
   Just schema here? (in other public APIs, I think we typically use schema?)



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -257,7 +413,60 @@ def c_array_view(obj, requested_schema=None) -> CArrayView:
     return CArrayView.from_cpu_array(c_array(obj, requested_schema))
 
 
-def allocate_c_schema():
+def c_buffer(obj, requested_schema=None) -> CBuffer:
+    """Owning, read-only ArrowBuffer wrapper
+
+    If obj implement the Python buffer protocol, ``c_buffer()`` Wraps
+    obj in nanoarrow's owning buffer structure, the ArrowBuffer,
+    such that it can be used to construct arrays. The ownership of the
+    underlying buffer is handled by the Python buffer protocol
+    (i.e., ``PyObject_GetBuffer()`` and ``PyBuffer_Release()``).
+
+    If obj is iterable, a buffer will be allocated and populated with
+    the contents of obj according to ``requested_schema``. The
+    ``requested_schema`` parameter is required to create a buffer from
+    a Python iterable. The ``struct`` module is currently used to encode
+    values from obj into binary form.
+
+    Unlike with :func:`c_array`, ``requested_schema`` is explicitly
+    honoured (or an error will be raised).
+
+    Parameters
+    ----------
+
+    obj : buffer-like or iterable
+        A Python object that supports the Python buffer protocol. This includes
+        bytes, memoryview, bytearray, bulit-in types as well as numpy arrays.
+    requested_schema : The data type of the desired buffer as sanitized by
+        :func:`c_schema`. Only values that make sense as buffer types are

Review Comment:
   ```suggestion
       requested_schema :  schema-like, optional
           The data type of the desired buffer as sanitized by
           :func:`c_schema`. Only values that make sense as buffer types are
   ```



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -305,3 +514,101 @@ def allocate_c_array_stream():
     >>> pa_reader._export_to_c(array_stream._addr())
     """
     return CArrayStream.allocate()
+
+
+# This is a heuristic for detecting a pyarrow.Array or pyarrow.RecordBatch
+# for pyarrow < 14.0.0, after which the the __arrow_c_array__ protocol
+# is sufficient to detect such an array. This check can't use isinstance()
+# to avoid importing pyarrow unnecessarily.
+def _obj_is_pyarrow_array(obj):
+    obj_type = type(obj)
+    if obj_type.__module__ != "pyarrow.lib":

Review Comment:
   ```suggestion
       if not obj_type.__module__.startswith("pyarrow"):
   ```
   
   That might be more robust in case pyarrow would change the module? (i.e. we
   could make pyarrow.lib.Array _look_ like it is pyarrow.Array, if we wanted)
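
One thing to keep in mind with a bare `startswith("pyarrow")` is that it also matches any unrelated top-level package whose name merely begins with that string. A slightly stricter sketch, using a hypothetical helper `module_is_pyarrow` that is not in the PR:

```python
def module_is_pyarrow(module_name: str) -> bool:
    # Hypothetical helper: accept the top-level package or any of its
    # submodules, but not other packages sharing the same name prefix.
    return module_name == "pyarrow" or module_name.startswith("pyarrow.")


results = [
    module_is_pyarrow("pyarrow"),
    module_is_pyarrow("pyarrow.lib"),
    module_is_pyarrow("pyarrow_extras"),  # made-up package name
]
```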



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -1177,11 +1429,395 @@ cdef class CBufferView:
         buffer.strides = &self._strides
         buffer.suboffsets = NULL
 
-    def __releasebuffer__(self, Py_buffer *buffer):
+    cdef _do_releasebuffer(self, Py_buffer* buffer):
         pass
 
     def __repr__(self):
-        return f"<nanoarrow.c_lib.CBufferView>\n  {_lib_utils.buffer_view_repr(self)[1:]}"
+        return f"CBufferView({_lib_utils.buffer_view_repr(self)})"
+
+
+cdef class CBuffer:
+    """Wrapper around readable owned buffer content
+
+    Like the CBufferView, the CBuffer represents readable buffer content; however,
+    unlike the CBufferView, the CBuffer always represents a valid ArrowBuffer C object.
+    """
+    cdef object _base
+    cdef ArrowBuffer* _ptr
+    cdef ArrowType _data_type
+    cdef int _element_size_bits
+    cdef char _format[32]
+    cdef CDevice _device
+    cdef CBufferView _view
+    cdef int _get_buffer_count
+
+    def __cinit__(self):
+        self._base = None
+        self._ptr = NULL
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._format[0] = 0
+        self._get_buffer_count = 0
+        self._reset_view()
+
+    cdef _assert_valid(self):
+        if self._ptr == NULL:
+            raise RuntimeError("CBuffer is not valid")
+
+    cdef _assert_buffer_count_zero(self):
+        if self._get_buffer_count != 0:
+            raise RuntimeError(
+                f"CBuffer already open ({self._get_buffer_count} ",
+                f"references, {self._writable_get_buffer_count} writable)")
+
+    cdef _reset_view(self):
+        self._view = CBufferView(None, 0, 0, NANOARROW_TYPE_BINARY, 8, self._device)
+
+    cdef _populate_view(self):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+        self._view = CBufferView(
+            self._base, <uintptr_t>self._ptr.data,
+            self._ptr.size_bytes, self._data_type, self._element_size_bits,
+            self._device
+        )
+
+    cdef _refresh_view_if_needed(self):
+        if self._get_buffer_count > 0:
+            return
+
+        self._assert_valid()
+        cdef int addr_equal = self._ptr.data == self._view._ptr.data.as_uint8
+        cdef int size_equal = self._ptr.size_bytes == self._view._ptr.size_bytes
+        cdef int types_equal = self._data_type == self._view._data_type
+        cdef int element_size_equal = self._element_size_bits == self._view.element_size_bits
+        if addr_equal and size_equal and types_equal and element_size_equal:
+            return
+
+        self._populate_view()
+
+    def set_empty(self):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+        ArrowBufferReset(self._ptr)
+
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_pybuffer(self, obj):

Review Comment:
   Instead of having this "mutable" interface, essentially this function just
   creates a CBuffer from a Python buffer, right? So it could also be a class
   method, and instead of using `CBuffer().set_pybuffer(obj)` one would do
   `CBuffer.from_pybuffer(obj)`?
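
A rough illustration of the alternate-constructor pattern, using a pure-Python stand-in class (`BufferSketch` is hypothetical and is not the nanoarrow `CBuffer`; it only wraps a `memoryview`):

```python
class BufferSketch:
    """Pure-Python stand-in for CBuffer (hypothetical, for illustration only)."""

    def __init__(self, view: memoryview):
        self._view = view

    @classmethod
    def from_pybuffer(cls, obj) -> "BufferSketch":
        # memoryview() acquires the buffer via the buffer protocol and keeps
        # a reference to the exporting object alive.
        return cls(memoryview(obj))

    @property
    def size_bytes(self) -> int:
        return self._view.nbytes

    @property
    def format(self) -> str:
        return self._view.format


buf = BufferSketch.from_pybuffer(b"0123")
```

With this shape the object is fully initialized when it is returned, so there is no intermediate "empty" state for callers to observe.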



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -120,15 +157,134 @@ def c_array(obj=None, requested_schema=None) -> CArray:
             *obj.__arrow_c_array__(requested_schema=requested_schema_capsule)
         )
 
-    # for pyarrow < 14.0
-    if hasattr(obj, "_export_to_c"):
+    # Try buffer protocol (e.g., numpy arrays or a c_buffer())
+    if _obj_is_buffer(obj):
+        return _c_array_from_pybuffer(obj)
+
+    # Try import of bare capsule
+    if _obj_is_capsule(obj, "arrow_array"):
+        if requested_schema is None:
+            requested_schema_capsule = CSchema.allocate()._capsule
+        else:
+            requested_schema_capsule = requested_schema.__arrow_c_schema__()
+
+        return CArray._import_from_c_capsule(requested_schema_capsule, obj)
+
+    # Try _export_to_c for Array/RecordBatch objects if pyarrow < 14.0
+    if _obj_is_pyarrow_array(obj):
         out = CArray.allocate(CSchema.allocate())
         obj._export_to_c(out._addr(), out.schema._addr())
         return out
-    else:
-        raise TypeError(
-            f"Can't convert object of type {type(obj).__name__} to nanoarrow.c_array"
-        )
+
+    # Try import of iterable
+    if _obj_is_iterable(obj):
+        return _c_array_from_iterable(obj, requested_schema)
+
+    raise TypeError(
+        f"Can't convert object of type {type(obj).__name__} to nanoarrow.c_array"
+    )
+
+
+def c_array_from_buffers(
+    schema,
+    length: int,
+    buffers: Iterable[Any],
+    null_count: int = -1,
+    offset: int = 0,
+    children: Iterable[Any] = (),
+    validation_level: Literal["full", "default", "minimal", "none"] = "default",
+) -> CArray:
+    """Create an ArrowArray wrapper from components
+
+    Given a schema, build an ArrowArray buffer-wise. This allows almost any array
+    to be assembled; however, requires some knowledge of the Arrow Columnar
+    specification. This function will do its best to validate the sizes and
+    content of buffers according to ``validation_level``, which can be set
+    to ``"full""`` for maximum safety.
+
+    Parameters
+    ----------
+
+    schema : schema-like
+        The data type of the desired array as sanitized by :func:`c_schema`.
+    length : int
+        The length of the output array.
+    buffers : Iterable of buffer-like or None
+        An iterable of buffers as sanitized by :func:`c_buffer`. Any object
+        supporting the Python Buffer protocol is accepted. Buffer data types
+        are not checked. A buffer value of ``None`` will skip setting a buffer
+        (i.e., that buffer will be of length zero and its pointer will
+        be ``NULL``).
+    null_count : int, optional
+        The number of null values, if known in advance. If -1 (the default),
+        the null count will be calculated based on the validity bitmap. If
+        the validity bitmap was set to ``None``, the calculated null count
+        will be zero.
+    offset : int, optional
+        The logical offset from the start of the array.
+    children : Iterable of array-like
+        An iterable of arrays used to set child fields of the array. Can contain
+        any object accepted by :func:`c_array`. Must contain the exact number of
+        required children as specifed by ``schema``.
+    validation_level: str, optional
+        One of "none" (no check), "minimal" (check buffer sizes that do not require
+        dereferencing buffer content), "default" (check all buffer sizes), or "full"
+        (check all buffer sizes and all buffer content).
+
+    Examples
+    --------
+
+    >>> import nanoarrow as na
+    >>> c_array = na.c_array_from_buffers(na.uint8(), 5, [None, b"12345"])
+    >>> na.c_array_view(c_array)
+    <nanoarrow.c_lib.CArrayView>
+    - storage_type: 'uint8'
+    - length: 5
+    - offset: 0
+    - null_count: 0
+    - buffers[2]:
+      - validity <bool[0 b] >
+      - data <uint8[5 b] 49 50 51 52 53>
+    - dictionary: NULL
+    - children[0]:
+    """
+    schema = c_schema(schema)
+    builder = CArrayBuilder.allocate()
+
+    # This is slightly wasteful: it will allocate arrays recursively and we are about
+    # to immediately release them and replace them with another value. We could also
could also
+    # create an ArrowArrayView from the buffers, which would make it more
+    # straightforward to check the buffer types and avoid the extra structure
+    # allocation.
+    builder.init_from_schema(schema)
+
+    # Set buffers. This moves ownership of the buffers as well (i.e., the objects
+    # in the input buffers are replaced with an empty ArrowBuffer)

Review Comment:
   This was a bit confusing (I ran into this while trying out the method):
   
   ```
   In [74]: buf = na.c_buffer(b"0123")
   
   In [75]: na.c_array_from_buffers(na.int32(), 1, [None, buf])
   Out[75]: 
   <nanoarrow.c_lib.CArray int32>
   - length: 1
   - offset: 0
   - null_count: 0
   - buffers: (0, 139688822562528)
   - dictionary: NULL
   - children[0]:
   
   In [76]: na.c_array_from_buffers(na.int32(), 1, [None, buf])
   ---------------------------------------------------------------------------
   NanoarrowException                        Traceback (most recent call last)
   Cell In [76], line 1
   ----> 1 na.c_array_from_buffers(na.int32(), 1, [None, buf])
   
   File ~/scipy/repos/arrow-nanoarrow/python/src/nanoarrow/c_lib.py:287, in c_array_from_buffers(schema, length, buffers, null_count, offset, children, validation_level)
       284 builder.resolve_null_count()
       286 # Validate + finish
   --> 287 return builder.finish(validation_level=validation_level)
   
   File src/nanoarrow/_lib.pyx:1818, in nanoarrow._lib.CArrayBuilder.finish()
   
   File src/nanoarrow/_lib.pyx:422, in nanoarrow._lib.Error.raise_message_not_ok()
   
   File src/nanoarrow/_lib.pyx:417, in nanoarrow._lib.Error.raise_message()
   
   NanoarrowException: ArrowArrayFinishBuildingDefault() failed (22): Expected int32 array buffer 1 to have size >= 4 bytes but found buffer with 0 bytes
   
   In [77]: buf
   Out[77]: CBuffer(uint8[0 b] )
   ```
   
   When passing a Python object through the buffer protocol, the original
   object keeps owning the buffer content. Can we not do something similar for
   CBuffer? (And is that because the CArray doesn't actually keep track of
   CBuffer objects, but only of the C ArrowArray that has a pointer to the
   buffer?)
   
   And where does this actually happen? I don't see anything in
   `CArrayBuilder.set_buffer` that would invalidate the buffer?
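
For comparison, the buffer-protocol behaviour mentioned above can be sketched with the standard library alone (this uses only `bytearray` and `memoryview`, not nanoarrow). The exporter keeps owning its memory, and the consumer merely holds a view that pins it:

```python
data = bytearray(b"0123")
view = memoryview(data)  # acquires the buffer from `data`

# The consumer sees the exporter's memory; nothing is copied or moved.
exporter_is_data = view.obj is data
contents = bytes(view)

# While the view is open, the exporter refuses to reallocate its memory.
try:
    data.extend(b"4567")
    resized_while_exported = True
except BufferError:
    resized_while_exported = False

view.release()        # releases the buffer
data.extend(b"4567")  # resizing is allowed again
```

In this model the source object is never emptied by handing out a view, which is the behaviour the repeated `c_array_from_buffers()` call in the session above would need.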



##########
python/tests/test_c_array.py:
##########
@@ -0,0 +1,298 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pytest
+from nanoarrow._lib import NanoarrowException
+from nanoarrow.c_lib import CArrayBuilder
+
+import nanoarrow as na
+
+
+def test_c_array_builder_init():
+    builder = CArrayBuilder.allocate()
+    builder.init_from_type(na.Type.INT32.value)
+
+    with pytest.raises(RuntimeError, match="CArrayBuilder is already initialized"):
+        builder.init_from_type(na.Type.INT32.value)
+
+    with pytest.raises(RuntimeError, match="CArrayBuilder is already initialized"):
+        builder.init_from_schema(na.c_schema(na.int32()))
+
+
+def test_c_array_from_c_array():
+    c_array = na.c_array([1, 2, 3], na.int32())
+    c_array_from_c_array = na.c_array(c_array)
+    assert c_array_from_c_array.length == c_array.length
+    assert c_array_from_c_array.buffers == c_array.buffers
+
+
+def test_c_array_from_capsule_protocol():
+    class CArrayWrapper:
+        def __init__(self, obj):
+            self.obj = obj
+
+        def __arrow_c_array__(self, *args, **kwargs):
+            return self.obj.__arrow_c_array__(*args, **kwargs)
+
+    c_array = na.c_array([1, 2, 3], na.int32())
+    c_array_wrapper = CArrayWrapper(c_array)
+    c_array_from_protocol = na.c_array(c_array_wrapper)
+    assert c_array_from_protocol.length == c_array.length
+    assert c_array_from_protocol.buffers == c_array.buffers
+
+
+def test_c_array_from_old_pyarrow():
+    # Simulate a pyarrow Array with no __arrow_c_array__
+    class MockLegacyPyarrowArray:
+        def __init__(self, obj):
+            self.obj = obj
+
+        def _export_to_c(self, *args):
+            return self.obj._export_to_c(*args)
+
+    MockLegacyPyarrowArray.__module__ = "pyarrow.lib"
+
+    pa = pytest.importorskip("pyarrow")
+    array = MockLegacyPyarrowArray(pa.array([1, 2, 3], pa.int32()))
+
+    c_array = na.c_array(array)
+    assert c_array.length == 3
+    assert c_array.schema.format == "i"
+
+    # Make sure that this heuristic won't result in trying to import
+    # something else that has an _export_to_c method
+    with pytest.raises(TypeError, match="Can't convert object of type DataType"):
+        not_array = pa.int32()
+        assert hasattr(not_array, "_export_to_c")
+        na.c_array(not_array)
+
+
+def test_c_array_from_bare_capsule():
+    c_array = na.c_array([1, 2, 3], na.int32())
+
+    # Check from bare capsule without supplying a schema
+    schema_capsule, array_capsule = c_array.__arrow_c_array__()
+    del schema_capsule
+    c_array_from_capsule = na.c_array(array_capsule)
+    assert c_array_from_capsule.length == c_array.length
+    assert c_array_from_capsule.buffers == c_array.buffers
+
+    # Check from bare capsule supplying a schema
+    schema_capsule, array_capsule = c_array.__arrow_c_array__()
+    c_array_from_capsule = na.c_array(array_capsule, schema_capsule)
+    assert c_array_from_capsule.length == c_array.length
+    assert c_array_from_capsule.buffers == c_array.buffers
+
+
+def test_c_array_type_not_supported():
+    with pytest.raises(TypeError, match="Can't convert object of type NoneType"):
+        na.c_array(None)
+
+
+def test_c_array_from_pybuffer_uint8():
+    data = b"abcdefg"
+    c_array = na.c_array(data)
+    assert c_array.length == len(data)
+    assert c_array.null_count == 0
+    assert c_array.offset == 0
+    assert na.c_schema_view(c_array.schema).type == "uint8"
+
+    c_array_view = na.c_array_view(c_array)
+    assert list(c_array_view.buffer(1)) == list(data)
+
+
+def test_c_array_from_pybuffer_string():
+    data = b"abcdefg"
+    buffer = na.c_buffer(data).set_format("c")

Review Comment:
   This could also look like `na.c_buffer(data, format="c")`?
   (related to my comment above about those builder-like set methods on CBuffer
   itself)
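
As a standard-library analogue of reinterpreting the same bytes under a different format (an illustration only; it does not use or describe the nanoarrow API), `memoryview.cast()` returns a new view and leaves the original untouched:

```python
data = b"abcdefg"

as_uint8 = memoryview(data)   # default format for bytes is "B"
as_char = as_uint8.cast("c")  # same memory, reinterpreted as single chars

formats = (as_uint8.format, as_char.format)
first_items = (as_uint8[0], as_char[0])
```

Here the format is chosen when the view is created, not by mutating an existing one, which is the same idea as passing `format=` up front.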



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -1177,11 +1429,395 @@ cdef class CBufferView:
         buffer.strides = &self._strides
         buffer.suboffsets = NULL
 
-    def __releasebuffer__(self, Py_buffer *buffer):
+    cdef _do_releasebuffer(self, Py_buffer* buffer):
         pass
 
     def __repr__(self):
-        return f"<nanoarrow.c_lib.CBufferView>\n  {_lib_utils.buffer_view_repr(self)[1:]}"
+        return f"CBufferView({_lib_utils.buffer_view_repr(self)})"
+
+
+cdef class CBuffer:
+    """Wrapper around readable owned buffer content
+
+    Like the CBufferView, the CBuffer represents readable buffer content; however,
+    unlike the CBufferView, the CBuffer always represents a valid ArrowBuffer C object.
+    """
+    cdef object _base
+    cdef ArrowBuffer* _ptr
+    cdef ArrowType _data_type
+    cdef int _element_size_bits
+    cdef char _format[32]
+    cdef CDevice _device
+    cdef CBufferView _view
+    cdef int _get_buffer_count
+
+    def __cinit__(self):
+        self._base = None
+        self._ptr = NULL
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._format[0] = 0
+        self._get_buffer_count = 0
+        self._reset_view()
+
+    cdef _assert_valid(self):
+        if self._ptr == NULL:
+            raise RuntimeError("CBuffer is not valid")
+
+    cdef _assert_buffer_count_zero(self):
+        if self._get_buffer_count != 0:
+            raise RuntimeError(
+                f"CBuffer already open ({self._get_buffer_count} ",
+                f"references, {self._writable_get_buffer_count} writable)")
+
+    cdef _reset_view(self):
+        self._view = CBufferView(None, 0, 0, NANOARROW_TYPE_BINARY, 8, self._device)
+
+    cdef _populate_view(self):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+        self._view = CBufferView(
+            self._base, <uintptr_t>self._ptr.data,
+            self._ptr.size_bytes, self._data_type, self._element_size_bits,
+            self._device
+        )
+
+    cdef _refresh_view_if_needed(self):
+        if self._get_buffer_count > 0:
+            return
+
+        self._assert_valid()
+        cdef int addr_equal = self._ptr.data == self._view._ptr.data.as_uint8
+        cdef int size_equal = self._ptr.size_bytes == self._view._ptr.size_bytes
+        cdef int types_equal = self._data_type == self._view._data_type
+        cdef int element_size_equal = self._element_size_bits == self._view.element_size_bits
+        if addr_equal and size_equal and types_equal and element_size_equal:
+            return
+
+        self._populate_view()
+
+    def set_empty(self):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+        ArrowBufferReset(self._ptr)
+
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_pybuffer(self, obj):

Review Comment:
   What I find confusing about those `set_` methods is that they make
   CBuffer look like a builder class, while there _is_ a separate BufferBuilder
   class?



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -1177,11 +1429,395 @@ cdef class CBufferView:
         buffer.strides = &self._strides
         buffer.suboffsets = NULL
 
-    def __releasebuffer__(self, Py_buffer *buffer):
+    cdef _do_releasebuffer(self, Py_buffer* buffer):
         pass
 
     def __repr__(self):
-        return f"<nanoarrow.c_lib.CBufferView>\n  {_lib_utils.buffer_view_repr(self)[1:]}"
+        return f"CBufferView({_lib_utils.buffer_view_repr(self)})"
+
+
+cdef class CBuffer:
+    """Wrapper around readable owned buffer content
+
+    Like the CBufferView, the CBuffer represents readable buffer content; however,
+    unlike the CBufferView, the CBuffer always represents a valid ArrowBuffer C object.
+    """
+    cdef object _base
+    cdef ArrowBuffer* _ptr
+    cdef ArrowType _data_type
+    cdef int _element_size_bits
+    cdef char _format[32]
+    cdef CDevice _device
+    cdef CBufferView _view
+    cdef int _get_buffer_count
+
+    def __cinit__(self):
+        self._base = None
+        self._ptr = NULL
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._format[0] = 0
+        self._get_buffer_count = 0
+        self._reset_view()
+
+    cdef _assert_valid(self):
+        if self._ptr == NULL:
+            raise RuntimeError("CBuffer is not valid")
+
+    cdef _assert_buffer_count_zero(self):
+        if self._get_buffer_count != 0:
+            raise RuntimeError(
+                f"CBuffer already open ({self._get_buffer_count} ",
+                f"references, {self._writable_get_buffer_count} writable)")
+
+    cdef _reset_view(self):
+        self._view = CBufferView(None, 0, 0, NANOARROW_TYPE_BINARY, 8, self._device)
+
+    cdef _populate_view(self):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+        self._view = CBufferView(
+            self._base, <uintptr_t>self._ptr.data,
+            self._ptr.size_bytes, self._data_type, self._element_size_bits,
+            self._device
+        )
+
+    cdef _refresh_view_if_needed(self):
+        if self._get_buffer_count > 0:
+            return
+
+        self._assert_valid()
+        cdef int addr_equal = self._ptr.data == self._view._ptr.data.as_uint8
+        cdef int size_equal = self._ptr.size_bytes == self._view._ptr.size_bytes
+        cdef int types_equal = self._data_type == self._view._data_type
+        cdef int element_size_equal = self._element_size_bits == self._view.element_size_bits
+        if addr_equal and size_equal and types_equal and element_size_equal:
+            return
+
+        self._populate_view()
+
+    def set_empty(self):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+        ArrowBufferReset(self._ptr)
+
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_pybuffer(self, obj):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+
+        self.set_format(c_buffer_set_pybuffer(obj, &self._ptr))
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_format(self, str format):
+        self._assert_buffer_count_zero()
+        element_size_bytes, data_type = c_arrow_type_from_format(format)
+        self._data_type = data_type
+        self._element_size_bits = element_size_bytes * 8
+        format_bytes = format.encode("UTF-8")
+        snprintf(self._format, sizeof(self._format), "%s", <const char*>format_bytes)
+        return self
+
+    def set_data_type(self, ArrowType type_id, int element_size_bits=0):

Review Comment:
   And the same for `set_empty`, which in practice is also used for
   CBufferBuilder (except for a test where it's used to create an empty CBuffer,
   but those tests are explicitly testing the `set_empty` method)



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -1177,11 +1429,395 @@ cdef class CBufferView:
         buffer.strides = &self._strides
         buffer.suboffsets = NULL
 
-    def __releasebuffer__(self, Py_buffer *buffer):
+    cdef _do_releasebuffer(self, Py_buffer* buffer):
         pass
 
     def __repr__(self):
-        return f"<nanoarrow.c_lib.CBufferView>\n  {_lib_utils.buffer_view_repr(self)[1:]}"
+        return f"CBufferView({_lib_utils.buffer_view_repr(self)})"
+
+
+cdef class CBuffer:
+    """Wrapper around readable owned buffer content
+
+    Like the CBufferView, the CBuffer represents readable buffer content; however,
+    unlike the CBufferView, the CBuffer always represents a valid ArrowBuffer C object.
+    """
+    cdef object _base
+    cdef ArrowBuffer* _ptr
+    cdef ArrowType _data_type
+    cdef int _element_size_bits
+    cdef char _format[32]
+    cdef CDevice _device
+    cdef CBufferView _view
+    cdef int _get_buffer_count
+
+    def __cinit__(self):
+        self._base = None
+        self._ptr = NULL
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._format[0] = 0
+        self._get_buffer_count = 0
+        self._reset_view()
+
+    cdef _assert_valid(self):
+        if self._ptr == NULL:
+            raise RuntimeError("CBuffer is not valid")
+
+    cdef _assert_buffer_count_zero(self):
+        if self._get_buffer_count != 0:
+            raise RuntimeError(
+                f"CBuffer already open ({self._get_buffer_count} ",
+                f"references, {self._writable_get_buffer_count} writable)")
+
+    cdef _reset_view(self):
+        self._view = CBufferView(None, 0, 0, NANOARROW_TYPE_BINARY, 8, self._device)
+
+    cdef _populate_view(self):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+        self._view = CBufferView(
+            self._base, <uintptr_t>self._ptr.data,
+            self._ptr.size_bytes, self._data_type, self._element_size_bits,
+            self._device
+        )
+
+    cdef _refresh_view_if_needed(self):
+        if self._get_buffer_count > 0:
+            return
+
+        self._assert_valid()
+        cdef int addr_equal = self._ptr.data == self._view._ptr.data.as_uint8
+        cdef int size_equal = self._ptr.size_bytes == self._view._ptr.size_bytes
+        cdef int types_equal = self._data_type == self._view._data_type
+        cdef int element_size_equal = self._element_size_bits == self._view.element_size_bits
+        if addr_equal and size_equal and types_equal and element_size_equal:
+            return
+
+        self._populate_view()
+
+    def set_empty(self):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+        ArrowBufferReset(self._ptr)
+
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_pybuffer(self, obj):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+
+        self.set_format(c_buffer_set_pybuffer(obj, &self._ptr))
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_format(self, str format):
+        self._assert_buffer_count_zero()
+        element_size_bytes, data_type = c_arrow_type_from_format(format)
+        self._data_type = data_type
+        self._element_size_bits = element_size_bytes * 8
+        format_bytes = format.encode("UTF-8")
+        snprintf(self._format, sizeof(self._format), "%s", <const char*>format_bytes)
+        return self
+
+    def set_data_type(self, ArrowType type_id, int element_size_bits=0):

Review Comment:
   Related to the comment just above, does this method need to live on CBuffer?
   As in practice, it only seems to be used for CBufferBuilder?



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -305,3 +514,101 @@ def allocate_c_array_stream():
     >>> pa_reader._export_to_c(array_stream._addr())
     """
     return CArrayStream.allocate()
+
+
+# This is a heuristic for detecting a pyarrow.Array or pyarrow.RecordBatch
+# for pyarrow < 14.0.0, after which the the __arrow_c_array__ protocol
+# is sufficient to detect such an array. This check can't use isinstance()
+# to avoid importing pyarrow unnecessarily.
+def _obj_is_pyarrow_array(obj):
+    obj_type = type(obj)
+    if obj_type.__module__ != "pyarrow.lib":

Review Comment:
   Although in practice we currently use this function only for _old_ pyarrow 
versions, where it will of course no longer change.
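
   As an illustration of this kind of import-free check, here is a minimal 
sketch. It is not the PR's implementation: `looks_like_legacy_pyarrow_array` 
and `FakeLegacyArray` are hypothetical names, and the `hasattr` test on 
`_export_to_c` is an assumption about what else such a heuristic might inspect.

```python
def looks_like_legacy_pyarrow_array(obj):
    """Hypothetical helper: guess whether obj comes from pyarrow without importing it."""
    obj_type = type(obj)
    # Compare the defining module by name so pyarrow is never imported here.
    if obj_type.__module__ != "pyarrow.lib":
        return False
    # Assumption: objects of interest expose the legacy C export method.
    return hasattr(obj, "_export_to_c")


class FakeLegacyArray:
    """Stand-in that pretends to be defined in pyarrow.lib (illustration only)."""

    __module__ = "pyarrow.lib"

    def _export_to_c(self, *addresses):
        pass


is_fake_detected = looks_like_legacy_pyarrow_array(FakeLegacyArray())
is_list_detected = looks_like_legacy_pyarrow_array([1, 2, 3])
```

Because the check compares a module name string, it only holds as long as the 
class keeps living in `pyarrow.lib`, which is not a concern for releases that 
are already published.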



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -1177,11 +1429,395 @@ cdef class CBufferView:
         buffer.strides = &self._strides
         buffer.suboffsets = NULL
 
-    def __releasebuffer__(self, Py_buffer *buffer):
+    cdef _do_releasebuffer(self, Py_buffer* buffer):
         pass
 
     def __repr__(self):
-        return f"<nanoarrow.c_lib.CBufferView>\n  {_lib_utils.buffer_view_repr(self)[1:]}"
+        return f"CBufferView({_lib_utils.buffer_view_repr(self)})"
+
+
+cdef class CBuffer:
+    """Wrapper around readable owned buffer content
+
+    Like the CBufferView, the CBuffer represents readable buffer content; however,
+    unlike the CBufferView, the CBuffer always represents a valid ArrowBuffer C object.
+    """
+    cdef object _base
+    cdef ArrowBuffer* _ptr
+    cdef ArrowType _data_type
+    cdef int _element_size_bits
+    cdef char _format[32]
+    cdef CDevice _device
+    cdef CBufferView _view
+    cdef int _get_buffer_count
+
+    def __cinit__(self):
+        self._base = None
+        self._ptr = NULL
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._format[0] = 0
+        self._get_buffer_count = 0
+        self._reset_view()
+
+    cdef _assert_valid(self):
+        if self._ptr == NULL:
+            raise RuntimeError("CBuffer is not valid")
+
+    cdef _assert_buffer_count_zero(self):
+        if self._get_buffer_count != 0:
+            raise RuntimeError(
+                f"CBuffer already open ({self._get_buffer_count} references)")
+
+    cdef _reset_view(self):
+        self._view = CBufferView(None, 0, 0, NANOARROW_TYPE_BINARY, 8, self._device)
+
+    cdef _populate_view(self):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+        self._view = CBufferView(
+            self._base, <uintptr_t>self._ptr.data,
+            self._ptr.size_bytes, self._data_type, self._element_size_bits,
+            self._device
+        )
+
+    cdef _refresh_view_if_needed(self):
+        if self._get_buffer_count > 0:
+            return
+
+        self._assert_valid()
+        cdef int addr_equal = self._ptr.data == self._view._ptr.data.as_uint8
+        cdef int size_equal = self._ptr.size_bytes == self._view._ptr.size_bytes
+        cdef int types_equal = self._data_type == self._view._data_type
+        cdef int element_size_equal = self._element_size_bits == self._view.element_size_bits
+        if addr_equal and size_equal and types_equal and element_size_equal:
+            return
+
+        self._populate_view()
+
+    def set_empty(self):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+        ArrowBufferReset(self._ptr)
+
+        self._data_type = NANOARROW_TYPE_BINARY
+        self._element_size_bits = 0
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_pybuffer(self, obj):
+        self._assert_buffer_count_zero()
+        if self._ptr == NULL:
+            self._base = alloc_c_buffer(&self._ptr)
+
+        self.set_format(c_buffer_set_pybuffer(obj, &self._ptr))
+        self._device = CDEVICE_CPU
+        self._reset_view()
+        return self
+
+    def set_format(self, str format):
+        self._assert_buffer_count_zero()
+        element_size_bytes, data_type = c_arrow_type_from_format(format)
+        self._data_type = data_type
+        self._element_size_bits = element_size_bytes * 8
+        format_bytes = format.encode("UTF-8")
+        snprintf(self._format, sizeof(self._format), "%s", <const char*>format_bytes)
+        return self
+
+    def set_data_type(self, ArrowType type_id, int element_size_bits=0):
+        self._assert_buffer_count_zero()
+        self._element_size_bits = c_format_from_arrow_type(
+            type_id,
+            element_size_bits,
+            sizeof(self._format),
+            self._format
+        )
+        self._data_type = type_id
+
+        return self
+
+    def _addr(self):
+        self._assert_valid()
+        return <uintptr_t>self._ptr.data
+
+    @property
+    def size_bytes(self):
+        self._assert_valid()
+        return self._ptr.size_bytes
+
+    @property
+    def capacity_bytes(self):
+        self._assert_valid()
+        return self._ptr.capacity_bytes
+
+    @property
+    def data_type(self):
+        return ArrowTypeString(self._data_type).decode("UTF-8")
+
+    @property
+    def data_type_id(self):
+        return self._data_type
+
+    @property
+    def element_size_bits(self):
+        return self._element_size_bits
+
+    @property
+    def item_size(self):
+        self._refresh_view_if_needed()
+        return self._view.item_size
+
+    @property
+    def format(self):
+        return self._format.decode("UTF-8")
+
+    def __len__(self):
+        self._refresh_view_if_needed()
+        return len(self._view)
+
+    def __getitem__(self, k):
+        self._refresh_view_if_needed()
+        return self._view[k]
+
+    def __iter__(self):
+        self._refresh_view_if_needed()
+        return iter(self._view)
+
+    def __getbuffer__(self, Py_buffer* buffer, int flags):
+        self._refresh_view_if_needed()
+        self._view._do_getbuffer(buffer, flags)
+        self._get_buffer_count += 1
+
+    def __releasebuffer__(self, Py_buffer* buffer):
+        if self._get_buffer_count <= 0:
+            raise RuntimeError("CBuffer buffer reference count underflow (releasebuffer)")
+
+        self._view._do_releasebuffer(buffer)
+        self._get_buffer_count -= 1
+
+    def __repr__(self):
+        if self._ptr == NULL:
+            return "CBuffer(<invalid>)"
+
+        self._refresh_view_if_needed()
+        return f"CBuffer({_lib_utils.buffer_view_repr(self._view)})"
+
+
+cdef class CBufferBuilder(CBuffer):
+    """Wrapper around writable owned buffer CPU content"""
+
+    def reserve_bytes(self, int64_t additional_bytes):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+        cdef int code = ArrowBufferReserve(self._ptr, additional_bytes)
+        Error.raise_error_not_ok("ArrowBufferReserve()", code)
+        return self
+
+    def write(self, content):
+        self._assert_valid()
+        self._assert_buffer_count_zero()
+
+        cdef Py_buffer buffer
+        cdef int64_t out
+        PyObject_GetBuffer(content, &buffer, PyBUF_ANY_CONTIGUOUS)
+
+        cdef int code = ArrowBufferReserve(self._ptr, buffer.len)
+        if code != NANOARROW_OK:
+            PyBuffer_Release(&buffer)
+            Error.raise_error("ArrowBufferReserve()", code)
+
+        code = PyBuffer_ToContiguous(
+            self._ptr.data + self._ptr.size_bytes,
+            &buffer,
+            buffer.len,
+            # 'C' order, passed as its ASCII code (0x43, i.e. decimal 67)
+            0x43
+        )
+        out = buffer.len
+        PyBuffer_Release(&buffer)
+        Error.raise_error_not_ok("PyBuffer_ToContiguous()", code)
+
+        self._ptr.size_bytes += out
+        return out
+
+    def write_values(self, obj):
+        self._assert_valid()
+
+        if self._data_type == NANOARROW_TYPE_BOOL:
+            return self._write_bits(obj)
+
+        cdef int64_t n_values = 0
+        struct_obj = Struct(self._format)
+        pack = struct_obj.pack
+        write = self.write
+
+        if self._data_type in (NANOARROW_TYPE_INTERVAL_DAY_TIME,
+                               NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO):
+            for item in obj:
+                n_values += 1
+                write(pack(*item))
+        else:
+            for item in obj:
+                n_values += 1
+                write(pack(item))
+
+        return n_values
+
+    cdef _write_bits(self, obj):
+        if self._ptr.size_bytes != 0:
+            raise NotImplementedError("Append to bitmap that has already been appended to")
+
+        cdef char buffer_item = 0
+        cdef int buffer_item_i = 0
+        cdef int code
+        cdef int64_t n_values = 0
+        for item in obj:
+            n_values += 1
+            if item:
+                buffer_item |= (<char>1 << buffer_item_i)
+
+            buffer_item_i += 1
+            if buffer_item_i == 8:
+                code = ArrowBufferAppendInt8(self._ptr, buffer_item)
+                Error.raise_error_not_ok("ArrowBufferAppendInt8()", code)
+                buffer_item = 0
+                buffer_item_i = 0
+
+        if buffer_item_i != 0:
+            code = ArrowBufferAppendInt8(self._ptr, buffer_item)
+            Error.raise_error_not_ok("ArrowBufferAppendInt8()", code)
+
+        return n_values
+
+    def finish(self):
+        return self

Review Comment:
   This means that it returns a `CBufferBuilder`, which still has the additional 
methods for appending further to the buffer?
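
   A small pure-Python sketch of the behaviour in question, using hypothetical 
stand-in classes (`SketchBuffer`, `SketchBufferBuilder`) rather than the actual 
Cython ones: if `finish()` returns `self`, the caller still holds a builder and 
can keep writing to it.

```python
class SketchBuffer:
    """Hypothetical read-only stand-in for CBuffer (illustration only)."""

    def __init__(self):
        self._data = bytearray()

    def __len__(self):
        return len(self._data)

    def __bytes__(self):
        return bytes(self._data)


class SketchBufferBuilder(SketchBuffer):
    """Hypothetical stand-in for CBufferBuilder: adds write access."""

    def write(self, content):
        # Append raw bytes and report how many were written.
        self._data += content
        return len(content)

    def finish(self):
        # Mirrors the diff: the builder hands back itself.
        return self


builder = SketchBufferBuilder()
builder.write(b"abc")
finished = builder.finish()

# The "finished" object is the very same builder, so writing still works.
finished.write(b"def")
```

Whether that is desirable depends on whether `finish()` is meant to signal the 
end of mutation or only the end of one building step.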



##########
python/src/nanoarrow/c_lib.py:
##########
@@ -257,7 +413,60 @@ def c_array_view(obj, requested_schema=None) -> CArrayView:
     return CArrayView.from_cpu_array(c_array(obj, requested_schema))
 
 
-def allocate_c_schema():
+def c_buffer(obj, requested_schema=None) -> CBuffer:

Review Comment:
   Ah, I see we used `requested_schema` in `c_array(..)` before as well. 
Personally, for places where the _user_ would provide a schema, I would go for 
`schema` instead of `requested_schema` (like `c_array_from_buffers` uses)


