adamreeve opened a new issue, #40517:
URL: https://github.com/apache/arrow/issues/40517

   ### Describe the enhancement requested
   
   Currently the C# `ArrowStreamWriter` always writes all data in a buffer. 
This behaviour differs to the Python/C++ implementation, which only writes 
slices of the buffers when an array has a nonzero offset or size in bytes less 
than the buffer length. This can be observed by looking at the sizes of IPC 
files for a whole RecordBatch, compared to slices of the data.
   
   Python:
   ```python
   import pyarrow as pa
   import numpy as np
   
   
   num_rows = 400
   rows_per_batch = 100
   
   ints = pa.array(np.arange(0, num_rows, 1, dtype=np.int32))
   floats = pa.array(np.arange(0, num_rows / 10.0, 0.1, dtype=np.float32))
   
   all_data = pa.RecordBatch.from_arrays([ints, floats], names=["a", "b"])
   
   sink = pa.BufferOutputStream()
   with pa.ipc.new_stream(sink, all_data.schema) as writer:
       writer.write_batch(all_data)
   buf = sink.getvalue()
   print(f"Size of serialized full batch = {buf.size}")
   
   for offset in range(0, num_rows, rows_per_batch):
       slice = all_data.slice(offset, rows_per_batch)
       sink = pa.BufferOutputStream()
       with pa.ipc.new_stream(sink, slice.schema) as writer:
           writer.write_batch(slice)
       buf = sink.getvalue()
       print(f"Size of serialized slice at offset {offset} = {buf.size}")
   ```
   
   This outputs:
   ```
   Size of serialized full batch = 3576
   Size of serialized slice at offset 0 = 1176
   Size of serialized slice at offset 100 = 1176
   Size of serialized slice at offset 200 = 1176
   Size of serialized slice at offset 300 = 1176
   ```
   
   The size of the full batch is 1/4 the full batch after accounting for the 
overhead of metadata.
   
   Doing the same in C#:
   ```C#
   const int numRows = 400;
   const int rowsPerBatch = 100;
   
   var allData = new RecordBatch.Builder()
       .Append("a", false, col => col.Int32(array => 
array.AppendRange(Enumerable.Range(0, numRows))))
       .Append("b", false, col => col.Float(array => 
array.AppendRange(Enumerable.Range(0, numRows).Select(i => 0.1f * i))))
       .Build();
   
   {
       using var ms = new MemoryStream();
       using var writer = new ArrowFileWriter(ms, allData.Schema, false, new 
IpcOptions());
       await writer.WriteStartAsync();
       await writer.WriteRecordBatchAsync(allData);
       await writer.WriteEndAsync();
   
       Console.WriteLine($"Size of serialized full batch = {ms.Length}");
   }
   
   for (var offset = 0; offset < allData.Length; offset += rowsPerBatch)
   {
       var arraySlices = allData.Arrays
           .Select(arr => ArrowArrayFactory.Slice(arr, offset, rowsPerBatch))
           .ToArray();
       var slice = new RecordBatch(allData.Schema, arraySlices, 
arraySlices[0].Length);
   
       using var ms = new MemoryStream();
       using var writer = new ArrowFileWriter(ms, slice.Schema, false, new 
IpcOptions());
       await writer.WriteStartAsync();
       await writer.WriteRecordBatchAsync(slice);
       await writer.WriteEndAsync();
   
       Console.WriteLine($"Size of serialized slice at offset {offset} = 
{ms.Length}");
   }
   ```
   
   This outputs:
   ```
   Size of serialized full batch = 3802
   Size of serialized slice at offset 0 = 3802
   Size of serialized slice at offset 100 = 3802
   Size of serialized slice at offset 200 = 3802
   Size of serialized slice at offset 300 = 3802
   ```
   
   Writing a slice of the data results in the same file size as writing the 
full data, but we'd like to be able to break IPC data into smaller slices in 
order to send it over a transport that has a message size limit.
   
   From a quick look at the C++ implementation, one complication is dealing 
with null bitmaps, which need to be copied to ensure the start is aligned with 
a byte boundary.
   
   ### Component(s)
   
   C#


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to