[jira] [Created] (ARROW-9048) [C#] Support Float16
Eric Erhardt created ARROW-9048:
---
Summary: [C#] Support Float16
Key: ARROW-9048
URL: https://issues.apache.org/jira/browse/ARROW-9048
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

With [https://github.com/dotnet/runtime/issues/936], .NET is getting a `System.Half` type, which is a 16-bit floating-point number. Once that type lands in .NET, we can implement support for the Float16 type in Arrow.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8953) [C#] Update to .NET SDK 3.1
Eric Erhardt created ARROW-8953:
---
Summary: [C#] Update to .NET SDK 3.1
Key: ARROW-8953
URL: https://issues.apache.org/jira/browse/ARROW-8953
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

We should update our tools to the latest .NET SDK, 3.1. This will enable new tooling features, such as the code style rules package that will enforce coding style: [https://github.com/apache/arrow/pull/7246#issuecomment-634206767]

There are 3 places that I know of that need updating:

[https://github.com/apache/arrow/blob/master/.github/workflows/csharp.yml]
[https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/.github/workflows/dev.yml#L132]
[https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/dev/release/verify-release-candidate.sh#L327]
[jira] [Created] (ARROW-8882) [C#] Add .editorconfig to C# code
Eric Erhardt created ARROW-8882:
---
Summary: [C#] Add .editorconfig to C# code
Key: ARROW-8882
URL: https://issues.apache.org/jira/browse/ARROW-8882
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

Adding an .editorconfig allows for a consistent code format throughout the C# code in the repo. That way, when a new contributor submits a change, their editor will automatically format the code to match the current code base.
[jira] [Created] (ARROW-7516) [C#] .NET Benchmarks are broken
Eric Erhardt created ARROW-7516:
---
Summary: [C#] .NET Benchmarks are broken
Key: ARROW-7516
URL: https://issues.apache.org/jira/browse/ARROW-7516
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

See [https://github.com/apache/arrow/pull/6030#issuecomment-571877721]

It looks like the issue is that in the Benchmarks, `Length` is specified as `1_000_000`, and there have only been ~730,000 days since `DateTime.MinValue`, so this line fails:

https://github.com/apache/arrow/blob/4634c89fc77f70fb5b5d035d6172263a4604da82/csharp/test/Apache.Arrow.Tests/TestData.cs#L130

A simple fix would be to cap what we pass into `AddDays` to some number like `100_000`.
[jira] [Created] (ARROW-6795) [C#] Reading large Arrow files in C# results in an exception
Eric Erhardt created ARROW-6795:
---
Summary: [C#] Reading large Arrow files in C# results in an exception
Key: ARROW-6795
URL: https://issues.apache.org/jira/browse/ARROW-6795
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

If you try to read a large Arrow file (2GB+) using the C# reader, you get an exception because it is casting the file position (a 64-bit long) to a 32-bit integer, which overflows when the file size is large.

See [https://github.com/apache/arrow/pull/5412]
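The failure mode can be reproduced without Arrow at all. The following is only a minimal sketch, assuming a position just past 2 GiB; `LargeOffsetSketch` and `OverflowsInt32` are hypothetical names for illustration and are not part of Apache.Arrow.

```csharp
using System;

public static class LargeOffsetSketch
{
    // Hypothetical helper: true when a 64-bit file position
    // cannot be represented as a 32-bit int.
    public static bool OverflowsInt32(long position) =>
        position > int.MaxValue || position < int.MinValue;

    public static void Main()
    {
        long position = 2L * 1024 * 1024 * 1024; // 2 GiB, which is int.MaxValue + 1

        try
        {
            // A checked cast throws instead of silently wrapping around.
            int truncated = checked((int)position);
            Console.WriteLine(truncated);
        }
        catch (OverflowException)
        {
            Console.WriteLine("position does not fit in an int");
        }

        // Keeping the value as a long end to end avoids the problem.
        Console.WriteLine(OverflowsInt32(position)); // True
    }
}
```

The point of the sketch is that offsets and lengths derived from a file position should stay `long` all the way through the reader.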
[jira] [Created] (ARROW-6728) [C#] Support reading and writing Date32 and Date64 arrays
Eric Erhardt created ARROW-6728:
---
Summary: [C#] Support reading and writing Date32 and Date64 arrays
Key: ARROW-6728
URL: https://issues.apache.org/jira/browse/ARROW-6728
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

The C# implementation doesn't support reading and writing Date32 and Date64 arrays. We need to add support and some tests. It looks like it is only a couple of lines to get this enabled.

See [https://github.com/apache/arrow/pull/5413].
[jira] [Created] (ARROW-6643) [C#] Write no IPC buffer metadata for NullType
Eric Erhardt created ARROW-6643:
---
Summary: [C#] Write no IPC buffer metadata for NullType
Key: ARROW-6643
URL: https://issues.apache.org/jira/browse/ARROW-6643
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

We need to align the C# writer (and test the reader) for NullType. See [https://github.com/apache/arrow/pull/5287] and ARROW-6379.

>The C++ implementation has been writing 2 {{Buffer}} Flatbuffer struct values
>with length 0 for NullType. Rather than having dummy/placeholder Buffer I
>think it is more consistent to write no metadata for this type.
[jira] [Created] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls
Eric Erhardt created ARROW-6603:
---
Summary: [C#] ArrayBuilder API to support writing nulls
Key: ARROW-6603
URL: https://issues.apache.org/jira/browse/ARROW-6603
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

There is currently no API in the PrimitiveArrayBuilder class to support writing nulls. See this TODO: [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101].

Also see [https://github.com/apache/arrow/issues/5381].

We should add some APIs to support writing nulls.
[jira] [Created] (ARROW-6553) [C#] Decide how to read message lengths - little-endian or machine dependent
Eric Erhardt created ARROW-6553:
---
Summary: [C#] Decide how to read message lengths - little-endian or machine dependent
Key: ARROW-6553
URL: https://issues.apache.org/jira/browse/ARROW-6553
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt

See the discussion [here|https://github.com/apache/arrow/pull/5280#discussion_r323896532].

We are currently reading message lengths using machine-dependent endianness. Should this be changed to little-endian all the time?

It appears the C++ implementation does the same thing: it uses machine-dependent endianness.

--
This message was sent by Atlassian Jira (v8.3.2#803003)
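To make the two options concrete, here is a small sketch that reads the same 4-byte length prefix both ways using only the .NET base class library. `MessageLengthSketch` and its method names are hypothetical; this is not the code in the C# reader.

```csharp
using System;
using System.Buffers.Binary;

public static class MessageLengthSketch
{
    // Always interprets the 4-byte prefix as little-endian,
    // regardless of the host CPU.
    public static int ReadLittleEndian(ReadOnlySpan<byte> prefix) =>
        BinaryPrimitives.ReadInt32LittleEndian(prefix);

    // Interprets the prefix using the host CPU's byte order.
    public static int ReadMachineDependent(byte[] prefix) =>
        BitConverter.ToInt32(prefix, 0);

    public static void Main()
    {
        byte[] prefix = { 0x10, 0x00, 0x00, 0x00 };

        Console.WriteLine(ReadLittleEndian(prefix));     // 16 on every platform
        Console.WriteLine(ReadMachineDependent(prefix)); // 16 only where BitConverter.IsLittleEndian is true
    }
}
```

On little-endian hardware both calls agree, which is why the difference only shows up when a stream crosses to a big-endian machine.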
[jira] [Created] (ARROW-6322) [C#] Implement a plasma client
Eric Erhardt created ARROW-6322:
---
Summary: [C#] Implement a plasma client
Key: ARROW-6322
URL: https://issues.apache.org/jira/browse/ARROW-6322
Project: Apache Arrow
Issue Type: New Feature
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

We should create a C# plasma client, so .NET code can get and put objects into the plasma store.

An easy-ish way of implementing this would be to build on the c_glib C APIs already exposed for the plasma client. Unfortunately, I haven't found a decent C# GObject generator, so I think the C bindings will need to be written by hand, but there aren't too many of them.
[jira] [Created] (ARROW-5908) [C#] ArrowStreamWriter doesn't align buffers to 8 bytes
Eric Erhardt created ARROW-5908:
---
Summary: [C#] ArrowStreamWriter doesn't align buffers to 8 bytes
Key: ARROW-5908
URL: https://issues.apache.org/jira/browse/ARROW-5908
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

When writing RecordBatches using ArrowStreamWriter, if the ArrowBuffers being written aren't all 8-byte aligned, the serialized RecordBatch won't conform to the Arrow specification. This causes other languages' readers to throw an error when reading Arrow streams written by the C# writer.

For example, if reading the stream from Python or C++, an error is raised here: [https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/ipc/reader.cc#L107-L110]

A similar error is raised when Java tries to read the stream.

We should be ensuring that the buffers being written to the stream are padded to 8 bytes, no matter their length, as specified in [https://arrow.apache.org/docs/format/Layout.html#requirements-goals-and-non-goals]

{quote}
* It is required to have all the contiguous memory buffers in an IPC payload aligned at 8-byte boundaries. In other words, each buffer must start at an aligned 8-byte offset. Additionally, each buffer should be padded to a multiple of 8 bytes.
{quote}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
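The padding rule quoted from the specification can be sketched in a few lines. This is an illustration only, assuming the stream is already 8-byte aligned when the first buffer is written; `PaddingSketch`, `PaddedLength` and `WritePadded` are hypothetical names, not ArrowStreamWriter members.

```csharp
using System;
using System.IO;

public static class PaddingSketch
{
    // Rounds length up to the next multiple of 8.
    public static long PaddedLength(long length) => (length + 7) & ~7L;

    // Writes the buffer followed by zero bytes so that the next buffer
    // starts on an 8-byte boundary. Returns the number of bytes written.
    public static long WritePadded(Stream stream, byte[] buffer)
    {
        stream.Write(buffer, 0, buffer.Length);
        int padding = (int)(PaddedLength(buffer.Length) - buffer.Length);
        for (int i = 0; i < padding; i++)
        {
            stream.WriteByte(0);
        }
        return buffer.Length + padding;
    }

    public static void Main()
    {
        using var stream = new MemoryStream();
        Console.WriteLine(WritePadded(stream, new byte[5]));  // 8
        Console.WriteLine(WritePadded(stream, new byte[16])); // 16
        Console.WriteLine(stream.Length);                     // 24
    }
}
```

The offsets recorded in the message metadata would need to use the padded lengths as well, so readers find each buffer where the writer actually put it.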
[jira] [Created] (ARROW-5896) [C#] Array Builders should take an initial capacity in their constructors
Eric Erhardt created ARROW-5896:
---
Summary: [C#] Array Builders should take an initial capacity in their constructors
Key: ARROW-5896
URL: https://issues.apache.org/jira/browse/ARROW-5896
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

When using the Fluent Array Builder API, we should take in an initial capacity in the constructor, so we can avoid allocating unnecessary memory. Today, if you create a builder and then call `.Reserve(length)` on it, the initial byte[] that was created in the constructor is wasted.
[jira] [Created] (ARROW-5887) [C#] ArrowStreamWriter writes FieldNodes in wrong order
Eric Erhardt created ARROW-5887:
---
Summary: [C#] ArrowStreamWriter writes FieldNodes in wrong order
Key: ARROW-5887
URL: https://issues.apache.org/jira/browse/ARROW-5887
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

When ArrowStreamWriter is writing a {{RecordBatch}} with {{null}}s in it, it is mixing up the columns' {{NullCount}} values.

You can see here: [https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L195-L200]

It is writing the fields from {{0}} -> {{fieldCount}} order. But then [lower|https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L216-L220], it is writing the fields from {{fieldCount}} -> {{0}}.

Looking at the [Java implementation|https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/FBSerializables.java#L36-L44] it says

{quote}// struct vectors have to be created in reverse order
{quote}

A simple test of roundtripping the following RecordBatch shows the issue:

{code:java}
var result = new RecordBatch(
    new Schema.Builder()
        .Field(f => f.Name("age").DataType(Int32Type.Default))
        .Field(f => f.Name("CharCount").DataType(Int32Type.Default))
        .Build(),
    new IArrowArray[]
    {
        new Int32Array(
            new ArrowBuffer.Builder().Append(0).Build(),
            new ArrowBuffer.Builder().Append(0).Build(),
            length: 1,
            nullCount: 1,
            offset: 0),
        new Int32Array(
            new ArrowBuffer.Builder().Append(7).Build(),
            ArrowBuffer.Empty,
            length: 1,
            nullCount: 0,
            offset: 0)
    },
    length: 1);
{code}

Here, the "age" column should have a `null` in it. However, when you write and read this RecordBatch back, you see that the "CharCount" column has `NullCount` == 1 and "age" column has `NullCount` == 0.
[jira] [Created] (ARROW-5708) [C#] Null support for BooleanArray
Eric Erhardt created ARROW-5708:
---
Summary: [C#] Null support for BooleanArray
Key: ARROW-5708
URL: https://issues.apache.org/jira/browse/ARROW-5708
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

See the conversation [here|https://github.com/apache/arrow/pull/4640#discussion_r296417726] and [here|https://github.com/apache/arrow/pull/3574#discussion_r262662083].

We should add null support for BooleanArray.
[jira] [Created] (ARROW-5546) [C#] Remove IArrowArray and use Array base class.
Eric Erhardt created ARROW-5546:
---
Summary: [C#] Remove IArrowArray and use Array base class.
Key: ARROW-5546
URL: https://issues.apache.org/jira/browse/ARROW-5546
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Affects Versions: 0.13.0
Reporter: Eric Erhardt

In .NET libraries, we have historically favored classes (abstract or otherwise) over interfaces. See [Choosing Between Classes and Interfaces|https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms229013(v%3dvs.100)]. The main reasoning is that you can add members to a class over time, but once you ship an interface, it can never be changed. You can only add new interfaces.

In light of this, we should remove the IArrowArray interface, and instead just use the base `Array` class as the abstraction for all Arrow Arrays.

As part of this, we should also consider renaming `Array` because it conflicts with the System.Array type. Instead we should consider naming it `ArrowArray` to distinguish it from the very common System.Array type in .NET.
[jira] [Created] (ARROW-5278) [C#] ArrowBuffer should either implement IEquatable correctly or not at all
Eric Erhardt created ARROW-5278:
---
Summary: [C#] ArrowBuffer should either implement IEquatable correctly or not at all
Key: ARROW-5278
URL: https://issues.apache.org/jira/browse/ARROW-5278
Project: Apache Arrow
Issue Type: Bug
Reporter: Eric Erhardt

See the discussion [here|https://github.com/apache/arrow/pull/3925/#discussion_r281378027].

ArrowBuffer currently implements IEquatable, but doesn't override `GetHashCode`. We should either implement IEquatable correctly by overriding Equals and GetHashCode, or remove IEquatable altogether.

Looking at ArrowBuffer's [Equals implementation|https://github.com/apache/arrow/blob/08829248fd540b7e3bd96b980e357f8a4db7970e/csharp/src/Apache.Arrow/ArrowBuffer.cs#L66-L69], it compares each value in the buffer, which is not very efficient. Also, this implementation is not consistent with how `Memory` implements IEquatable - [https://source.dot.net/#System.Private.CoreLib/shared/System/Memory.cs,500].

If we continue implementing IEquatable on ArrowBuffer, we should consider implementing it in the same fashion as Memory does.
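One possible shape for "the same fashion as Memory" is to delegate both members to the wrapped memory. The struct below is a hypothetical stand-in for ArrowBuffer, sketched only to show `Equals` and `GetHashCode` kept consistent; it assumes the buffer wraps a `ReadOnlyMemory<byte>`, which may not match the real type's layout.

```csharp
using System;

// Hypothetical stand-in for ArrowBuffer; only the equality members are sketched.
public readonly struct BufferSketch : IEquatable<BufferSketch>
{
    private readonly ReadOnlyMemory<byte> _memory;

    public BufferSketch(ReadOnlyMemory<byte> memory) => _memory = memory;

    // Delegates to ReadOnlyMemory<T>.Equals, which compares the underlying
    // object, start index, and length rather than the contents.
    public bool Equals(BufferSketch other) => _memory.Equals(other._memory);

    public override bool Equals(object obj) => obj is BufferSketch other && Equals(other);

    // Kept consistent with Equals: equal values produce equal hash codes.
    public override int GetHashCode() => _memory.GetHashCode();
}

public static class Program
{
    public static void Main()
    {
        var bytes = new byte[] { 1, 2, 3 };
        var a = new BufferSketch(bytes);
        var b = new BufferSketch(bytes);
        var c = new BufferSketch(new byte[] { 1, 2, 3 });

        Console.WriteLine(a.Equals(b));                        // True: same array, same range
        Console.WriteLine(a.Equals(c));                        // False: same contents, different array
        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // True
    }
}
```

The trade-off is that two buffers with identical contents in different arrays compare unequal, so callers that want content equality would need a separate, explicit comparison.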
[jira] [Created] (ARROW-5277) [C#] MemoryAllocator.Allocate(length: 0) should not return null
Eric Erhardt created ARROW-5277:
---
Summary: [C#] MemoryAllocator.Allocate(length: 0) should not return null
Key: ARROW-5277
URL: https://issues.apache.org/jira/browse/ARROW-5277
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

See the conversation [here|https://github.com/apache/arrow/pull/3925#discussion_r281187184].

We should change MemoryAllocator to not return `null` when the requested memory length is `0`. Instead, we should create a cached "NullObject" IMemoryOwner that has a no-op `Dispose` method, and always returns `Memory.Empty`.

This way consuming code doesn't need to check for `null` being returned from MemoryAllocator.Allocate.
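A rough sketch of the proposed "NullObject" owner, using only `System.Buffers`. The names `NullMemoryOwner` and `AllocatorSketch` are hypothetical, and `MemoryPool<byte>.Shared` merely stands in for whatever the real allocator does for non-zero lengths.

```csharp
using System;
using System.Buffers;

// Hypothetical name; the issue only calls it a cached "NullObject" owner.
public sealed class NullMemoryOwner : IMemoryOwner<byte>
{
    // One shared instance is enough because it holds no resources.
    public static readonly NullMemoryOwner Instance = new NullMemoryOwner();

    private NullMemoryOwner() { }

    // Always the empty memory, so callers can use it without a null check.
    public Memory<byte> Memory => System.Memory<byte>.Empty;

    // Nothing to release.
    public void Dispose() { }
}

public static class AllocatorSketch
{
    // Stand-in for MemoryAllocator.Allocate: never returns null.
    // Note that MemoryPool.Rent may hand back more than the requested length.
    public static IMemoryOwner<byte> Allocate(int length) =>
        length == 0 ? NullMemoryOwner.Instance : MemoryPool<byte>.Shared.Rent(length);
}
```

With this in place, `using (var owner = AllocatorSketch.Allocate(0)) { ... }` works the same way as for any other length, and disposing the shared instance repeatedly is harmless.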
[jira] [Created] (ARROW-5276) [C#] NativeMemoryAllocator expose an option for clearing allocated memory
Eric Erhardt created ARROW-5276:
---
Summary: [C#] NativeMemoryAllocator expose an option for clearing allocated memory
Key: ARROW-5276
URL: https://issues.apache.org/jira/browse/ARROW-5276
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

See the discussion [here|https://github.com/apache/arrow/pull/3925#discussion_r281192698].

We should expose an option on NativeMemoryAllocator for controlling whether the allocated memory is cleared or not.

Maybe the default should be to not clear the memory, so that the best-performing behavior is the default.
[jira] [Created] (ARROW-5092) [C#] Source Link doesn't work with the C# release script
Eric Erhardt created ARROW-5092:
---
Summary: [C#] Source Link doesn't work with the C# release script
Key: ARROW-5092
URL: https://issues.apache.org/jira/browse/ARROW-5092
Project: Apache Arrow
Issue Type: Bug
Components: C#
Affects Versions: 0.13.0
Reporter: Eric Erhardt

With the 0.13.0 C# NuGet package, [Source Link|https://docs.microsoft.com/en-us/dotnet/standard/library-guidance/sourcelink] doesn't work. The symbols can be downloaded from nuget.org correctly, but when Visual Studio tries to download the code, it cannot find the correct files.

The following is why it doesn't work:

The .NET tooling expects the build of an official release to happen in the context of a {{git}} repository. This does 2 things to the produced assets:

# In the {{.nupkg}} file that is generated, the .NET tooling will encode the current git commit's SHA hash into both the {{Apache.Arrow.nuspec}} file, and into the compiled {{Apache.Arrow.dll}} assembly. Looking at the released version that was published over the weekend: [https://www.nuget.org/packages/Apache.Arrow/0.13.0], this information made it into the {{.nuspec}} and the {{.dll}}:
{code}
[assembly: AssemblyInformationalVersion("0.13.0+57de5c3adffe526f37366bb15c3ff0d4a2e84655")]
https://github.com/apache/arrow" commit="57de5c3adffe526f37366bb15c3ff0d4a2e84655" />
{code}
However, I don't see how the [C# release script|https://github.com/apache/arrow/blob/master/dev/release/post-06-csharp.sh] could have done that.
# Also, .NET has a feature called "Source Link", which allows for the source code to be automatically downloaded from GitHub when debugging into this library. The way the tooling works today, it requires that the git repository's {{origin}} remote is set to [https://github.com/apache/arrow.git]. The tooling uses the `origin` git remote to encode the GitHub URL into the symbols file in the {{.snupkg}} file. This, however, doesn't work with the 0.13.0 release that occurred over the weekend.

I tried using the Source Link feature, and it didn't automatically download the source files from GitHub. Looking into the symbols file, I see the Source Link information that was embedded:

{code}
1: '/home/kou/work/cpp/arrow.kou/apache-arrow-0.13.0/csharp/src/Apache.Arrow/Flatbuf/FlatBuffers/ByteBuffer.cs' (#19c) C# (#3) SHA-1 (#2) 04-64-A0-48-82-EA-F5-B5-50-EC-CA-9F-85-75-E2-95-A4-EC-AB-B3 (#1b7)
2: '/home/kou/work/cpp/arrow.kou/apache-arrow-0.13.0/csharp/src/Apache.Arrow/Flatbuf/FlatBuffers/ByteBufferUtil.cs' (#68f) C# (#3) SHA-1 (#2) F0-4F-28-53-88-A4-E0-6E-F1-1F-17-F6-CD-FE-0E-64-AB-0B-C2-95 (#6aa)
{code}

{code:json}
{
  "documents": {
    "/home/kou/work/cpp/arrow.kou/*": "https://raw.githubusercontent.com/kou/arrow/57de5c3adffe526f37366bb15c3ff0d4a2e84655/*",
    "/home/kou/work/cpp/arrow.kou/cpp/submodules/parquet-testing/*": "https://raw.githubusercontent.com/apache/parquet-testing/bb7b6abbb3fbeff845646364a4286142127be04c/*"
  }
}
{code}

Here it appears the {{origin}} remote was set to {{kou/arrow}}, and not {{apache/arrow}}. Also, it appears the {{apache-arrow-0.13.0}} folder was under a git repository, and so the sources aren't matched up with the git repository. (Basically that folder shouldn't have appeared in the Documents list that has the {{.cs}} file path.) I think this explains how (1) above happened - the build was under a git repository - but this script downloaded an extra copy of the sources into that git repository.

I'm wondering how we can fix either this script, or the .NET Tooling, or both, to make this experience better for the next release. I think we need to ensure two things:

# The git commit information is set correctly in the {{.nuspec}} and the {{.dll}} when the release build is run. I think it just happened by pure luck this time. It just so happened that the script was executed in an already established repo, and it just so happened to be on the right commit (or maybe it wasn't the right commit?).
# The source link information is set correctly in the symbols file.

[~wesmckinn] [~kou]
[jira] [Created] (ARROW-5035) [C#] ArrowBuffer.Builder is broken
Eric Erhardt created ARROW-5035:
---
Summary: [C#] ArrowBuffer.Builder is broken
Key: ARROW-5035
URL: https://issues.apache.org/jira/browse/ARROW-5035
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

If someone creates and uses `ArrowBuffer.Builder` in their code to create an ArrowBuffer filled with Boolean values, it currently produces the wrong results.

It produces the wrong results because it takes `sizeof(bool)` (which is 1) and uses that as the number of bytes to write into the backing buffer for each element added to the builder. However, in Arrow, Boolean values are stored in a bit-wise fashion, allowing for 8 Boolean values in a single byte. Thus, when I add 4 `true` values to the buffer, I expect to get a buffer with 1 byte in it with the value 0x0F. However, I am getting a buffer with 4 bytes in it, each with value 0x01.

One way to fix this would be to throw in `ArrowBuffer.Builder`'s constructor if `T` == `bool` and instead create a new class `ArrowBuffer.BooleanBuilder`, which will create Boolean buffers correctly. Looking at the current implementation, I think it would be rather hard to special case `typeof(bool)` all over in the `Builder` class, but if someone wanted to take that approach and made it work, that would be great too.
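The expected bit-wise layout can be sketched as follows. `BooleanPackingSketch.Pack` is a hypothetical helper written for illustration, not the proposed `ArrowBuffer.BooleanBuilder`; it assumes Arrow's least-significant-bit-first ordering within each byte.

```csharp
using System;

public static class BooleanPackingSketch
{
    // Packs booleans least-significant-bit first, 8 values per byte.
    public static byte[] Pack(ReadOnlySpan<bool> values)
    {
        var bytes = new byte[(values.Length + 7) / 8];
        for (int i = 0; i < values.Length; i++)
        {
            if (values[i])
            {
                bytes[i / 8] |= (byte)(1 << (i % 8));
            }
        }
        return bytes;
    }

    public static void Main()
    {
        byte[] packed = Pack(new[] { true, true, true, true });
        Console.WriteLine(packed.Length);            // 1
        Console.WriteLine(packed[0].ToString("X2")); // 0F
    }
}
```

Four `true` values land in the low four bits of a single byte, which is the 0x0F result described above, rather than four separate 0x01 bytes.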
[jira] [Created] (ARROW-5034) [C#] ArrowStreamWriter should expose synchronous Write methods
Eric Erhardt created ARROW-5034:
---
Summary: [C#] ArrowStreamWriter should expose synchronous Write methods
Key: ARROW-5034
URL: https://issues.apache.org/jira/browse/ARROW-5034
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

There are times when callers are in a synchronous method and need to write an Arrow stream. However, ArrowStreamWriter (and ArrowFileWriter) only expose WriteAsync methods, which means the caller needs to call the Async method, and then block on the resulting Task.

Instead, we should also expose Write methods that complete in a synchronous fashion, so the callers are free to choose the sync or async methods as they need.
[jira] [Created] (ARROW-5019) [C#] ArrowStreamWriter doesn't work on a non-seekable stream
Eric Erhardt created ARROW-5019:
---
Summary: [C#] ArrowStreamWriter doesn't work on a non-seekable stream
Key: ARROW-5019
URL: https://issues.apache.org/jira/browse/ARROW-5019
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

When writing to a non-seekable .NET Stream (like a network/socket stream), ArrowStreamWriter will throw an exception:

{code:java}
Exception thrown: 'System.NotSupportedException' in System.Net.Sockets.dll
This stream does not support seek operations.
{code}

This throws because we are using `BaseStream.Position` in the writer to calculate the number of bytes that we've written to the stream. We don't need to use the Position in order to calculate the lengths.

We should be able to write an Arrow RecordBatch to a NetworkStream directly. Today, we need to write to a MemoryStream, and then copy the MemoryStream to the NetworkStream.
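One way to compute lengths without touching `Position` is to count bytes as they are written. The wrapper below is a hypothetical helper sketched for illustration; it is not how ArrowStreamWriter is structured, and a real fix might track the count inline instead.

```csharp
using System;
using System.IO;

// Hypothetical helper: tracks how many bytes have been written so the
// writer never has to ask the underlying stream for its Position.
public sealed class CountingWriter
{
    private readonly Stream _stream;

    public CountingWriter(Stream stream) => _stream = stream;

    public long BytesWritten { get; private set; }

    public void Write(byte[] buffer)
    {
        _stream.Write(buffer, 0, buffer.Length);
        BytesWritten += buffer.Length;
    }
}

public static class Program
{
    public static void Main()
    {
        var writer = new CountingWriter(new MemoryStream());

        long start = writer.BytesWritten;
        writer.Write(new byte[12]);

        // The length of what was just written, with no call to Position.
        Console.WriteLine(writer.BytesWritten - start); // 12
    }
}
```

Because the count only depends on `Write` calls, the same code works for a NetworkStream or any other stream whose `CanSeek` is false.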
[jira] [Created] (ARROW-4997) [C#] ArrowStreamReader doesn't consume whole stream and doesn't implement sync read
Eric Erhardt created ARROW-4997:
---
Summary: [C#] ArrowStreamReader doesn't consume whole stream and doesn't implement sync read
Key: ARROW-4997
URL: https://issues.apache.org/jira/browse/ARROW-4997
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

There are 2 major issues with the ArrowStreamReader that are blocking me from using it.

# When it reads a batch from a .NET Stream that doesn't return the whole chunk of memory in one "Read" call (like a socket/network stream), it only calls Read once, and then continues on. This is an issue because it has "garbage" at the end of its buffer (which was never written to by the stream), and when attempting to read the next batch, it is in the middle of the previous batch from the .NET Stream. This causes all sorts of issues because it assumes the next 4 bytes are the message length, which they obviously aren't. See [the reading code|https://github.com/apache/arrow/blob/13fd813445b4738cbebbd137490fe3c02071c04b/csharp/src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs#L90-L97] for where it only calls Read once - it should be in a loop.
# ArrowStreamReader has a synchronous ReadNextRecordBatch() method - but it throws NotImplementedException. A synchronous implementation is necessary because a caller that isn't in an async method can't (or shouldn't) call the async API.
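For the first issue, the loop could look roughly like this. `ReadFullySketch.ReadFully` is a hypothetical helper shown as a sketch; the real reader may prefer to throw when the stream ends early rather than return a short count.

```csharp
using System;
using System.IO;

public static class ReadFullySketch
{
    // Keeps calling Read until the buffer is full or the stream ends.
    // Returns the number of bytes actually read.
    public static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                break; // end of stream
            }
            total += read;
        }
        return total;
    }
}
```

A caller can then compare the returned count with the expected message length and treat a short read as a truncated stream, instead of parsing whatever happens to be left in the buffer.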
[jira] [Created] (ARROW-4839) [C#] Add NuGet support
Eric Erhardt created ARROW-4839:
---
Summary: [C#] Add NuGet support
Key: ARROW-4839
URL: https://issues.apache.org/jira/browse/ARROW-4839
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

We should add the metadata to the .csproj so we can create a NuGet package without changing any source code.

Also, we should add any scripts and documentation on how to create the NuGet package to allow ease of creation at release time.
[jira] [Created] (ARROW-4737) [C#] tests are not running in CI
Eric Erhardt created ARROW-4737:
---
Summary: [C#] tests are not running in CI
Key: ARROW-4737
URL: https://issues.apache.org/jira/browse/ARROW-4737
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

The C# tests are not running in CI because the filtering logic needs to be updated.

For example see https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/22671460/job/nk1nn59k5njie720

{quote}
Build started
git clone -q https://github.com/apache/arrow.git C:\projects\arrow
git fetch -q origin +refs/pull/3662/merge:
git checkout -qf FETCH_HEAD
Running Install scripts
python ci\detect-changes.py > generated_changes.bat
Affected files: [u'csharp/src/Apache.Arrow/Field.Builder.cs', u'csharp/src/Apache.Arrow/Schema.Builder.cs', u'csharp/test/Apache.Arrow.Tests/SchemaBuilderTests.cs', u'csharp/test/Apache.Arrow.Tests/TypeTests.cs']
Affected topics: {'c_glib': False, 'cpp': False, 'dev': False, 'docs': False, 'go': False, 'integration': False, 'java': False, 'js': False, 'python': False, 'r': False, 'ruby': False, 'rust': False, 'site': False}
call generated_changes.bat
call ci\appveyor-filter-changes.bat
===
=== No C++ or Python changes, exiting job
===
Build was forcibly terminated
Build success
{quote}
[jira] [Created] (ARROW-4717) [C#] Consider exposing ValueTask instead of Task
Eric Erhardt created ARROW-4717:
---
Summary: [C#] Consider exposing ValueTask instead of Task
Key: ARROW-4717
URL: https://issues.apache.org/jira/browse/ARROW-4717
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

See [https://github.com/apache/arrow/pull/3736#pullrequestreview-207169204] for the discussion and [https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/] for the reasoning.

Using `Task` in a public API requires that a new Task instance be allocated on every call. When returning synchronously, using ValueTask will allow the method to not allocate.

In order to do this, we will need to take a new dependency on the {{System.Threading.Tasks.Extensions}} NuGet package.
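As a rough illustration of the allocation-free synchronous path, consider the sketch below. `CachedLengthReader` is a hypothetical class, and the value 42 and `Task.Yield()` merely stand in for real asynchronous I/O; none of this is taken from the Arrow code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical reader used only to show the two ValueTask paths.
public sealed class CachedLengthReader
{
    private int? _cached;

    public ValueTask<int> ReadLengthAsync()
    {
        if (_cached.HasValue)
        {
            // Synchronous path: wraps the result directly, no Task object is created.
            return new ValueTask<int>(_cached.Value);
        }

        // Asynchronous path: falls back to a regular Task.
        return new ValueTask<int>(ReadSlowAsync());
    }

    private async Task<int> ReadSlowAsync()
    {
        await Task.Yield(); // stands in for real asynchronous I/O
        _cached = 42;       // placeholder value
        return _cached.Value;
    }
}
```

Callers still write `await reader.ReadLengthAsync()` as before; the difference is only that a call that can complete immediately no longer has to allocate.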
[jira] [Created] (ARROW-4571) [Format] Tensor.fbs file has multiple root_type declarations
Eric Erhardt created ARROW-4571:
---
Summary: [Format] Tensor.fbs file has multiple root_type declarations
Key: ARROW-4571
URL: https://issues.apache.org/jira/browse/ARROW-4571
Project: Apache Arrow
Issue Type: Bug
Components: Format
Reporter: Eric Erhardt

Looking at [the flatbuffers doc|https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html], it appears there should only be one `root_type` declaration in an fbs file:

{code:java}
The last part of the schema is the root_type. The root type declares what will be the root table for the serialized data. In our case, the root type is our Monster table.
{code}

However, the Tensor.fbs file has multiple `root_type` declarations:

[https://github.com/apache/arrow/blob/69d595ae4c61902b3f2778e536fca6675350c88c/format/Tensor.fbs#L53]
[https://github.com/apache/arrow/blob/69d595ae4c61902b3f2778e536fca6675350c88c/format/Tensor.fbs#L146]

See the discussion here: https://github.com/apache/arrow/pull/2546#discussion_r256549256
[jira] [Created] (ARROW-4543) [C#] Update Flat Buffers code to latest version
Eric Erhardt created ARROW-4543:
---
Summary: [C#] Update Flat Buffers code to latest version
Key: ARROW-4543
URL: https://issues.apache.org/jira/browse/ARROW-4543
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

In order to support zero-copy reads, we should update to the latest Google Flat Buffers code. [A recent change|https://github.com/google/flatbuffers/pull/4886] now allows C# support for directly reading and writing to memory other than byte[], which will make reading native memory using `Memory` possible.

Along with this update, we should mark the flat buffers types as `internal`, since they are an implementation detail of the library. From an API perspective, it is confusing to see multiple public types named "Schema", "Field", "RecordBatch" etc.
[jira] [Created] (ARROW-4503) [C#] ArrowStreamReader allocates and copies data excessively
Eric Erhardt created ARROW-4503:
---
Summary: [C#] ArrowStreamReader allocates and copies data excessively
Key: ARROW-4503
URL: https://issues.apache.org/jira/browse/ARROW-4503
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt

When reading `RecordBatch` instances using the `ArrowStreamReader` class, it is currently allocating and copying memory 3 times for the data.

# It is allocating memory in order to [read the data from the Stream|https://github.com/apache/arrow/blob/044b418fa108a57f0b4e2e887546cc3e68271397/csharp/src/Apache.Arrow/Ipc/ArrowStreamReader.cs#L72-L74], and then reading from the Stream. (This should be the only allocation that is necessary.)
# It then [creates a new `ArrowBuffer.Builder`|https://github.com/apache/arrow/blob/044b418fa108a57f0b4e2e887546cc3e68271397/csharp/src/Apache.Arrow/Ipc/ArrowStreamReader.cs#L227-L228], which allocates another `byte[]`, and calls `Append` on it, which copies the values to the new `byte[]`.
# Finally, it then calls `.Build()` on the `ArrowBuffer.Builder`, which [allocates memory from the MemoryPool, and then copies the intermediate buffer|https://github.com/apache/arrow/blob/044b418fa108a57f0b4e2e887546cc3e68271397/csharp/src/Apache.Arrow/ArrowBuffer.Builder.cs#L112-L121] into it.

We should reduce this overhead to only allocating a single time (from the MemoryPool), and not copying the data more times than necessary.
[jira] [Created] (ARROW-4502) [C#] Add support for zero-copy reads
Eric Erhardt created ARROW-4502:
---
Summary: [C#] Add support for zero-copy reads
Key: ARROW-4502
URL: https://issues.apache.org/jira/browse/ARROW-4502
Project: Apache Arrow
Issue Type: Improvement
Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt

In the Python (and C++) API, you can create a `RecordBatchStreamReader`, and if you give it an `InputStream` that supports zero-copy reads, you can get back `RecordBatch` objects without allocating new memory and copying all the data.

In C#, there is currently no way to read Arrow RecordBatch instances without allocating new memory and copying all the data. We should enable this scenario in the C# API.

My proposal is to create a new `class ArrowRecordBatchReader : IArrowReader`. Its constructor will take a `ReadOnlyMemory data` parameter, and it will be able to read `RecordBatch` instances just like the existing `ArrowStreamReader`. As part of this new class, we will refactor any common code out of `ArrowStreamReader` in order for the parsing logic to be shared, where necessary.
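The building block for this is that slicing a `ReadOnlyMemory<byte>` yields a view, not a copy. The sketch below assumes the data parameter is a `ReadOnlyMemory<byte>`; `ZeroCopySketch`, `BufferAt` and `AsInt32` are hypothetical names and not part of the proposed `ArrowRecordBatchReader`.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ZeroCopySketch
{
    // Returns a view over part of data; no bytes are copied.
    public static ReadOnlyMemory<byte> BufferAt(ReadOnlyMemory<byte> data, int offset, int length) =>
        data.Slice(offset, length);

    // Reinterprets the bytes as 32-bit integers in place (host byte order).
    public static ReadOnlySpan<int> AsInt32(ReadOnlyMemory<byte> buffer) =>
        MemoryMarshal.Cast<byte, int>(buffer.Span);

    public static void Main()
    {
        byte[] data = new byte[16];
        ReadOnlyMemory<byte> view = BufferAt(data, 8, 8);

        // A later write to the source is visible through the view,
        // which shows that the view aliases data rather than copying it.
        data[8] = 7;

        Console.WriteLine(view.Span[0]);         // 7
        Console.WriteLine(AsInt32(view).Length); // 2
    }
}
```

A reader built this way would hand out arrays whose buffers point into the caller's memory, so the caller has to keep that memory alive and unchanged for as long as the RecordBatch is in use.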
[jira] [Created] (ARROW-4435) [C#] Add .sln file and minor .csproj fix ups
Eric Erhardt created ARROW-4435:
---
Summary: [C#] Add .sln file and minor .csproj fix ups
Key: ARROW-4435
URL: https://issues.apache.org/jira/browse/ARROW-4435
Project: Apache Arrow
Issue Type: Task
Components: C#
Reporter: Eric Erhardt

There is currently no .sln file in the repo, which makes it hard to use the src and test code at the same time.

Also, there are some settings in the .csproj that can be moved up to the outer PropertyGroup, and not under a "Configuration|Platform" conditional, like they were in the old .csproj format.