This is an automated email from the ASF dual-hosted git repository.
kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 80b7658472 GH-36250: [MATLAB] Add `arrow.array.StringArray` class
(#36366)
80b7658472 is described below
commit 80b76584729d72ddf08195f9ed4727d916823e6e
Author: Kevin Gurney <[email protected]>
AuthorDate: Thu Jul 6 20:57:12 2023 -0400
GH-36250: [MATLAB] Add `arrow.array.StringArray` class (#36366)
### Rationale for this change
Thanks to @sgilmore10's [recent changes to enable UTF-8 <-> UTF-16 string
conversions](#36167), we can now add support for creating Arrow `String` arrays
(UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).
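To make the encoding change concrete, here is a minimal, self-contained sketch of a UTF-16 to UTF-8 conversion. It is illustrative only: `utf16_to_utf8_sketch` is a hypothetical name, and this is not the implementation behind `arrow::util::UTF16StringToUTF8`, which the diff calls. The sketch assumes that unpaired surrogates should be rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper (not part of Arrow): encode UTF-16 code units as UTF-8.
// Returns std::nullopt when the input contains an unpaired surrogate.
std::optional<std::string> utf16_to_utf8_sketch(const std::u16string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      if (i + 1 >= in.size()) return std::nullopt;
      const std::uint32_t lo = in[i + 1];
      if (lo < 0xDC00 || lo > 0xDFFF) return std::nullopt;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
      ++i;  // Both code units of the pair have been consumed.
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return std::nullopt;  // Low surrogate without a preceding high surrogate.
    }
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return out;
}
```

Characters outside the Basic Multilingual Plane, such as emoji, occupy two UTF-16 code units in MATLAB and four bytes in Arrow, which is why a plain byte copy is not enough.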
### What changes are included in this PR?
1. Added new `arrow.array.StringArray` class that can be constructed from
MATLAB
[`string`](https://www.mathworks.com/help/matlab/ref/string.html?s_tid=doc_ta)
and [`cellstr`](https://www.mathworks.com/help/matlab/ref/cellstr.html) types.
**Note**: We explicitly decided to *not* support
[`char`](https://www.mathworks.com/help/matlab/ref/char.html?s_tid=doc_ta)
arrays for the time being.
2. Factored out code for extracting "raw" `const uint8_t*` from a MATLAB
`logical` Data Array into a new function `bit::extract_ptr` so that it can
be reused across multiple Array `Proxy` classes. See
https://github.com/apache/arrow/issues/36335.
3. Added new `arrow.type.StringType` type class and associated
`arrow.type.ID.String` enum value.
4. Enabled support for creating `RecordBatch` objects from MATLAB `table`s
containing `string` data.
5. Updated `arrow::matlab::array::proxy::Array::toString` code to convert
from UTF-8 to UTF-16 for display in MATLAB.
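The pointer-extraction helper from item 2 can be pictured with the following standalone sketch. It uses `std::vector<std::uint8_t>` as a stand-in for a MATLAB `logical` array (an assumption made so the example compiles without the MATLAB Data API), and the names `UnpackedBitmap`, `extract_ptr_sketch`, and `count_nulls_sketch` are hypothetical. The convention that a null pointer means "every element is valid" follows how Arrow builders treat an absent `valid_bytes` argument.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for an unpacked MATLAB logical array: one byte per element,
// non-zero meaning "valid".
using UnpackedBitmap = std::vector<std::uint8_t>;

// Hypothetical analogue of the shared helper: return a raw pointer to the
// unpacked validity bytes, or nullptr when there are no elements.
const std::uint8_t* extract_ptr_sketch(const UnpackedBitmap& unpacked) {
  return unpacked.empty() ? nullptr : unpacked.data();
}

// Example consumer: count null elements, treating nullptr as "all valid".
std::size_t count_nulls_sketch(const std::uint8_t* valid_bytes, std::size_t length) {
  if (valid_bytes == nullptr) {
    return 0;
  }
  std::size_t nulls = 0;
  for (std::size_t i = 0; i < length; ++i) {
    if (valid_bytes[i] == 0) {
      ++nulls;
    }
  }
  return nulls;
}
```

Keeping this logic in one place means each array `Proxy` class handles the empty-input case the same way.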
**Examples**
*Most MATLAB `string` arrays round-trip*
```matlab
>> matlabArray = ["A"; "B"; "C"]
matlabArray =
3x1 string array
"A"
"B"
"C"
>> arrowArray = arrow.array.StringArray(matlabArray)
arrowArray =
[
"A",
"B",
"C"
]
>> matlabArrayRoundTrip = toMATLAB(arrowArray)
matlabArrayRoundTrip =
3x1 string array
"A"
"B"
"C"
>> isequal(matlabArray, matlabArrayRoundTrip)
ans =
logical
1
```
*MATLAB `string(missing)` values get mapped to `null` by default*
```matlab
>> matlabArray = ["A"; string(missing); "C"]
matlabArray =
3x1 string array
"A"
<missing>
"C"
>> arrowArray = arrow.array.StringArray(matlabArray)
arrowArray =
[
"A",
null,
"C"
]
>> matlabArrayRoundTrip = toMATLAB(arrowArray)
matlabArrayRoundTrip =
3x1 string array
"A"
<missing>
"C"
>> isequaln(matlabArray, matlabArrayRoundTrip)
ans =
logical
1
```
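The missing-value handling shown above can be sketched in standalone C++. In the diff, the C++ proxy substitutes the empty string for each `string(missing)` element and passes validity separately, and the MATLAB class restores `string(missing)` for invalid elements on the way back. The sketch below mirrors that idea with `std::optional`; the type and function names are hypothetical and no Arrow or MATLAB APIs are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container: string values plus one validity byte per element.
struct NullableStringsSketch {
  std::vector<std::string> values;
  std::vector<std::uint8_t> valid;  // 1 = valid, 0 = null
};

// Replace missing elements with "" and record which elements were valid.
NullableStringsSketch encode_sketch(
    const std::vector<std::optional<std::string>>& input) {
  NullableStringsSketch out;
  out.values.reserve(input.size());
  out.valid.reserve(input.size());
  for (const auto& element : input) {
    out.values.push_back(element.value_or(""));
    out.valid.push_back(element.has_value() ? 1 : 0);
  }
  return out;
}

// Restore missing elements wherever the validity byte is zero.
std::vector<std::optional<std::string>> decode_sketch(
    const NullableStringsSketch& encoded) {
  assert(encoded.values.size() == encoded.valid.size());
  std::vector<std::optional<std::string>> out;
  out.reserve(encoded.values.size());
  for (std::size_t i = 0; i < encoded.values.size(); ++i) {
    if (encoded.valid[i] != 0) {
      out.emplace_back(encoded.values[i]);
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

Because validity is stored apart from the values, a genuinely empty string (`""`) and a null remain distinguishable after a round trip.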
*Unicode characters round-trip*
```matlab
>> matlabArray = ["😊"; "🌲"; "➞"]
matlabArray =
3×1 string array
"😊"
"🌲"
"➞"
>> arrowArray = arrow.array.StringArray(matlabArray)
arrowArray =
[
"😊",
"🌲",
"➞"
]
>> matlabArrayRoundTrip = toMATLAB(arrowArray)
matlabArrayRoundTrip =
3×1 string array
"😊"
"🌲"
"➞"
```
*Create `StringArray` from `cellstr`*
```matlab
>> matlabArray = {'red'; 'green'; 'blue'}
matlabArray =
3×1 cell array
{'red' }
{'green'}
{'blue' }
>> arrowArray = arrow.array.StringArray(matlabArray)
arrowArray =
[
"red",
"green",
"blue"
]
>> matlabArrayRoundTrip = toMATLAB(arrowArray)
matlabArrayRoundTrip =
3×1 string array
"red"
"green"
"blue"
```
*Create `RecordBatch` from MATLAB `string` data*
```matlab
>> matlabTable = table(["😊"; "🌲"; "➞"])
matlabTable =
3×1 table
Var1
____
"😊"
"🌲"
"➞"
>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)
arrowRecordBatch =
Var1: [
"😊",
"🌲",
"➞"
]
>> matlabTableRoundTrip = toMATLAB(arrowRecordBatch)
matlabTableRoundTrip =
3×1 table
Var1
____
"😊"
"🌲"
"➞"
>> isequaln(matlabTable, matlabTableRoundTrip)
ans =
logical
1
```
### Are these changes tested?
Yes.
1. Added new `tStringArray` test class.
2. Added new `tStringType` test class.
3. Extended `tRecordBatch` test class to verify support for MATLAB `table`s
which contain `string` data (see above).
### Are there any user-facing changes?
Yes.
1. Users can now create `arrow.array.StringArray` objects from MATLAB
`string` arrays and `cellstr`s.
2. Users can now create `arrow.type.StringType` objects.
3. Users can now construct `RecordBatch` objects from MATLAB `table`s that
contain `string` data.
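For context on the new type class, `arrow.type.StringType` reports `NumBuffers = 3` in the diff, which matches Arrow's variable-size binary layout: a validity bitmap, an offsets buffer with one more entry than the array length, and a data buffer of concatenated UTF-8 bytes. The sketch below builds the offsets and data for a small input. It is a simplified illustration with hypothetical names, and it keeps validity as one byte per element for readability, whereas Arrow packs validity into bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified picture of the three buffers of a string array.
struct StringLayoutSketch {
  std::vector<std::uint8_t> validity;  // one byte per element here; Arrow uses bits
  std::vector<std::int32_t> offsets;   // length + 1 entries, starting at 0
  std::string data;                    // concatenated UTF-8 bytes
};

// Build the layout from values and per-element validity bytes (non-zero = valid).
// Null elements contribute no bytes, so their start and end offsets are equal.
StringLayoutSketch build_layout_sketch(const std::vector<std::string>& values,
                                       const std::vector<std::uint8_t>& valid) {
  StringLayoutSketch layout;
  layout.validity = valid;
  layout.offsets.push_back(0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    const bool is_valid = i < valid.size() && valid[i] != 0;
    if (is_valid) {
      layout.data += values[i];
    }
    layout.offsets.push_back(static_cast<std::int32_t>(layout.data.size()));
  }
  return layout;
}
```

Element `i` then spans bytes `offsets[i]` up to, but not including, `offsets[i + 1]` of the data buffer.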
### Future Directions
1. The implementation of this initial version of `StringArray` is
relatively simple in that it does not include a `BinaryArray` class hierarchy.
In the future, we will likely want to refactor `StringArray` to inherit from a
more general abstract `BinaryArray` class hierarchy.
2. Following on from 1., we will ideally want to add support for
`LargeStringArray`, `BinaryArray`, `LargeBinaryArray`, and
`FixedLengthBinaryArray` by creating common infrastructure for representing
binary types. This initial version of `StringArray` helps to solidify the
user-facing design and provide a shorter term solution to working with `string`
data, since it is quite common.
3. It may make sense to change the `arrow.type.Type` hierarchy (e.g.
`arrow.type.StringType`) in the future to delegate to C++ `Proxy` classes under
the hood. See: #36363.
4. Use `bit::extract_ptr` in other classes. See
https://github.com/apache/arrow/issues/36335.
5. Look for more ways to optimize the conversion from MATLAB UTF-16 encoded
string data to Arrow UTF-8 encoded string data (e.g. by avoiding unnecessary
data copies).
### Notes
1. Thank you @sgilmore10 for your help with this pull request!
* Closes: #36250
Lead-authored-by: Kevin Gurney <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
---
matlab/src/cpp/arrow/matlab/array/proxy/array.cc | 10 +-
.../cpp/arrow/matlab/array/proxy/numeric_array.h | 14 +-
.../cpp/arrow/matlab/array/proxy/string_array.cc | 81 ++++++++
.../{bit/unpack.h => array/proxy/string_array.h} | 21 +-
matlab/src/cpp/arrow/matlab/bit/unpack.cc | 9 +
matlab/src/cpp/arrow/matlab/bit/unpack.h | 1 +
matlab/src/cpp/arrow/matlab/error/error.h | 3 +-
matlab/src/cpp/arrow/matlab/proxy/factory.cc | 4 +-
matlab/src/matlab/+arrow/+array/StringArray.m | 54 +++++
matlab/src/matlab/+arrow/+tabular/RecordBatch.m | 2 +
matlab/src/matlab/+arrow/+type/ID.m | 2 +-
matlab/src/matlab/+arrow/+type/StringType.m | 29 +++
matlab/test/arrow/array/tStringArray.m | 231 +++++++++++++++++++++
matlab/test/arrow/tabular/tRecordBatch.m | 1 +
matlab/test/arrow/type/tStringType.m | 41 ++++
matlab/tools/cmake/BuildMatlabArrowInterface.cmake | 3 +-
16 files changed, 482 insertions(+), 24 deletions(-)
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
b/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
index 6f5b8b12f2..35dc496bdd 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
@@ -15,9 +15,11 @@
// specific language governing permissions and limitations
// under the License.
-#include "arrow/matlab/array/proxy/array.h"
+#include "arrow/util/utf8.h"
+#include "arrow/matlab/array/proxy/array.h"
#include "arrow/matlab/bit/unpack.h"
+#include "arrow/matlab/error/error.h"
namespace arrow::matlab::array::proxy {
@@ -36,9 +38,9 @@ namespace arrow::matlab::array::proxy {
void Array::toString(libmexclass::proxy::method::Context& context) {
::matlab::data::ArrayFactory factory;
-
- // TODO: handle non-ascii characters
- auto str_mda = factory.createScalar(array->ToString());
+ const auto str_utf8 = array->ToString();
+ MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto str_utf16,
arrow::util::UTF8StringToUTF16(str_utf8), context,
error::UNICODE_CONVERSION_ERROR_ID);
+ auto str_mda = factory.createScalar(str_utf16);
context.outputs[0] = str_mda;
}
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
b/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
index 62c6d9dc26..43e7aec622 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
@@ -27,22 +27,12 @@
#include "arrow/matlab/array/proxy/array.h"
#include "arrow/matlab/error/error.h"
#include "arrow/matlab/bit/pack.h"
+#include "arrow/matlab/bit/unpack.h"
#include "libmexclass/proxy/Proxy.h"
namespace arrow::matlab::array::proxy {
-namespace {
-const uint8_t* getUnpackedValidityBitmap(const
::matlab::data::TypedArray<bool>& valid_elements) {
- if (valid_elements.getNumberOfElements() > 0) {
- const auto valid_elements_iterator(valid_elements.cbegin());
- return reinterpret_cast<const
uint8_t*>(valid_elements_iterator.operator->());
- } else {
- return nullptr;
- }
-}
-} // anonymous namespace
-
template<typename CType>
class NumericArray : public arrow::matlab::array::proxy::Array {
public:
@@ -70,7 +60,7 @@ class NumericArray : public
arrow::matlab::array::proxy::Array {
if (make_deep_copy) {
// Get the unpacked validity bitmap (if it exists)
- auto unpacked_validity_bitmap =
getUnpackedValidityBitmap(valid_mda);
+ auto unpacked_validity_bitmap = bit::extract_ptr(valid_mda);
BuilderType builder;
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
new file mode 100644
index 0000000000..51f39d72fc
--- /dev/null
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
@@ -0,0 +1,81 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/matlab/array/proxy/string_array.h"
+
+#include "arrow/array/builder_binary.h"
+
+#include "arrow/matlab/error/error.h"
+#include "arrow/matlab/bit/pack.h"
+#include "arrow/matlab/bit/unpack.h"
+#include "arrow/util/utf8.h"
+
+namespace arrow::matlab::array::proxy {
+
+ libmexclass::proxy::MakeResult StringArray::make(const
libmexclass::proxy::FunctionArguments& constructor_arguments) {
+ namespace mda = ::matlab::data;
+
+ mda::StructArray opts = constructor_arguments[0];
+ const mda::StringArray array_mda = opts[0]["MatlabArray"];
+ const mda::TypedArray<bool> unpacked_validity_bitmap_mda =
opts[0]["Valid"];
+
+ // Convert UTF-16 encoded MATLAB string values to UTF-8 encoded
Arrow string values.
+ const auto array_length = array_mda.getNumberOfElements();
+ std::vector<std::string> strings;
+ strings.reserve(array_length);
+ for (const auto& str : array_mda) {
+ if (!str) {
+ // Substitute MATLAB string(missing) values with the empty
string value ("")
+ strings.emplace_back("");
+ } else {
+ MATLAB_ASSIGN_OR_ERROR(auto str_utf8,
arrow::util::UTF16StringToUTF8(*str), error::UNICODE_CONVERSION_ERROR_ID);
+ strings.push_back(std::move(str_utf8));
+ }
+ }
+
+ auto unpacked_validity_bitmap_ptr =
bit::extract_ptr(unpacked_validity_bitmap_mda);
+
+ // Build up an Arrow StringArray from a vector of UTF-8 encoded
strings.
+ arrow::StringBuilder builder;
+ MATLAB_ERROR_IF_NOT_OK(builder.AppendValues(strings,
unpacked_validity_bitmap_ptr), error::STRING_BUILDER_APPEND_FAILED);
+ MATLAB_ASSIGN_OR_ERROR(auto array, builder.Finish(),
error::STRING_BUILDER_FINISH_FAILED);
+
+ return
std::make_shared<arrow::matlab::array::proxy::StringArray>(array);
+ }
+
+ void StringArray::toMATLAB(libmexclass::proxy::method::Context&
context) {
+ namespace mda = ::matlab::data;
+
+ // Convert UTF-8 encoded Arrow string values to UTF-16 encoded
MATLAB string values.
+ auto array_length = static_cast<size_t>(array->length());
+ std::vector<mda::MATLABString> strings;
+ strings.reserve(array_length);
+ for (size_t i = 0; i < array_length; ++i) {
+ auto string_array =
std::static_pointer_cast<arrow::StringArray>(array);
+ auto str_utf8 = string_array->GetView(i);
+ MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto str_utf16,
arrow::util::UTF8StringToUTF16(str_utf8), context,
error::UNICODE_CONVERSION_ERROR_ID);
+ const mda::MATLABString matlab_string =
mda::MATLABString(std::move(str_utf16));
+ strings.push_back(matlab_string);
+ }
+
+ // Create a MATLAB String array from a vector of UTF-16 encoded
strings.
+ mda::ArrayFactory factory;
+ auto array_mda = factory.createArray({array_length, 1},
strings.begin(), strings.end());
+ context.outputs[0] = array_mda;
+ }
+
+}
diff --git a/matlab/src/cpp/arrow/matlab/bit/unpack.h
b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
similarity index 56%
copy from matlab/src/cpp/arrow/matlab/bit/unpack.h
copy to matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
index 2d7294d9d5..de0c462592 100644
--- a/matlab/src/cpp/arrow/matlab/bit/unpack.h
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
@@ -17,10 +17,23 @@
#pragma once
-#include "arrow/buffer.h"
+#include "arrow/matlab/array/proxy/array.h"
-#include "MatlabDataArray.hpp"
+#include "libmexclass/proxy/Proxy.h"
+
+namespace arrow::matlab::array::proxy {
+
+ class StringArray : public arrow::matlab::array::proxy::Array {
+ public:
+ StringArray(const std::shared_ptr<arrow::Array> string_array)
+ : arrow::matlab::array::proxy::Array() {
+ array = string_array;
+ }
+
+ static libmexclass::proxy::MakeResult make(const
libmexclass::proxy::FunctionArguments& constructor_arguments);
+
+ protected:
+ void toMATLAB(libmexclass::proxy::method::Context& context)
override;
+ };
-namespace arrow::matlab::bit {
- ::matlab::data::TypedArray<bool> unpack(const
std::shared_ptr<arrow::Buffer>& packed_buffer, int64_t length);
}
diff --git a/matlab/src/cpp/arrow/matlab/bit/unpack.cc
b/matlab/src/cpp/arrow/matlab/bit/unpack.cc
index f6c1644909..7135d593cf 100644
--- a/matlab/src/cpp/arrow/matlab/bit/unpack.cc
+++ b/matlab/src/cpp/arrow/matlab/bit/unpack.cc
@@ -38,4 +38,13 @@ namespace arrow::matlab::bit {
return unpacked_matlab_logical_Array;
}
+
+ const uint8_t* extract_ptr(const ::matlab::data::TypedArray<bool>&
unpacked_validity_bitmap) {
+ if (unpacked_validity_bitmap.getNumberOfElements() > 0) {
+ const auto
unpacked_validity_bitmap_iterator(unpacked_validity_bitmap.cbegin());
+ return reinterpret_cast<const
uint8_t*>(unpacked_validity_bitmap_iterator.operator->());
+ } else {
+ return nullptr;
+ }
+ }
}
diff --git a/matlab/src/cpp/arrow/matlab/bit/unpack.h
b/matlab/src/cpp/arrow/matlab/bit/unpack.h
index 2d7294d9d5..b6debb85f8 100644
--- a/matlab/src/cpp/arrow/matlab/bit/unpack.h
+++ b/matlab/src/cpp/arrow/matlab/bit/unpack.h
@@ -23,4 +23,5 @@
namespace arrow::matlab::bit {
::matlab::data::TypedArray<bool> unpack(const
std::shared_ptr<arrow::Buffer>& packed_buffer, int64_t length);
+ const uint8_t* extract_ptr(const ::matlab::data::TypedArray<bool>&
unpacked_validity_bitmap);
}
diff --git a/matlab/src/cpp/arrow/matlab/error/error.h
b/matlab/src/cpp/arrow/matlab/error/error.h
index 598db363f3..b1b7b75b8c 100644
--- a/matlab/src/cpp/arrow/matlab/error/error.h
+++ b/matlab/src/cpp/arrow/matlab/error/error.h
@@ -168,6 +168,7 @@ namespace arrow::matlab::error {
static const char* SCHEMA_BUILDER_FINISH_ERROR_ID =
"arrow:matlab:tabular:proxy:SchemaBuilderAddFields";
static const char* SCHEMA_BUILDER_ADD_FIELDS_ERROR_ID =
"arrow:matlab:tabular:proxy:SchemaBuilderFinish";
static const char* UNICODE_CONVERSION_ERROR_ID =
"arrow:matlab:unicode:UnicodeConversion";
+ static const char* STRING_BUILDER_APPEND_FAILED =
"arrow:matlab:array:string:StringBuilderAppendFailed";
+ static const char* STRING_BUILDER_FINISH_FAILED =
"arrow:matlab:array:string:StringBuilderFinishFailed";
static const char* UKNOWN_TIME_UNIT_ERROR_ID =
"arrow:matlab:UnknownTimeUnit";
-
}
diff --git a/matlab/src/cpp/arrow/matlab/proxy/factory.cc
b/matlab/src/cpp/arrow/matlab/proxy/factory.cc
index 94ee1ca892..41f1357bce 100644
--- a/matlab/src/cpp/arrow/matlab/proxy/factory.cc
+++ b/matlab/src/cpp/arrow/matlab/proxy/factory.cc
@@ -17,8 +17,9 @@
#include "arrow/matlab/array/proxy/boolean_array.h"
#include "arrow/matlab/array/proxy/numeric_array.h"
-#include "arrow/matlab/tabular/proxy/record_batch.h"
+#include "arrow/matlab/array/proxy/string_array.h"
#include "arrow/matlab/array/proxy/timestamp_array.h"
+#include "arrow/matlab/tabular/proxy/record_batch.h"
#include "arrow/matlab/error/error.h"
#include "factory.h"
@@ -37,6 +38,7 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const
ClassName& class_name,
REGISTER_PROXY(arrow.array.proxy.Int32Array ,
arrow::matlab::array::proxy::NumericArray<int32_t>);
REGISTER_PROXY(arrow.array.proxy.Int64Array ,
arrow::matlab::array::proxy::NumericArray<int64_t>);
REGISTER_PROXY(arrow.array.proxy.BooleanArray ,
arrow::matlab::array::proxy::BooleanArray);
+ REGISTER_PROXY(arrow.array.proxy.StringArray ,
arrow::matlab::array::proxy::StringArray);
REGISTER_PROXY(arrow.array.proxy.TimestampArray,
arrow::matlab::array::proxy::TimestampArray);
REGISTER_PROXY(arrow.tabular.proxy.RecordBatch ,
arrow::matlab::tabular::proxy::RecordBatch);
return libmexclass::error::Error{error::UNKNOWN_PROXY_ERROR_ID, "Did not
find matching C++ proxy for " + class_name};
diff --git a/matlab/src/matlab/+arrow/+array/StringArray.m
b/matlab/src/matlab/+arrow/+array/StringArray.m
new file mode 100644
index 0000000000..9ef3f02525
--- /dev/null
+++ b/matlab/src/matlab/+arrow/+array/StringArray.m
@@ -0,0 +1,54 @@
+% Licensed to the Apache Software Foundation (ASF) under one or more
+% contributor license agreements. See the NOTICE file distributed with
+% this work for additional information regarding copyright ownership.
+% The ASF licenses this file to you under the Apache License, Version
+% 2.0 (the "License"); you may not use this file except in compliance
+% with the License. You may obtain a copy of the License at
+%
+% http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS,
+% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+% implied. See the License for the specific language governing
+% permissions and limitations under the License.
+
+classdef StringArray < arrow.array.Array
+% arrow.array.StringArray
+
+ properties (Hidden, SetAccess=private)
+ NullSubstitionValue = string(missing);
+ end
+
+ properties(SetAccess=private, GetAccess=public)
+ Type = arrow.type.StringType
+ end
+
+ methods
+ function obj = StringArray(data, opts)
+ arguments
+ data
+ opts.InferNulls(1,1) logical = true
+ opts.Valid
+ end
+ % Support constructing a StringArray from a cell array of strings
(i.e. cellstr),
+ % or a string array, but not a char array.
+ if ~ischar(data)
+ data = convertCharsToStrings(data);
+ end
+ arrow.args.validateTypeAndShape(data, "string");
+ validElements = arrow.args.parseValidElements(data, opts);
+ opts = struct(MatlabArray=data, Valid=validElements);
+ [email protected]("Name", "arrow.array.proxy.StringArray",
"ConstructorArguments", {opts});
+ end
+
+ function data = string(obj)
+ data = obj.toMATLAB();
+ end
+
+ function matlabArray = toMATLAB(obj)
+ matlabArray = obj.Proxy.toMATLAB();
+ matlabArray(~obj.Valid) = obj.NullSubstitionValue;
+ end
+ end
+end
diff --git a/matlab/src/matlab/+arrow/+tabular/RecordBatch.m
b/matlab/src/matlab/+arrow/+tabular/RecordBatch.m
index 5e5ab1d1d7..a7feb0c0a3 100644
--- a/matlab/src/matlab/+arrow/+tabular/RecordBatch.m
+++ b/matlab/src/matlab/+arrow/+tabular/RecordBatch.m
@@ -121,6 +121,8 @@ classdef RecordBatch < matlab.mixin.CustomDisplay & ...
arrowArray = arrow.array.Int64Array(matlabArray);
case "logical"
arrowArray = arrow.array.BooleanArray(matlabArray);
+ case "string"
+ arrowArray = arrow.array.StringArray(matlabArray);
case "datetime"
arrowArray = arrow.array.TimestampArray(matlabArray);
otherwise
diff --git a/matlab/src/matlab/+arrow/+type/ID.m
b/matlab/src/matlab/+arrow/+type/ID.m
index 0450fe8aea..2e320603d0 100644
--- a/matlab/src/matlab/+arrow/+type/ID.m
+++ b/matlab/src/matlab/+arrow/+type/ID.m
@@ -28,7 +28,7 @@ classdef ID < uint64
% Float16 (10) not yet supported
Float32 (11)
Float64 (12)
- % String (13)
+ String (13)
% Binary (14)
% FixedSizeBinary (15)
% Date32 (16)
diff --git a/matlab/src/matlab/+arrow/+type/StringType.m
b/matlab/src/matlab/+arrow/+type/StringType.m
new file mode 100644
index 0000000000..66a15dd0ea
--- /dev/null
+++ b/matlab/src/matlab/+arrow/+type/StringType.m
@@ -0,0 +1,29 @@
+% Licensed to the Apache Software Foundation (ASF) under one or more
+% contributor license agreements. See the NOTICE file distributed with
+% this work for additional information regarding copyright ownership.
+% The ASF licenses this file to you under the Apache License, Version
+% 2.0 (the "License"); you may not use this file except in compliance
+% with the License. You may obtain a copy of the License at
+%
+% http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS,
+% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+% implied. See the License for the specific language governing
+% permissions and limitations under the License.
+
+classdef StringType < arrow.type.Type
+%STRINGTYPE Type class for string data.
+
+ properties(SetAccess = protected)
+ ID = arrow.type.ID.String
+ end
+
+ properties(Constant)
+ NumFields = 0
+ NumBuffers = 3
+ end
+
+end
+
diff --git a/matlab/test/arrow/array/tStringArray.m
b/matlab/test/arrow/array/tStringArray.m
new file mode 100644
index 0000000000..000a57b27b
--- /dev/null
+++ b/matlab/test/arrow/array/tStringArray.m
@@ -0,0 +1,231 @@
+% Licensed to the Apache Software Foundation (ASF) under one or more
+% contributor license agreements. See the NOTICE file distributed with
+% this work for additional information regarding copyright ownership.
+% The ASF licenses this file to you under the Apache License, Version
+% 2.0 (the "License"); you may not use this file except in compliance
+% with the License. You may obtain a copy of the License at
+%
+% http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS,
+% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+% implied. See the License for the specific language governing
+% permissions and limitations under the License.
+
+classdef tStringArray < matlab.unittest.TestCase
+% Test class for arrow.array.StringArray
+
+ properties
+ ArrowArrayClassName(1, 1) string = "arrow.array.StringArray"
+ ArrowArrayConstructor = @arrow.array.StringArray
+ MatlabArrayFcn = @string
+ MatlabConversionFcn = @string
+ NullSubstitutionValue = string(missing)
+ ArrowType = arrow.type.StringType
+ end
+
+ methods(TestClassSetup)
+ function verifyOnMatlabPath(tc)
+ % Verify the arrow array class is on the MATLAB Search Path.
+ tc.assertTrue(~isempty(which(tc.ArrowArrayClassName)), ...
+ """" + tc.ArrowArrayClassName + """must be on the MATLAB path.
" + ...
+ "Use ""addpath"" to add folders to the MATLAB path.");
+ end
+ end
+
+ methods(Test)
+ function BasicTest(tc)
+ A = tc.ArrowArrayConstructor(tc.MatlabArrayFcn(["A", "B", "C"]));
+ className = string(class(A));
+ tc.verifyEqual(className, tc.ArrowArrayClassName);
+ end
+
+ function ToMATLAB(tc)
+ % Create array from a scalar
+ A1 = tc.ArrowArrayConstructor(tc.MatlabArrayFcn("A"));
+ data = toMATLAB(A1);
+ tc.verifyEqual(data, tc.MatlabArrayFcn("A"));
+
+ % Create array from a vector
+ A2 = tc.ArrowArrayConstructor(tc.MatlabArrayFcn(["A", "B", "C"]));
+ data = toMATLAB(A2);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(["A", "B", "C"]'));
+
+ % Create a StringArray from an empty 0x0 string vector
+ A3 = tc.ArrowArrayConstructor(tc.MatlabArrayFcn(string.empty(0,
0)));
+ data = toMATLAB(A3);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(reshape([], 0, 1)));
+
+ % Create a StringArray from an empty 0x1 string vector
+ A4= tc.ArrowArrayConstructor(tc.MatlabArrayFcn(string.empty(0,
1)));
+ data = toMATLAB(A4);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(reshape([], 0, 1)));
+
+ % Create a StringArray from an empty 1x0 string vector
+ A5= tc.ArrowArrayConstructor(tc.MatlabArrayFcn(string.empty(0,
1)));
+ data = toMATLAB(A5);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(reshape([], 0, 1)));
+ end
+
+ function MatlabConversion(tc)
+ % Tests the type-specific conversion method (i.e. string)
+
+ % Create array from a scalar
+ A1 = tc.ArrowArrayConstructor(tc.MatlabArrayFcn("A"));
+ data = tc.MatlabConversionFcn(A1);
+ tc.verifyEqual(data, tc.MatlabArrayFcn("A"));
+
+ % Create array from a vector
+ A2 = tc.ArrowArrayConstructor(tc.MatlabArrayFcn(["A", "B", "C"]));
+ data = tc.MatlabConversionFcn(A2);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(["A", "B", "C"]'));
+
+ % Create a StringArray from an empty 0x0 string vector
+ A3 = tc.ArrowArrayConstructor(tc.MatlabArrayFcn(string.empty(0,
0)));
+ data = tc.MatlabConversionFcn(A3);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(reshape([], 0, 1)));
+
+ % Create a StringArray from an empty 0x1 string vector
+ A4= tc.ArrowArrayConstructor(tc.MatlabArrayFcn(string.empty(0,
1)));
+ data = tc.MatlabConversionFcn(A4);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(reshape([], 0, 1)));
+
+ % Create a StringArray from an empty 1x0 string vector
+ A5= tc.ArrowArrayConstructor(tc.MatlabArrayFcn(string.empty(0,
1)));
+ data = tc.MatlabConversionFcn(A5);
+ tc.verifyEqual(data, tc.MatlabArrayFcn(reshape([], 0, 1)));
+ end
+
+ function LogicalValidNVPair(tc)
+ % Verify the expected elements are treated as null when Valid
+ % is provided as a logical array
+ data = tc.MatlabArrayFcn(["A", "B", "C"]');
+ arrowArray = tc.ArrowArrayConstructor(data, Valid=[false true
true]);
+
+ expectedData = data;
+ expectedData(1) = tc.NullSubstitutionValue;
+ tc.verifyEqual(tc.MatlabConversionFcn(arrowArray), expectedData);
+ tc.verifyEqual(toMATLAB(arrowArray), expectedData);
+ tc.verifyEqual(arrowArray.Valid, [false; true; true]);
+ end
+
+ function NumericValidNVPair(tc)
+ % Verify the expected elements are treated as null when Valid
+ % is provided as a array of indices
+ data = tc.MatlabArrayFcn(["A", "B", "C"]');
+ arrowArray = tc.ArrowArrayConstructor(data, Valid=[1, 2]);
+
+ expectedData = data;
+ expectedData(3) = tc.NullSubstitutionValue;
+ tc.verifyEqual(tc.MatlabConversionFcn(arrowArray), expectedData);
+ tc.verifyEqual(toMATLAB(arrowArray), expectedData);
+ tc.verifyEqual(arrowArray.Valid, [true; true; false]);
+
+
+ % Make sure the optimization where the valid-bitmap is stored as
+ % a nullptr works as expected.
+ expectedData = data;
+ arrowArray = tc.ArrowArrayConstructor(data, Valid=[1, 2, 3]);
+ tc.verifyEqual(tc.MatlabConversionFcn(arrowArray), expectedData);
+ tc.verifyEqual(toMATLAB(arrowArray), expectedData);
+ tc.verifyEqual(arrowArray.Valid, [true; true; true]);
+ end
+
+ function ErrorIfNonVector(tc)
+ data = tc.MatlabArrayFcn(["A", "B", "A", "B", "A", "B", "A", "B",
"A"]);
+ data = reshape(data, 3, 1, 3);
+ fcn = @() tc.ArrowArrayConstructor(tc.MatlabArrayFcn(data));
+ tc.verifyError(fcn, "MATLAB:expectedVector");
+ end
+
+ function ErrorIfEmptyArrayIsNotTwoDimensional(tc)
+ data = tc.MatlabArrayFcn(reshape(string.empty(0, 0), [1 0 0]));
+ fcn = @() tc.ArrowArrayConstructor(data);
+ tc.verifyError(fcn, "MATLAB:expected2D");
+ end
+
+ function TestArrowType(tc)
+ % Verify the array has the expected arrow.type.Type object
+ data = tc.MatlabArrayFcn(["A", "B"]);
+ arrowArray = tc.ArrowArrayConstructor(data);
+ tc.verifyEqual(arrowArray.Type, tc.ArrowType);
+ end
+
+ function Unicode(tc)
+ % Verify that Unicode characters are preserved during round-trip
+ % conversion.
+ smiley = "😀";
+ tree = "🌲";
+ mango = "🥭";
+
+ matlabArray = tc.MatlabArrayFcn([smiley; tree; mango]);
+ arrowArray = tc.ArrowArrayConstructor(matlabArray);
+ matlabArrayConverted = toMATLAB(arrowArray);
+ tc.verifyEqual(matlabArrayConverted, matlabArray);
+ end
+
+ function Missing(tc)
+ % Verify that string(missing) values get mapped to the empty
+ % string value when InferNulls=false.
+ matlabArray = tc.MatlabArrayFcn(["A"; string(missing);
string(missing)]);
+ arrowArray = tc.ArrowArrayConstructor(matlabArray,
InferNulls=false);
+ matlabArrayConverted = toMATLAB(arrowArray);
+ tc.verifyEqual(matlabArrayConverted, ["A"; ""; ""]);
+ end
+
+ function CellStr(tc)
+ % Verify that a StringArray can be constructed from
+ % a cell array of character vectors (i.e. cellstr).
+
+ % Row vector
+ matlabArray = {'A', 'B', 'C'};
+ arrowArray = tc.ArrowArrayConstructor(matlabArray);
+ matlabArrayConverted = toMATLAB(arrowArray);
+ tc.verifyEqual(matlabArrayConverted, string(matlabArray'));
+
+ % Column vector
+ matlabArray = {'A'; 'B'; 'C'};
+ arrowArray = tc.ArrowArrayConstructor(matlabArray);
+ matlabArrayConverted = toMATLAB(arrowArray);
+ tc.verifyEqual(matlabArrayConverted, string(matlabArray));
+
+ % One element cellstr
+ matlabArray = {''};
+ arrowArray = tc.ArrowArrayConstructor(matlabArray);
+ matlabArrayConverted = toMATLAB(arrowArray);
+ tc.verifyEqual(matlabArrayConverted, string(matlabArray));
+
+ % Empty cell
+ matlabArray = {};
+ arrowArray = tc.ArrowArrayConstructor(matlabArray);
+ matlabArrayConverted = toMATLAB(arrowArray);
+ tc.verifyEqual(matlabArrayConverted, string.empty(0, 1));
+ end
+
+ function ErrorIfChar(tc)
+ % Verify that an error is thrown when a char array
+ % is passed to the StringArray constructor.
+
+ % Row vector
+ matlabArray = 'abc';
+ tc.verifyError(@() tc.ArrowArrayConstructor(matlabArray),
"MATLAB:invalidType");
+
+ % Column vector
+ matlabArray = ['a';'b';'c'];
+ tc.verifyError(@() tc.ArrowArrayConstructor(matlabArray),
"MATLAB:invalidType");
+
+ % Empty char (0x0)
+ matlabArray = '';
+ tc.verifyError(@() tc.ArrowArrayConstructor(matlabArray),
"MATLAB:invalidType");
+
+ % Empty char (0x1)
+ matlabArray = char.empty(0, 1);
+ tc.verifyError(@() tc.ArrowArrayConstructor(matlabArray),
"MATLAB:invalidType");
+
+ % Empty char (1x0)
+ matlabArray = char.empty(1, 0);
+ tc.verifyError(@() tc.ArrowArrayConstructor(matlabArray),
"MATLAB:invalidType");
+ end
+ end
+end
diff --git a/matlab/test/arrow/tabular/tRecordBatch.m
b/matlab/test/arrow/tabular/tRecordBatch.m
index d0b1df9621..89175c43da 100644
--- a/matlab/test/arrow/tabular/tRecordBatch.m
+++ b/matlab/test/arrow/tabular/tRecordBatch.m
@@ -38,6 +38,7 @@ classdef tRecordBatch < matlab.unittest.TestCase
logical([1, 0, 1]'), ...
single ([1, 2, 3]'), ...
double ([1, 2, 3]'), ...
+ string (["A", "B", "C"]'), ...
datetime(2023, 6, 28) + days(0:2)');
arrowRecordBatch = arrow.tabular.RecordBatch(TOriginal);
TConverted = arrowRecordBatch.toMATLAB();
diff --git a/matlab/test/arrow/type/tStringType.m
b/matlab/test/arrow/type/tStringType.m
new file mode 100644
index 0000000000..f3cf101ac6
--- /dev/null
+++ b/matlab/test/arrow/type/tStringType.m
@@ -0,0 +1,41 @@
+% Licensed to the Apache Software Foundation (ASF) under one or more
+% contributor license agreements. See the NOTICE file distributed with
+% this work for additional information regarding copyright ownership.
+% The ASF licenses this file to you under the Apache License, Version
+% 2.0 (the "License"); you may not use this file except in compliance
+% with the License. You may obtain a copy of the License at
+%
+% http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS,
+% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+% implied. See the License for the specific language governing
+% permissions and limitations under the License.
+
+classdef tStringType < matlab.unittest.TestCase
+%TSTRINGTYPE Test class for arrow.type.StringType
+
+ methods (Test)
+
+ function Basic(tc)
+ type = arrow.type.StringType;
+ className = string(class(type));
+ tc.verifyEqual(className, "arrow.type.StringType");
+ tc.verifyEqual(type.ID, arrow.type.ID.String);
+ end
+
+ function NumBuffers(tc)
+ type = arrow.type.StringType;
+ tc.verifyEqual(type.NumBuffers, 3);
+ end
+
+ function NumFields(tc)
+ type = arrow.type.StringType;
+ tc.verifyEqual(type.NumFields, 0);
+ end
+
+ end
+
+end
+
diff --git a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
index f56321ea73..27a64a19a9 100644
--- a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
+++ b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
@@ -42,8 +42,9 @@ set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_INCLUDE_DIR
"${CMAKE_SOURCE_DIR}/src/c
set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_SOURCES
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/array.cc"
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/boolean_array.cc"
-
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/tabular/proxy/record_batch.cc"
+
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/string_array.cc"
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/timestamp_array.cc"
+
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/tabular/proxy/record_batch.cc"
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/bit/pack.cc"
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/bit/unpack.cc"
"${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/time_unit.cc")