Repository: arrow Updated Branches: refs/heads/master fe45c2bb3 -> de2edc8d5
ARROW-1156: [C++/Python] Expand casting API, add UnaryKernel callable. Use Cast in appropriate places when converting from pandas cc @cloud-fan With this patch we now try to cast to indicated type on ingest of objects from pandas: ``` In [3]: arr = np.array([None] * 5) In [4]: pa.Array.from_pandas(arr) Out[4]: <pyarrow.lib.NullArray object at 0x7f6cf1485d18> [ NA, NA, NA, NA, NA ] In [5]: pa.Array.from_pandas(arr, type=pa.int32()) Out[5]: <pyarrow.lib.Int32Array object at 0x7f6cf1485d68> [ NA, NA, NA, NA, NA ] ``` I also added zero-copy casts from integers of the right size to each of the date and time types. Includes refactoring for ARROW-1481. Author: Wes McKinney <[email protected]> Closes #1063 from wesm/ARROW-1156 and squashes the following commits: 166d1a50 [Wes McKinney] iwyu 34f5c9d1 [Wes McKinney] Harden default cast options, fix unsafe Python case 1d07b756 [Wes McKinney] Add some basic casting unit tests in Python c1b45709 [Wes McKinney] Expose arrow::compute::Cast in Python as Array.cast. Still need to write tests a9a04c9c [Wes McKinney] UnaryKernel::Call returns Status for now for simplicity. Support pre-allocated memory 8903709b [Wes McKinney] Implement casts from null to numbers. Try to cast for types where we do not have an inference rule when converting from arrays of Python objects a22dd20a [Wes McKinney] Add test to assert zero copy for compatible integer to date/time a14b83f7 [Wes McKinney] Create callable CastKernel object. Add zero-copy cast rules for date/time types Project: http://git-wip-us.apache.org/repos/asf/arrow/repo Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/de2edc8d Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/de2edc8d Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/de2edc8d Branch: refs/heads/master Commit: de2edc8d591ac8e889495392f8ed20d1d7814dea Parents: fe45c2b Author: Wes McKinney <[email protected]> Authored: Fri Sep 8 10:09:38 2017 -0400 Committer: Wes McKinney <[email protected]> Committed: Fri Sep 8 10:09:38 2017 -0400 ---------------------------------------------------------------------- cpp/README.md | 3 + cpp/build-support/iwyu/mappings/arrow-misc.imp | 5 +- cpp/src/arrow/array.h | 28 +- cpp/src/arrow/compute/CMakeLists.txt | 2 + cpp/src/arrow/compute/api.h | 25 ++ cpp/src/arrow/compute/cast.cc | 358 +++++++++++++------- cpp/src/arrow/compute/cast.h | 14 +- cpp/src/arrow/compute/compute-test.cc | 96 +++++- cpp/src/arrow/compute/context.h | 3 +- cpp/src/arrow/compute/kernel.h | 49 +++ cpp/src/arrow/python/numpy_convert.h | 3 - cpp/src/arrow/python/pandas_to_arrow.cc | 172 ++++++---- python/pyarrow/array.pxi | 41 +++ python/pyarrow/includes/libarrow.pxd | 18 +- python/pyarrow/tests/test_array.py | 67 ++++ python/pyarrow/tests/test_convert_pandas.py | 20 ++ 16 files changed, 687 insertions(+), 217 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/README.md ---------------------------------------------------------------------- diff --git a/cpp/README.md b/cpp/README.md index 993efb8..4a51507 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -126,6 +126,9 @@ interfaces in this library as needed. The CUDA toolchain used to build the library can be customized by using the `$CUDA_HOME` environment variable. +This library is still in Alpha stages, and subject to API changes without +deprecation warnings. + ### API documentation To generate the (html) API documentation, run the following command in the apidoc http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/build-support/iwyu/mappings/arrow-misc.imp ---------------------------------------------------------------------- diff --git a/cpp/build-support/iwyu/mappings/arrow-misc.imp b/cpp/build-support/iwyu/mappings/arrow-misc.imp index 7d9f09c..f39650d 100644 --- a/cpp/build-support/iwyu/mappings/arrow-misc.imp +++ b/cpp/build-support/iwyu/mappings/arrow-misc.imp @@ -24,9 +24,10 @@ { symbol: ["uint8_t", private, "<cstdint>", public ] }, { symbol: ["_Node_const_iterator", private, "<flatbuffers/flatbuffers.h>", public ] }, { symbol: ["unordered_map<>::mapped_type", private, "<flatbuffers/flatbuffers.h>", public ] }, - { symbol: ["__alloc_traits<>::value_type", private, "<vector>", public ] }, { symbol: ["move", private, "<utility>", public ] }, { symbol: ["pair", private, "<utility>", public ] }, { symbol: ["errno", private, "<cerrno>", public ] }, - { symbol: ["posix_memalign", private, "<cstdlib>", public ] } + { symbol: ["posix_memalign", private, "<cstdlib>", public ] }, + { include: ["<ext/alloc_traits.h>", private, "<vector>", public ] }, + { include: ["<string.h>", private, "<cstring>", public ] } ] http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/array.h ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 61ab2ef..3faff71 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -88,47 +88,47 @@ struct ARROW_EXPORT ArrayData { ArrayData() {} ArrayData(const std::shared_ptr<DataType>& type, int64_t length, + int64_t null_count = kUnknownNullCount, int64_t offset = 0) + : type(type), length(length), null_count(null_count), offset(offset) {} + + ArrayData(const std::shared_ptr<DataType>& type, int64_t length, const std::vector<std::shared_ptr<Buffer>>& buffers, int64_t null_count = kUnknownNullCount, int64_t offset = 0) - : type(type), - length(length), - buffers(buffers), - null_count(null_count), - offset(offset) {} + : ArrayData(type, length, null_count, offset) { + this->buffers = buffers; + } ArrayData(const std::shared_ptr<DataType>& type, int64_t length, std::vector<std::shared_ptr<Buffer>>&& buffers, int64_t null_count = kUnknownNullCount, int64_t offset = 0) - : type(type), - length(length), - buffers(std::move(buffers)), - null_count(null_count), - offset(offset) {} + : ArrayData(type, length, null_count, offset) { + this->buffers = std::move(buffers); + } // Move constructor ArrayData(ArrayData&& other) noexcept : type(std::move(other.type)), length(other.length), - buffers(std::move(other.buffers)), null_count(other.null_count), offset(other.offset), + buffers(std::move(other.buffers)), child_data(std::move(other.child_data)) {} ArrayData(const ArrayData& other) noexcept : type(other.type), length(other.length), - buffers(other.buffers), null_count(other.null_count), offset(other.offset), + buffers(other.buffers), child_data(other.child_data) {} // Move assignment ArrayData& operator=(ArrayData&& other) { type = std::move(other.type); length = other.length; - buffers = std::move(other.buffers); null_count = other.null_count; offset = other.offset; + buffers = std::move(other.buffers); child_data = std::move(other.child_data); return *this; } @@ -139,9 +139,9 @@ struct ARROW_EXPORT ArrayData { std::shared_ptr<DataType> type; int64_t length; - std::vector<std::shared_ptr<Buffer>> buffers; int64_t null_count; int64_t offset; + std::vector<std::shared_ptr<Buffer>> buffers; std::vector<std::shared_ptr<ArrayData>> child_data; }; http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/CMakeLists.txt ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/CMakeLists.txt b/cpp/src/arrow/compute/CMakeLists.txt index a154c47..fa475ca 100644 --- a/cpp/src/arrow/compute/CMakeLists.txt +++ b/cpp/src/arrow/compute/CMakeLists.txt @@ -17,8 +17,10 @@ # Headers: top level install(FILES + api.h cast.h context.h + kernel.h DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/compute") ####################################### http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/api.h ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/api.h b/cpp/src/arrow/compute/api.h new file mode 100644 index 0000000..da7df1c --- /dev/null +++ b/cpp/src/arrow/compute/api.h @@ -0,0 +1,25 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_COMPUTE_API_H +#define ARROW_COMPUTE_API_H + +#include "arrow/compute/cast.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" + +#endif // ARROW_COMPUTE_API_H http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/cast.cc ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/cast.cc b/cpp/src/arrow/compute/cast.cc index f610f6b..3885fdf 100644 --- a/cpp/src/arrow/compute/cast.cc +++ b/cpp/src/arrow/compute/cast.cc @@ -18,40 +18,66 @@ #include "arrow/compute/cast.h" #include <cstdint> +#include <cstring> #include <functional> #include <limits> #include <memory> #include <sstream> +#include <string> #include <type_traits> +#include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/type.h" #include "arrow/type_traits.h" +#include "arrow/util/bit-util.h" #include "arrow/util/logging.h" +#include "arrow/util/macros.h" #include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" namespace arrow { namespace compute { -struct CastContext { - FunctionContext* func_ctx; - CastOptions options; +// ---------------------------------------------------------------------- +// Zero copy casts + +template <typename O, typename I, typename Enable = void> +struct is_zero_copy_cast { + static constexpr bool value = false; }; -typedef std::function<void(CastContext*, const ArrayData&, ArrayData*)> CastFunction; +template <typename O, typename I> +struct is_zero_copy_cast<O, I, typename std::enable_if<std::is_same<I, O>::value>::type> { + static constexpr bool value = true; +}; + +// From integers to date/time types with zero copy +template <typename O, typename I> +struct is_zero_copy_cast< + O, I, typename std::enable_if<std::is_base_of<Integer, I>::value && + (std::is_base_of<TimeType, O>::value || + std::is_base_of<DateType, O>::value || + std::is_base_of<TimestampType, O>::value)>::type> { + using O_T = typename O::c_type; + using I_T = typename I::c_type; + + static constexpr bool value = sizeof(O_T) == sizeof(I_T); +}; template <typename OutType, typename InType, typename Enable = void> struct CastFunctor {}; -// Type is the same, no computation required +// Indicated no computation required template <typename O, typename I> -struct CastFunctor<O, I, typename std::enable_if<std::is_same<I, O>::value>::type> { - void operator()(CastContext* ctx, const ArrayData& input, ArrayData* output) { - output->type = input.type; - output->buffers = input.buffers; - output->length = input.length; - output->offset = input.offset; - output->null_count = input.null_count; - output->child_data = input.child_data; +struct CastFunctor<O, I, typename std::enable_if<is_zero_copy_cast<O, I>::value>::type> { + void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, + ArrayData* output) { + auto in_data = input.data(); + output->null_count = input.null_count(); + output->buffers = in_data->buffers; + output->child_data = in_data->child_data; } }; @@ -59,10 +85,13 @@ struct CastFunctor<O, I, typename std::enable_if<std::is_same<I, O>::value>::typ // Null to other things template <typename T> -struct CastFunctor<T, NullType, - typename std::enable_if<!std::is_same<T, NullType>::value>::type> { - void operator()(CastContext* ctx, const ArrayData& input, ArrayData* output) { - ctx->func_ctx->SetStatus(Status::NotImplemented("NullType")); +struct CastFunctor<T, NullType, typename std::enable_if< + std::is_base_of<FixedWidthType, T>::value>::type> { + void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, + ArrayData* output) { + // Simply initialize data to 0 + auto buf = output->buffers[1]; + memset(buf->mutable_data(), 0, buf->size()); } }; @@ -73,13 +102,14 @@ struct CastFunctor<T, NullType, template <typename T> struct CastFunctor<T, BooleanType, typename std::enable_if<std::is_base_of<Number, T>::value>::type> { - void operator()(CastContext* ctx, const ArrayData& input, ArrayData* output) { + void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, + ArrayData* output) { using c_type = typename T::c_type; - const uint8_t* data = input.buffers[1]->data(); + const uint8_t* data = input.data()->buffers[1]->data(); auto out = reinterpret_cast<c_type*>(output->buffers[1]->mutable_data()); constexpr auto kOne = static_cast<c_type>(1); constexpr auto kZero = static_cast<c_type>(0); - for (int64_t i = 0; i < input.length; ++i) { + for (int64_t i = 0; i < input.length(); ++i) { *out++ = BitUtil::GetBit(data, i) ? kOne : kZero; } } @@ -122,11 +152,12 @@ template <typename O, typename I> struct CastFunctor<O, I, typename std::enable_if<std::is_same<BooleanType, O>::value && std::is_base_of<Number, I>::value && !std::is_same<O, I>::value>::type> { - void operator()(CastContext* ctx, const ArrayData& input, ArrayData* output) { + void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, + ArrayData* output) { using in_type = typename I::c_type; - auto in_data = reinterpret_cast<const in_type*>(input.buffers[1]->data()); + auto in_data = reinterpret_cast<const in_type*>(input.data()->buffers[1]->data()); uint8_t* out_data = reinterpret_cast<uint8_t*>(output->buffers[1]->mutable_data()); - for (int64_t i = 0; i < input.length; ++i) { + for (int64_t i = 0; i < input.length(); ++i) { BitUtil::SetBitTo(out_data, i, (*in_data++) != 0); } } @@ -135,39 +166,42 @@ struct CastFunctor<O, I, typename std::enable_if<std::is_same<BooleanType, O>::v template <typename O, typename I> struct CastFunctor<O, I, typename std::enable_if<is_integer_downcast<O, I>::value>::type> { - void operator()(CastContext* ctx, const ArrayData& input, ArrayData* output) { + void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, + ArrayData* output) { using in_type = typename I::c_type; using out_type = typename O::c_type; - auto in_offset = input.offset; + auto in_offset = input.offset(); - auto in_data = reinterpret_cast<const in_type*>(input.buffers[1]->data()) + in_offset; + const auto& input_buffers = input.data()->buffers; + + auto in_data = reinterpret_cast<const in_type*>(input_buffers[1]->data()) + in_offset; auto out_data = reinterpret_cast<out_type*>(output->buffers[1]->mutable_data()); - if (!ctx->options.allow_int_overflow) { + if (!options.allow_int_overflow) { constexpr in_type kMax = static_cast<in_type>(std::numeric_limits<out_type>::max()); constexpr in_type kMin = static_cast<in_type>(std::numeric_limits<out_type>::min()); - if (input.null_count > 0) { - const uint8_t* is_valid = input.buffers[0]->data(); + if (input.null_count() > 0) { + const uint8_t* is_valid = input_buffers[0]->data(); int64_t is_valid_offset = in_offset; - for (int64_t i = 0; i < input.length; ++i) { + for (int64_t i = 0; i < input.length(); ++i) { if (ARROW_PREDICT_FALSE(BitUtil::GetBit(is_valid, is_valid_offset++) && (*in_data > kMax || *in_data < kMin))) { - ctx->func_ctx->SetStatus(Status::Invalid("Integer value out of bounds")); + ctx->SetStatus(Status::Invalid("Integer value out of bounds")); } *out_data++ = static_cast<out_type>(*in_data++); } } else { - for (int64_t i = 0; i < input.length; ++i) { + for (int64_t i = 0; i < input.length(); ++i) { if (ARROW_PREDICT_FALSE(*in_data > kMax || *in_data < kMin)) { - ctx->func_ctx->SetStatus(Status::Invalid("Integer value out of bounds")); + ctx->SetStatus(Status::Invalid("Integer value out of bounds")); } *out_data++ = static_cast<out_type>(*in_data++); } } } else { - for (int64_t i = 0; i < input.length; ++i) { + for (int64_t i = 0; i < input.length(); ++i) { *out_data++ = static_cast<out_type>(*in_data++); } } @@ -178,13 +212,14 @@ template <typename O, typename I> struct CastFunctor<O, I, typename std::enable_if<is_numeric_cast<O, I>::value && !is_integer_downcast<O, I>::value>::type> { - void operator()(CastContext* ctx, const ArrayData& input, ArrayData* output) { + void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, + ArrayData* output) { using in_type = typename I::c_type; using out_type = typename O::c_type; - auto in_data = reinterpret_cast<const in_type*>(input.buffers[1]->data()); + auto in_data = reinterpret_cast<const in_type*>(input.data()->buffers[1]->data()); auto out_data = reinterpret_cast<out_type*>(output->buffers[1]->mutable_data()); - for (int64_t i = 0; i < input.length; ++i) { + for (int64_t i = 0; i < input.length(); ++i) { *out_data++ = static_cast<out_type>(*in_data++); } } @@ -192,12 +227,90 @@ struct CastFunctor<O, I, // ---------------------------------------------------------------------- -#define CAST_CASE(InType, OutType) \ - case OutType::type_id: \ - return [type](CastContext* ctx, const ArrayData& input, ArrayData* out) { \ - CastFunctor<OutType, InType> func; \ - func(ctx, input, out); \ +typedef std::function<void(FunctionContext*, const CastOptions& options, const Array&, + ArrayData*)> + CastFunction; + +static Status AllocateIfNotPreallocated(FunctionContext* ctx, const Array& input, + ArrayData* out) { + if (!is_primitive(out->type->id())) { + return Status::NotImplemented(out->type->ToString()); + } + + const auto& fw_type = static_cast<const FixedWidthType&>(*out->type); + + const int64_t length = input.length(); + + out->null_count = input.null_count(); + + // Propagate bitmap unless we are null type + std::shared_ptr<Buffer> validity_bitmap = input.data()->buffers[0]; + if (input.type_id() == Type::NA) { + int64_t bitmap_size = BitUtil::BytesForBits(length); + RETURN_NOT_OK(ctx->Allocate(bitmap_size, &validity_bitmap)); + memset(validity_bitmap->mutable_data(), 0, bitmap_size); + } + + if (out->buffers.size() == 2) { + // Assuming preallocated, propagage bitmap and move on + out->buffers[0] = validity_bitmap; + return Status::OK(); + } else { + DCHECK_EQ(0, out->buffers.size()); + } + + out->buffers.push_back(validity_bitmap); + + std::shared_ptr<Buffer> out_data; + + int bit_width = fw_type.bit_width(); + int64_t buffer_size = 0; + + if (bit_width == 1) { + buffer_size = BitUtil::BytesForBits(length); + } else if (bit_width % 8 == 0) { + buffer_size = length * fw_type.bit_width() / 8; + } else { + DCHECK(false); + } + + RETURN_NOT_OK(ctx->Allocate(buffer_size, &out_data)); + memset(out_data->mutable_data(), 0, buffer_size); + + out->buffers.push_back(out_data); + + return Status::OK(); +} + +class CastKernel : public UnaryKernel { + public: + CastKernel(const CastOptions& options, const CastFunction& func, bool is_zero_copy) + : options_(options), func_(func), is_zero_copy_(is_zero_copy) {} + + Status Call(FunctionContext* ctx, const Array& input, ArrayData* out) override { + if (!is_zero_copy_) { + RETURN_NOT_OK(AllocateIfNotPreallocated(ctx, input, out)); } + func_(ctx, options_, input, out); + RETURN_IF_ERROR(ctx); + return Status::OK(); + } + + private: + CastOptions options_; + CastFunction func_; + bool is_zero_copy_; +}; + +#define CAST_CASE(InType, OutType) \ + case OutType::type_id: \ + is_zero_copy = is_zero_copy_cast<OutType, InType>::value; \ + func = [](FunctionContext* ctx, const CastOptions& options, const Array& input, \ + ArrayData* out) { \ + CastFunctor<OutType, InType> func; \ + func(ctx, options, input, out); \ + }; \ + break; #define NUMERIC_CASES(FN, IN_TYPE) \ FN(IN_TYPE, BooleanType); \ @@ -212,37 +325,78 @@ struct CastFunctor<O, I, FN(IN_TYPE, FloatType); \ FN(IN_TYPE, DoubleType); -#define GET_CAST_FUNCTION(CapType) \ - static CastFunction Get##CapType##CastFunc(const std::shared_ptr<DataType>& type) { \ - switch (type->id()) { \ - NUMERIC_CASES(CAST_CASE, CapType); \ - default: \ - break; \ - } \ - return nullptr; \ +#define NULL_CASES(FN, IN_TYPE) \ + NUMERIC_CASES(FN, IN_TYPE) \ + FN(NullType, Time32Type); \ + FN(NullType, Date32Type); \ + FN(NullType, TimestampType); \ + FN(NullType, Time64Type); \ + FN(NullType, Date64Type); + +#define INT32_CASES(FN, IN_TYPE) \ + NUMERIC_CASES(FN, IN_TYPE) \ + FN(Int32Type, Time32Type); \ + FN(Int32Type, Date32Type); + +#define INT64_CASES(FN, IN_TYPE) \ + NUMERIC_CASES(FN, IN_TYPE) \ + FN(Int64Type, TimestampType); \ + FN(Int64Type, Time64Type); \ + FN(Int64Type, Date64Type); + +#define DATE32_CASES(FN, IN_TYPE) FN(Date32Type, Date32Type); + +#define DATE64_CASES(FN, IN_TYPE) FN(Date64Type, Date64Type); + +#define TIME32_CASES(FN, IN_TYPE) FN(Time32Type, Time32Type); + +#define TIME64_CASES(FN, IN_TYPE) FN(Time64Type, Time64Type); + +#define TIMESTAMP_CASES(FN, IN_TYPE) FN(TimestampType, TimestampType); + +#define GET_CAST_FUNCTION(CASE_GENERATOR, InType) \ + static std::unique_ptr<UnaryKernel> Get##InType##CastFunc( \ + const std::shared_ptr<DataType>& out_type, const CastOptions& options) { \ + CastFunction func; \ + bool is_zero_copy = false; \ + switch (out_type->id()) { \ + CASE_GENERATOR(CAST_CASE, InType); \ + default: \ + break; \ + } \ + if (func != nullptr) { \ + return std::unique_ptr<UnaryKernel>(new CastKernel(options, func, is_zero_copy)); \ + } \ + return nullptr; \ } -#define CAST_FUNCTION_CASE(CapType) \ - case CapType::type_id: \ - *out = Get##CapType##CastFunc(out_type); \ +GET_CAST_FUNCTION(NULL_CASES, NullType); +GET_CAST_FUNCTION(NUMERIC_CASES, BooleanType); +GET_CAST_FUNCTION(NUMERIC_CASES, UInt8Type); +GET_CAST_FUNCTION(NUMERIC_CASES, Int8Type); +GET_CAST_FUNCTION(NUMERIC_CASES, UInt16Type); +GET_CAST_FUNCTION(NUMERIC_CASES, Int16Type); +GET_CAST_FUNCTION(NUMERIC_CASES, UInt32Type); +GET_CAST_FUNCTION(INT32_CASES, Int32Type); +GET_CAST_FUNCTION(NUMERIC_CASES, UInt64Type); +GET_CAST_FUNCTION(INT64_CASES, Int64Type); +GET_CAST_FUNCTION(NUMERIC_CASES, FloatType); +GET_CAST_FUNCTION(NUMERIC_CASES, DoubleType); +GET_CAST_FUNCTION(DATE32_CASES, Date32Type); +GET_CAST_FUNCTION(DATE64_CASES, Date64Type); +GET_CAST_FUNCTION(TIME32_CASES, Time32Type); +GET_CAST_FUNCTION(TIME64_CASES, Time64Type); +GET_CAST_FUNCTION(TIMESTAMP_CASES, TimestampType); + +#define CAST_FUNCTION_CASE(InType) \ + case InType::type_id: \ + *kernel = Get##InType##CastFunc(out_type, options); \ break -GET_CAST_FUNCTION(BooleanType); -GET_CAST_FUNCTION(UInt8Type); -GET_CAST_FUNCTION(Int8Type); -GET_CAST_FUNCTION(UInt16Type); -GET_CAST_FUNCTION(Int16Type); -GET_CAST_FUNCTION(UInt32Type); -GET_CAST_FUNCTION(Int32Type); -GET_CAST_FUNCTION(UInt64Type); -GET_CAST_FUNCTION(Int64Type); -GET_CAST_FUNCTION(FloatType); -GET_CAST_FUNCTION(DoubleType); - -static Status GetCastFunction(const DataType& in_type, - const std::shared_ptr<DataType>& out_type, - CastFunction* out) { +Status GetCastFunction(const DataType& in_type, const std::shared_ptr<DataType>& out_type, + const CastOptions& options, std::unique_ptr<UnaryKernel>* kernel) { switch (in_type.id()) { + CAST_FUNCTION_CASE(NullType); CAST_FUNCTION_CASE(BooleanType); CAST_FUNCTION_CASE(UInt8Type); CAST_FUNCTION_CASE(Int8Type); @@ -254,10 +408,15 @@ static Status GetCastFunction(const DataType& in_type, CAST_FUNCTION_CASE(Int64Type); CAST_FUNCTION_CASE(FloatType); CAST_FUNCTION_CASE(DoubleType); + CAST_FUNCTION_CASE(Date32Type); + CAST_FUNCTION_CASE(Date64Type); + CAST_FUNCTION_CASE(Time32Type); + CAST_FUNCTION_CASE(Time64Type); + CAST_FUNCTION_CASE(TimestampType); default: break; } - if (*out == nullptr) { + if (*kernel == nullptr) { std::stringstream ss; ss << "No cast implemented from " << in_type.ToString() << " to " << out_type->ToString(); @@ -266,64 +425,19 @@ static Status GetCastFunction(const DataType& in_type, return Status::OK(); } -static Status AllocateLike(FunctionContext* ctx, const Array& array, - const std::shared_ptr<DataType>& out_type, - std::shared_ptr<ArrayData>* out) { - if (!is_primitive(out_type->id())) { - return Status::NotImplemented(out_type->ToString()); - } - - const auto& fw_type = static_cast<const FixedWidthType&>(*out_type); - - auto result = std::make_shared<ArrayData>(); - result->type = out_type; - result->length = array.length(); - result->offset = 0; - result->null_count = array.null_count(); - - // Propagate null bitmap - // TODO(wesm): handling null bitmap when input type is NullType - result->buffers.push_back(array.data()->buffers[0]); - - std::shared_ptr<Buffer> out_data; - - int bit_width = fw_type.bit_width(); - - if (bit_width == 1) { - RETURN_NOT_OK(ctx->Allocate(BitUtil::BytesForBits(array.length()), &out_data)); - } else if (bit_width % 8 == 0) { - RETURN_NOT_OK(ctx->Allocate(array.length() * fw_type.bit_width() / 8, &out_data)); - } else { - DCHECK(false); - } - result->buffers.push_back(out_data); - - *out = result; - return Status::OK(); -} - -static Status Cast(CastContext* cast_ctx, const Array& array, - const std::shared_ptr<DataType>& out_type, - std::shared_ptr<Array>* out) { +Status Cast(FunctionContext* ctx, const Array& array, + const std::shared_ptr<DataType>& out_type, const CastOptions& options, + std::shared_ptr<Array>* out) { // Dynamic dispatch to obtain right cast function - CastFunction func; - RETURN_NOT_OK(GetCastFunction(*array.type(), out_type, &func)); + std::unique_ptr<UnaryKernel> func; + RETURN_NOT_OK(GetCastFunction(*array.type(), out_type, options, &func)); - // Allocate memory for output - std::shared_ptr<ArrayData> out_data; - RETURN_NOT_OK(AllocateLike(cast_ctx->func_ctx, array, out_type, &out_data)); + // Data structure for output + auto out_data = std::make_shared<ArrayData>(out_type, array.length()); - func(cast_ctx, *array.data(), out_data.get()); - RETURN_IF_ERROR(cast_ctx->func_ctx); + RETURN_NOT_OK(func->Call(ctx, array, out_data.get())); return internal::MakeArray(out_data, out); } -Status Cast(FunctionContext* ctx, const Array& array, - const std::shared_ptr<DataType>& out_type, const CastOptions& options, - std::shared_ptr<Array>* out) { - CastContext cast_ctx{ctx, options}; - return Cast(&cast_ctx, array, out_type, out); -} - } // namespace compute } // namespace arrow http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/cast.h ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/cast.h b/cpp/src/arrow/compute/cast.h index 9ca70aa..081cdd9 100644 --- a/cpp/src/arrow/compute/cast.h +++ b/cpp/src/arrow/compute/cast.h @@ -20,21 +20,31 @@ #include <memory> -#include "arrow/array.h" +#include "arrow/status.h" #include "arrow/util/visibility.h" namespace arrow { -using internal::ArrayData; +class Array; +class DataType; namespace compute { class FunctionContext; +class UnaryKernel; struct CastOptions { + CastOptions() : allow_int_overflow(false) {} + bool allow_int_overflow; }; +/// \since 0.7.0 +/// \note API not yet finalized +ARROW_EXPORT +Status GetCastFunction(const DataType& in_type, const std::shared_ptr<DataType>& to_type, + const CastOptions& options, std::unique_ptr<UnaryKernel>* kernel); + /// \brief Cast from one array type to another /// \param[in] context /// \param[in] array http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/compute-test.cc ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/compute-test.cc b/cpp/src/arrow/compute/compute-test.cc index cda5755..ba645c2 100644 --- a/cpp/src/arrow/compute/compute-test.cc +++ b/cpp/src/arrow/compute/compute-test.cc @@ -39,10 +39,14 @@ #include "arrow/compute/cast.h" #include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" using std::vector; namespace arrow { + +using internal::ArrayData; + namespace compute { void AssertArraysEqual(const Array& left, const Array& right) { @@ -72,6 +76,11 @@ class ComputeFixture { // ---------------------------------------------------------------------- // Cast +static void AssertBufferSame(const Array& left, const Array& right, int buffer_index) { + ASSERT_EQ(left.data()->buffers[buffer_index].get(), + right.data()->buffers[buffer_index].get()); +} + class TestCast : public ComputeFixture, public ::testing::Test { public: void CheckPass(const Array& input, const Array& expected, @@ -94,6 +103,13 @@ class TestCast : public ComputeFixture, public ::testing::Test { ASSERT_RAISES(Invalid, Cast(&ctx_, *input, out_type, options, &result)); } + void CheckZeroCopy(const Array& input, const std::shared_ptr<DataType>& out_type) { + std::shared_ptr<Array> result; + ASSERT_OK(Cast(&ctx_, input, out_type, {}, &result)); + AssertBufferSame(input, *result, 0); + AssertBufferSame(input, *result, 1); + } + template <typename InType, typename I_TYPE, typename OutType, typename O_TYPE> void CheckCase(const std::shared_ptr<DataType>& in_type, const std::vector<I_TYPE>& in_values, const std::vector<bool>& is_valid, @@ -121,12 +137,8 @@ TEST_F(TestCast, SameTypeZeroCopy) { std::shared_ptr<Array> result; ASSERT_OK(Cast(&this->ctx_, *arr, int32(), {}, &result)); - const auto& lbuffers = arr->data()->buffers; - const auto& rbuffers = result->data()->buffers; - - // Buffers are the same - ASSERT_EQ(lbuffers[0].get(), rbuffers[0].get()); - ASSERT_EQ(lbuffers[1].get(), rbuffers[1].get()); + AssertBufferSame(*arr, *result, 0); + AssertBufferSame(*arr, *result, 1); } TEST_F(TestCast, ToBoolean) { @@ -311,5 +323,77 @@ TEST_F(TestCast, UnsupportedTarget) { ASSERT_RAISES(NotImplemented, Cast(&this->ctx_, *arr, utf8(), {}, &result)); } +TEST_F(TestCast, DateTimeZeroCopy) { + vector<bool> is_valid = {true, false, true, true, true}; + + std::shared_ptr<Array> arr; + vector<int32_t> v1 = {0, 70000, 2000, 1000, 0}; + ArrayFromVector<Int32Type, int32_t>(int32(), is_valid, v1, &arr); + + CheckZeroCopy(*arr, time32(TimeUnit::SECOND)); + CheckZeroCopy(*arr, date32()); + + vector<int64_t> v2 = {0, 70000, 2000, 1000, 0}; + ArrayFromVector<Int64Type, int64_t>(int64(), is_valid, v2, &arr); + + CheckZeroCopy(*arr, time64(TimeUnit::MICRO)); + CheckZeroCopy(*arr, date64()); + CheckZeroCopy(*arr, timestamp(TimeUnit::NANO)); +} + +TEST_F(TestCast, FromNull) { + // Null casts to everything + const int length = 10; + + NullArray arr(length); + + std::shared_ptr<Array> result; + ASSERT_OK(Cast(&ctx_, arr, int32(), {}, &result)); + + ASSERT_EQ(length, result->length()); + ASSERT_EQ(length, result->null_count()); + + // OK to look at bitmaps + AssertArraysEqual(*result, *result); +} + +TEST_F(TestCast, PreallocatedMemory) { + CastOptions options; + options.allow_int_overflow = false; + + vector<bool> is_valid = {true, false, true, true, true}; + + const int64_t length = 5; + + std::shared_ptr<Array> arr; + vector<int32_t> v1 = {0, 70000, 2000, 1000, 0}; + vector<int64_t> e1 = {0, 70000, 2000, 1000, 0}; + ArrayFromVector<Int32Type, int32_t>(int32(), is_valid, v1, &arr); + + auto out_type = int64(); + + std::unique_ptr<UnaryKernel> kernel; + ASSERT_OK(GetCastFunction(*int32(), out_type, options, &kernel)); + + auto out_data = std::make_shared<ArrayData>(out_type, length); + + std::shared_ptr<Buffer> out_values; + ASSERT_OK(this->ctx_.Allocate(length * sizeof(int64_t), &out_values)); + + out_data->buffers.push_back(nullptr); + out_data->buffers.push_back(out_values); + + ASSERT_OK(kernel->Call(&this->ctx_, *arr, out_data.get())); + + // Buffer address unchanged + ASSERT_EQ(out_values.get(), out_data->buffers[1].get()); + + std::shared_ptr<Array> result, expected; + ASSERT_OK(MakeArray(out_data, &result)); + ArrayFromVector<Int64Type, int64_t>(int64(), is_valid, e1, &expected); + + AssertArraysEqual(*expected, *result); +} + } // namespace compute } // namespace arrow http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/context.h ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/context.h b/cpp/src/arrow/compute/context.h index caff2e2..051c91b 100644 --- a/cpp/src/arrow/compute/context.h +++ b/cpp/src/arrow/compute/context.h @@ -18,6 +18,7 @@ #ifndef ARROW_COMPUTE_CONTEXT_H #define ARROW_COMPUTE_CONTEXT_H +#include "arrow/memory_pool.h" #include "arrow/status.h" #include "arrow/type_fwd.h" #include "arrow/util/visibility.h" @@ -35,7 +36,7 @@ namespace compute { /// \brief Container for variables and options used by function evaluation class ARROW_EXPORT FunctionContext { public: - explicit FunctionContext(MemoryPool* pool); + explicit FunctionContext(MemoryPool* pool ARROW_MEMORY_POOL_DEFAULT); MemoryPool* memory_pool() const; /// \brief Allocate buffer from the context's memory pool http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/compute/kernel.h ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compute/kernel.h b/cpp/src/arrow/compute/kernel.h new file mode 100644 index 0000000..c72d467 --- /dev/null +++ b/cpp/src/arrow/compute/kernel.h @@ -0,0 +1,49 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_COMPUTE_KERNEL_H +#define ARROW_COMPUTE_KERNEL_H + +#include "arrow/util/visibility.h" + +namespace arrow { + +class Array; +using internal::ArrayData; + +namespace compute { + +class FunctionContext; + +/// \class OpKernel +/// \brief Base class for operator kernels +class ARROW_EXPORT OpKernel { + public: + virtual ~OpKernel() = default; +}; + +/// \class UnaryKernel +/// \brief An array-valued function of a single input argument +class ARROW_EXPORT UnaryKernel : public OpKernel { + public: + virtual Status Call(FunctionContext* ctx, const Array& input, ArrayData* out) = 0; +}; + +} // namespace compute +} // namespace arrow + +#endif // ARROW_COMPUTE_KERNEL_H http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/python/numpy_convert.h ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/python/numpy_convert.h b/cpp/src/arrow/python/numpy_convert.h index 7b3b3b7..93c4848 100644 --- a/cpp/src/arrow/python/numpy_convert.h +++ b/cpp/src/arrow/python/numpy_convert.h @@ -57,10 +57,7 @@ bool is_contiguous(PyObject* array); ARROW_EXPORT Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr<DataType>* out); -ARROW_EXPORT Status GetTensorType(PyObject* dtype, std::shared_ptr<DataType>* out); - -ARROW_EXPORT Status GetNumPyType(const DataType& type, int* type_num); ARROW_EXPORT Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/cpp/src/arrow/python/pandas_to_arrow.cc ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/python/pandas_to_arrow.cc b/cpp/src/arrow/python/pandas_to_arrow.cc index 2493779..1f96f4e 100644 --- a/cpp/src/arrow/python/pandas_to_arrow.cc +++ b/cpp/src/arrow/python/pandas_to_arrow.cc @@ -44,6 +44,9 @@ #include "arrow/util/macros.h" #include "arrow/visitor_inline.h" +#include "arrow/compute/cast.h" +#include "arrow/compute/context.h" + #include "arrow/python/builtin_convert.h" #include "arrow/python/common.h" #include "arrow/python/config.h" @@ -347,8 +350,8 @@ class PandasConverter { } BufferVector buffers = {null_bitmap_, data}; - auto arr_data = std::make_shared<ArrayData>(type_, length_, std::move(buffers), - null_count, 0); + auto arr_data = + std::make_shared<ArrayData>(type_, length_, std::move(buffers), null_count, 0); return PushArray(arr_data); } @@ -419,9 +422,11 @@ class PandasConverter { Status ConvertLists(const std::shared_ptr<DataType>& type); Status ConvertLists(const std::shared_ptr<DataType>& type, ListBuilder* builder, PyObject* list); - Status ConvertObjects(); Status ConvertDecimals(); Status ConvertTimes(); + Status ConvertObjects(); + Status ConvertObjectsInfer(); + Status ConvertObjectsInferAndCast(); protected: MemoryPool* pool_; @@ -460,22 +465,32 @@ void CopyStrided<PyObject*, PyObject*>(PyObject** input_data, int64_t length, } } +static Status CastBuffer(const std::shared_ptr<Buffer>& input, const int64_t length, + const std::shared_ptr<DataType>& in_type, + const std::shared_ptr<DataType>& out_type, MemoryPool* pool, + std::shared_ptr<Buffer>* out) { + // Must cast + std::vector<std::shared_ptr<Buffer>> buffers = {nullptr, input}; + auto tmp_data = std::make_shared<ArrayData>(in_type, length, buffers, 0); + + std::shared_ptr<Array> tmp_array, casted_array; + RETURN_NOT_OK(MakeArray(tmp_data, &tmp_array)); + + compute::FunctionContext context(pool); + compute::CastOptions cast_options; + cast_options.allow_int_overflow = false; + + RETURN_NOT_OK( + compute::Cast(&context, *tmp_array, out_type, cast_options, &casted_array)); + *out = casted_array->data()->buffers[1]; + return Status::OK(); +} + template <typename ArrowType> inline Status PandasConverter::ConvertData(std::shared_ptr<Buffer>* data) { using traits = internal::arrow_traits<ArrowType::type_id>; using T = typename traits::T; - // Handle LONGLONG->INT64 and other fun things - int type_num_compat = cast_npy_type_compat(PyArray_DESCR(arr_)->type_num); - - if (NumPyTypeSize(traits::npy_type) != NumPyTypeSize(type_num_compat)) { - std::stringstream ss; - ss << "NumPy type casts not yet implemented, type sizes differ: "; - ss << NumPyTypeSize(traits::npy_type) << " compared to " - << NumPyTypeSize(type_num_compat); - return Status::NotImplemented(ss.str()); - } - if (is_strided()) { // Strided, must copy into new contiguous memory const int64_t stride = PyArray_STRIDES(arr_)[0]; @@ -490,6 +505,15 @@ inline Status PandasConverter::ConvertData(std::shared_ptr<Buffer>* data) { // Can zero-copy *data = std::make_shared<NumPyBuffer>(reinterpret_cast<PyObject*>(arr_)); } + + std::shared_ptr<DataType> input_type; + RETURN_NOT_OK( + NumPyDtypeToArrow(reinterpret_cast<PyObject*>(PyArray_DESCR(arr_)), &input_type)); + + if (!input_type->Equals(*type_)) { + RETURN_NOT_OK(CastBuffer(*data, length_, input_type, type_, pool_, data)); + } + return Status::OK(); } @@ -845,6 +869,74 @@ Status PandasConverter::ConvertBooleans() { return Status::OK(); } +Status PandasConverter::ConvertObjectsInfer() { + Ndarray1DIndexer<PyObject*> objects; + + PyAcquireGIL lock; + objects.Init(arr_); + PyDateTime_IMPORT; + + OwnedRef decimal; + OwnedRef Decimal; + RETURN_NOT_OK(ImportModule("decimal", &decimal)); + RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal)); + + for (int64_t i = 0; i < length_; ++i) { + PyObject* obj = objects[i]; + if (PandasObjectIsNull(obj)) { + continue; + } else if (PyObject_is_string(obj)) { + return ConvertObjectStrings(); + } else if (PyObject_is_float(obj)) { + return ConvertObjectFloats(); + } else if (PyBool_Check(obj)) { + return ConvertBooleans(); + } else if (PyObject_is_integer(obj)) { + return ConvertObjectIntegers(); + } else if (PyDate_CheckExact(obj)) { + // We could choose Date32 or Date64 + return ConvertDates<Date32Type>(); + } else if (PyTime_Check(obj)) { + return ConvertTimes(); + } else if (PyObject_IsInstance(const_cast<PyObject*>(obj), Decimal.obj())) { + return ConvertDecimals(); + } else if (PyList_Check(obj) || PyArray_Check(obj)) { + std::shared_ptr<DataType> inferred_type; + RETURN_NOT_OK(InferArrowType(obj, &inferred_type)); + return ConvertLists(inferred_type); + } else { + const std::string supported_types = + "string, bool, float, int, date, time, decimal, list, array"; + std::stringstream ss; + ss << "Error inferring Arrow type for Python object array. "; + RETURN_NOT_OK(InvalidConversion(obj, supported_types, &ss)); + return Status::Invalid(ss.str()); + } + } + out_arrays_.push_back(std::make_shared<NullArray>(length_)); + return Status::OK(); +} + +Status PandasConverter::ConvertObjectsInferAndCast() { + size_t position = out_arrays_.size(); + RETURN_NOT_OK(ConvertObjectsInfer()); + + std::shared_ptr<Array> arr = out_arrays_[position]; + + // Perform cast + compute::FunctionContext context(pool_); + compute::CastOptions options; + options.allow_int_overflow = false; + + std::shared_ptr<Array> casted; + RETURN_NOT_OK(compute::Cast(&context, *arr, type_, options, &casted)); + + // Replace with casted values + out_arrays_[position] = casted; + + return Status::OK(); +} + Status PandasConverter::ConvertObjects() { // Python object arrays are annoying, since we could have one of: // @@ -858,13 +950,6 @@ Status PandasConverter::ConvertObjects() { RETURN_NOT_OK(InitNullBitmap()); - Ndarray1DIndexer<PyObject*> objects; - - PyAcquireGIL lock; - objects.Init(arr_); - PyDateTime_IMPORT; - lock.release(); - // This means we received an explicit type from the user if (type_) { switch (type_->id()) { @@ -885,53 +970,12 @@ Status PandasConverter::ConvertObjects() { case Type::DECIMAL: return ConvertDecimals(); default: - return Status::TypeError("No known conversion to Arrow type"); + return ConvertObjectsInferAndCast(); } } else { // Re-acquire GIL - lock.acquire(); - - OwnedRef decimal; - OwnedRef Decimal; - RETURN_NOT_OK(ImportModule("decimal", &decimal)); - RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal)); - - for (int64_t i = 0; i < length_; ++i) { - PyObject* obj = objects[i]; - if (PandasObjectIsNull(obj)) { - continue; - } else if (PyObject_is_string(obj)) { - return ConvertObjectStrings(); - } else if (PyObject_is_float(obj)) { - return ConvertObjectFloats(); - } else if (PyBool_Check(obj)) { - return ConvertBooleans(); - } else if (PyObject_is_integer(obj)) { - return ConvertObjectIntegers(); - } else if (PyDate_CheckExact(obj)) { - // We could choose Date32 or Date64 - return ConvertDates<Date32Type>(); - } else if (PyTime_Check(obj)) { - return ConvertTimes(); - } else if (PyObject_IsInstance(const_cast<PyObject*>(obj), Decimal.obj())) { - return ConvertDecimals(); - } else if (PyList_Check(obj) || PyArray_Check(obj)) { - std::shared_ptr<DataType> inferred_type; - RETURN_NOT_OK(InferArrowType(obj, &inferred_type)); - return ConvertLists(inferred_type); - } else { - const std::string supported_types = - "string, bool, float, int, date, time, decimal, list, array"; - std::stringstream ss; - ss << "Error inferring Arrow type for Python object array. "; - RETURN_NOT_OK(InvalidConversion(obj, supported_types, &ss)); - return Status::Invalid(ss.str()); - } - } + return ConvertObjectsInfer(); } - - out_arrays_.push_back(std::make_shared<NullArray>(length_)); - return Status::OK(); } template <typename T> http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/python/pyarrow/array.pxi ---------------------------------------------------------------------- diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi index a693f45..eec6180 100644 --- a/python/pyarrow/array.pxi +++ b/python/pyarrow/array.pxi @@ -88,6 +88,19 @@ def _normalize_slice(object arrow_obj, slice key): return arrow_obj.slice(start, stop - start) +cdef class _FunctionContext: + cdef: + unique_ptr[CFunctionContext] ctx + + def __cinit__(self): + self.ctx.reset(new CFunctionContext(c_default_memory_pool())) + +cdef _FunctionContext _global_ctx = _FunctionContext() + +cdef CFunctionContext* _context() nogil: + return _global_ctx.ctx.get() + + cdef class Array: cdef void init(self, const shared_ptr[CArray]& sp_array): @@ -99,6 +112,34 @@ cdef class Array: with nogil: check_status(DebugPrint(deref(self.ap), 0)) + def cast(self, DataType target_type, safe=True): + """ + Cast array values to another data type + + Parameters + ---------- + target_type : DataType + Type to cast to + safe : boolean, default True + Check for overflows or other unsafe conversions + + Returns + ------- + casted : Array + """ + cdef: + CCastOptions options + shared_ptr[CArray] result + + if not safe: + options.allow_int_overflow = 1 + + with nogil: + check_status(Cast(_context(), self.ap[0], target_type.sp_type, + options, &result)) + + return pyarrow_wrap_array(result) + @staticmethod def from_pandas(obj, mask=None, DataType type=None, timestamps_to_ms=False, http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/python/pyarrow/includes/libarrow.pxd ---------------------------------------------------------------------- diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index cc35684..3b7ddcf 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -706,9 +706,6 @@ cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: InputStream* stream, shared_ptr[CRecordBatch]* out) - -cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: - cdef cppclass CFeatherWriter" arrow::ipc::feather::TableWriter": @staticmethod CStatus Open(const shared_ptr[OutputStream]& stream, @@ -737,6 +734,21 @@ cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: c_string GetColumnName(int i) +cdef extern from "arrow/compute/api.h" namespace "arrow::compute" nogil: + + cdef cppclass CFunctionContext" arrow::compute::FunctionContext": + CFunctionContext() + CFunctionContext(CMemoryPool* pool) + + cdef cppclass CCastOptions" arrow::compute::CastOptions": + c_bool allow_int_overflow + + CStatus Cast(CFunctionContext* context, const CArray& array, + const shared_ptr[CDataType]& to_type, + const CCastOptions& options, + shared_ptr[CArray]* out) + + cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: shared_ptr[CDataType] GetPrimitiveType(Type type) shared_ptr[CDataType] GetTimestampType(TimeUnit unit) http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/python/pyarrow/tests/test_array.py ---------------------------------------------------------------------- diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index 5a373b4..f316417 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -211,6 +211,73 @@ def test_list_from_arrays(): assert result.equals(expected) +def _check_cast_case(case, safe=True): + in_data, in_type, out_data, out_type = case + + in_arr = pa.Array.from_pandas(in_data, type=in_type) + + casted = in_arr.cast(out_type, safe=safe) + expected = pa.Array.from_pandas(out_data, type=out_type) + assert casted.equals(expected) + + +def test_cast_integers_safe(): + safe_cases = [ + (np.array([0, 1, 2, 3], dtype='i1'), pa.int8(), + np.array([0, 1, 2, 3], dtype='i4'), pa.int32()), + (np.array([0, 1, 2, 3], dtype='i1'), pa.int8(), + np.array([0, 1, 2, 3], dtype='u4'), pa.uint16()), + (np.array([0, 1, 2, 3], dtype='i1'), pa.int8(), + np.array([0, 1, 2, 3], dtype='u1'), pa.uint8()), + (np.array([0, 1, 2, 3], dtype='i1'), pa.int8(), + np.array([0, 1, 2, 3], dtype='f8'), pa.float64()) + ] + + for case in safe_cases: + _check_cast_case(case) + + unsafe_cases = [ + (np.array([50000], dtype='i4'), pa.int32(), pa.int16()), + (np.array([70000], dtype='i4'), pa.int32(), pa.uint16()), + (np.array([-1], dtype='i4'), pa.int32(), pa.uint16()), + (np.array([50000], dtype='u2'), pa.uint16(), pa.int16()) + ] + for in_data, in_type, out_type in unsafe_cases: + in_arr = pa.Array.from_pandas(in_data, type=in_type) + + with pytest.raises(pa.ArrowInvalid): + in_arr.cast(out_type) + + +def test_cast_integers_unsafe(): + # We let NumPy do the unsafe casting + unsafe_cases = [ + (np.array([50000], dtype='i4'), pa.int32(), + np.array([50000], dtype='i2'), pa.int16()), + (np.array([70000], dtype='i4'), pa.int32(), + np.array([70000], dtype='u2'), pa.uint16()), + (np.array([-1], dtype='i4'), pa.int32(), + np.array([-1], dtype='u2'), pa.uint16()), + (np.array([50000], dtype='u2'), pa.uint16(), + np.array([50000], dtype='i2'), pa.int16()) + ] + + for case in unsafe_cases: + _check_cast_case(case, safe=False) + + +def test_cast_signed_to_unsigned(): + safe_cases = [ + (np.array([0, 1, 2, 3], dtype='i1'), pa.uint8(), + np.array([0, 1, 2, 3], dtype='u1'), pa.uint8()), + (np.array([0, 1, 2, 3], dtype='i2'), pa.uint16(), + np.array([0, 1, 2, 3], dtype='u2'), pa.uint16()) + ] + + for case in safe_cases: + _check_cast_case(case) + + def test_simple_type_construction(): result = pa.lib.TimestampType() with pytest.raises(TypeError): http://git-wip-us.apache.org/repos/asf/arrow/blob/de2edc8d/python/pyarrow/tests/test_convert_pandas.py ---------------------------------------------------------------------- diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 6442434..e98e83d 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -239,6 +239,15 @@ class TestPandasConversion(unittest.TestCase): tm.assert_frame_equal(result, ex_frame) + def test_array_from_pandas_type_cast(self): + arr = np.arange(10, dtype='int64') + + target_type = pa.int8() + + result = pa.Array.from_pandas(arr, type=target_type) + expected = pa.Array.from_pandas(arr.astype('int8')) + assert result.equals(expected) + def test_boolean_no_nulls(self): num_values = 100 @@ -279,6 +288,17 @@ class TestPandasConversion(unittest.TestCase): schema = pa.schema([field]) self._check_pandas_roundtrip(df, expected_schema=schema) + def test_all_nulls_cast_numeric(self): + arr = np.array([None], dtype=object) + + def _check_type(t): + a2 = pa.Array.from_pandas(arr, type=t) + assert a2.type == t + assert a2[0].as_py() is None + + _check_type(pa.int32()) + _check_type(pa.float64()) + def test_unicode(self): repeats = 1000 values = [u'foo', None, u'bar', u'mañana', np.nan]
