[GitHub] [arrow] cyb70289 commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
cyb70289 commented on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652237692 CI failure reproduces [[Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build](https://issues.apache.org/jira/browse/ARROW-8999) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448166294 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -684,7 +685,19 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo, return Status::Invalid("Dictionary record batch must only contain one field"); } auto dictionary = batch->column(0); - return dictionary_memo->AddDictionary(id, dictionary); + if (dictionary_batch->isDelta()) { +std::shared_ptr<Array> originalDict, combinedDict; +RETURN_NOT_OK(dictionary_memo->GetDictionary(id, &originalDict)); +ArrayVector dictsToCombine{originalDict, dictionary}; +ARROW_ASSIGN_OR_RAISE(combinedDict, Concatenate(dictsToCombine, options.memory_pool)); Review comment: Good suggestion. I have extracted it into the AddDictionaryDelta method.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448166650 ## File path: cpp/src/arrow/ipc/reader.cc ## @@ -684,7 +685,19 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo, return Status::Invalid("Dictionary record batch must only contain one field"); } auto dictionary = batch->column(0); - return dictionary_memo->AddDictionary(id, dictionary); + if (dictionary_batch->isDelta()) { +std::shared_ptr<Array> originalDict, combinedDict; +RETURN_NOT_OK(dictionary_memo->GetDictionary(id, &originalDict)); +ArrayVector dictsToCombine{originalDict, dictionary}; +ARROW_ASSIGN_OR_RAISE(combinedDict, Concatenate(dictsToCombine, options.memory_pool)); +return dictionary_memo->UpdateDictionary(id, combinedDict); + } + + if (dictionary_memo->HasDictionary(id)) { Review comment: Thanks for the good suggestion. I have added the AddOrReplaceDictionary method.
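The delta-versus-replacement logic in the quoted diff can be sketched in plain Python. This is a toy model, not Arrow's actual DictionaryMemo API: `memo` here is a hypothetical dict mapping dictionary id to a list of values.

```python
def handle_dictionary(memo, dict_id, values, is_delta):
    """Sketch of how a dictionary batch updates the memo.

    memo: hypothetical {id: [values]} mapping, standing in for DictionaryMemo.
    """
    if is_delta:
        # A delta batch appends its values to the existing dictionary
        # (the Concatenate call in the C++ diff).
        memo[dict_id] = memo[dict_id] + values
    else:
        # A non-delta batch adds a new dictionary, or replaces an existing
        # one wholesale (the AddOrReplaceDictionary path mentioned above).
        memo[dict_id] = values
    return memo
```

Under the clarified spec, a second non-delta batch for an id means replacement, while a delta batch extends the previously seen dictionary.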
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448167615 ## File path: cpp/src/arrow/ipc/read_write_test.cc ## @@ -1228,6 +1228,152 @@ TEST_P(TestFileFormat, RoundTrip) { TestZeroLengthRoundTrip(*GetParam(), options); } +Status MakeDictionaryBatch(std::shared_ptr<RecordBatch>* out) { + const int64_t length = 6; + + std::vector<bool> is_valid = {true, true, false, true, true, true}; + + auto dict_ty = utf8(); + + auto dict = ArrayFromJSON(dict_ty, "[\"foo\", \"bar\", \"baz\"]"); + + auto f0_type = arrow::dictionary(arrow::int32(), dict_ty); + auto f1_type = arrow::dictionary(arrow::int8(), dict_ty); + + std::shared_ptr<Array> indices0, indices1; + std::vector<int32_t> indices0_values = {1, 2, -1, 0, 2, 0}; + std::vector<int8_t> indices1_values = {0, 0, 2, 2, 1, 1}; + + ArrayFromVector<Int32Type, int32_t>(is_valid, indices0_values, &indices0); + ArrayFromVector<Int8Type, int8_t>(is_valid, indices1_values, &indices1); + + auto a0 = std::make_shared<DictionaryArray>(f0_type, indices0, dict); + auto a1 = std::make_shared<DictionaryArray>(f1_type, indices1, dict); + + // construct batch + auto schema = ::arrow::schema({field("dict1", f0_type), field("dict2", f1_type)}); + + *out = RecordBatch::Make(schema, length, {a0, a1}); + return Status::OK(); +} + +// A record batch writer implementation that supports manually specifying dictionaries. +class TestRecordBatchWriter : public RecordBatchWriter { Review comment: You are right. This is not a RecordBatchWriter, as it performs both read and write. I have renamed it to DictionaryBatchHelper, and made it no longer inherit from RecordBatchWriter.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448168519 ## File path: cpp/src/arrow/ipc/metadata_internal.h ## @@ -198,7 +198,7 @@ Status WriteDictionaryMessage( const int64_t id, const int64_t length, const int64_t body_length, const std::shared_ptr<const KeyValueMetadata>& custom_metadata, const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers, -const IpcWriteOptions& options, std::shared_ptr<Buffer>* out); +const IpcWriteOptions& options, std::shared_ptr<Buffer>* out, bool isDelta); Review comment: Sounds reasonable. I have revised the code to place it after the id argument. (I have also revised the other APIs accordingly, to keep them consistent.)
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448168067 ## File path: cpp/src/arrow/ipc/read_write_test.cc ## @@ -1228,6 +1228,152 @@ TEST_P(TestFileFormat, RoundTrip) { TestZeroLengthRoundTrip(*GetParam(), options); } +Status MakeDictionaryBatch(std::shared_ptr<RecordBatch>* out) { + const int64_t length = 6; + + std::vector<bool> is_valid = {true, true, false, true, true, true}; + + auto dict_ty = utf8(); + + auto dict = ArrayFromJSON(dict_ty, "[\"foo\", \"bar\", \"baz\"]"); + + auto f0_type = arrow::dictionary(arrow::int32(), dict_ty); + auto f1_type = arrow::dictionary(arrow::int8(), dict_ty); + + std::shared_ptr<Array> indices0, indices1; + std::vector<int32_t> indices0_values = {1, 2, -1, 0, 2, 0}; + std::vector<int8_t> indices1_values = {0, 0, 2, 2, 1, 1}; + + ArrayFromVector<Int32Type, int32_t>(is_valid, indices0_values, &indices0); + ArrayFromVector<Int8Type, int8_t>(is_valid, indices1_values, &indices1); Review comment: Thanks for the good suggestion. I have revised the code to use ArrayFromJSON, and it looks simpler now.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448168805 ## File path: cpp/src/arrow/ipc/writer.h ## @@ -341,6 +343,29 @@ class ARROW_EXPORT IpcPayloadWriter { virtual Status Close() = 0; }; +/// Create a new IPC payload stream writer from stream sink. User is +/// responsible for closing the actual OutputStream. +/// +/// \param[in] sink output stream to write to +/// \param[in] options options for serialization +/// \return Result<std::unique_ptr<RecordBatchWriter>> +ARROW_EXPORT +Result<std::unique_ptr<RecordBatchWriter>> NewPayloadStreamWriter( +io::OutputStream* sink, const IpcWriteOptions& options = IpcWriteOptions::Defaults()); + +/// Create a new IPC payload file writer from stream sink. +/// +/// \param[in] sink output stream to write to +/// \param[in] schema the schema of the record batches to be written +/// \param[in] options options for serialization, optional +/// \param[in] metadata custom metadata for File Footer, optional +/// \return Status +ARROW_EXPORT +Result<std::unique_ptr<RecordBatchWriter>> NewPayloadFileWriter( Review comment: Revised. Thank you.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on a change in pull request #7544: URL: https://github.com/apache/arrow/pull/7544#discussion_r448168706 ## File path: cpp/src/arrow/ipc/writer.h ## @@ -341,6 +343,29 @@ class ARROW_EXPORT IpcPayloadWriter { virtual Status Close() = 0; }; +/// Create a new IPC payload stream writer from stream sink. User is +/// responsible for closing the actual OutputStream. +/// +/// \param[in] sink output stream to write to +/// \param[in] options options for serialization +/// \return Result<std::unique_ptr<RecordBatchWriter>> +ARROW_EXPORT +Result<std::unique_ptr<RecordBatchWriter>> NewPayloadStreamWriter( Review comment: Revised. Thank you.
[GitHub] [arrow] xhochy commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
xhochy commented on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652244376 > You match a regex, you don't contain it. That is one of the name clashes, we already have a match kernel.
[GitHub] [arrow] pitrou commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
pitrou commented on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652244689 `match_regex` then? :-)
[GitHub] [arrow] jorisvandenbossche commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
jorisvandenbossche commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-652245303 > the null equality tests look like a nuisance for regular Python usage (`__eq__` should return a boolean) The reason those scalars return null on an equality check, rather than True/False, is to ensure consistent behaviour between arrays and scalars (so that `(arr1 == arr2)[0]` and `arr1[0] == arr2[0]` give the same answer). Our element-wise equality propagates nulls, so IMO the scalar should too.
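The propagation rule being described can be sketched in plain Python. This is a toy model of the semantics, not pyarrow's implementation; `None` stands in for a null element.

```python
def elementwise_eq(a, b):
    # Null (None) propagates: comparing against a null yields null,
    # never True or False.
    return [None if x is None or y is None else x == y
            for x, y in zip(a, b)]
```

Under this rule `(arr1 == arr2)[0]` and `arr1[0] == arr2[0]` agree: both are null whenever either input element is null.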
[GitHub] [arrow] pitrou commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
pitrou commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-652247947 The problem is that it breaks Python semantics in potentially annoying places:

```python
>>> import pyarrow as pa
>>> na = pa.scalar(None)
>>> na in [5]
True
```
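The surprising `in` result follows from how `list.__contains__` works: it compares each element with `==` and coerces the result with `bool()`, and an arbitrary object is truthy by default. A minimal toy class (not pyarrow's actual Scalar) reproduces the effect:

```python
class NullScalar:
    """Toy scalar whose __eq__ propagates null instead of returning a bool."""
    def __eq__(self, other):
        return NullScalar()

na = NullScalar()
# list.__contains__ evaluates bool(5 == na); int.__eq__ returns
# NotImplemented, the reflected NullScalar.__eq__ returns a truthy
# object, so membership "succeeds" even though nothing matched.
print(na in [5])  # True
```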
[GitHub] [arrow] jorisvandenbossche commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
jorisvandenbossche commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-652249133 We could "fix" that one by raising in `__bool__` (meaning: it will at least give an error instead of silently returning a wrong answer)
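Raising in `__bool__` turns the silent wrong answer into an explicit error, at the cost of breaking anything that truth-tests the scalar, including `in`. A toy sketch, again not pyarrow's actual classes:

```python
class NullScalar:
    def __eq__(self, other):
        return NullScalar()   # equality still propagates null

    def __bool__(self):
        # Truth-testing a null is ambiguous, so refuse to answer.
        raise TypeError("truth value of a null scalar is ambiguous")

try:
    NullScalar() in [5]       # __contains__ calls bool(...) and now raises
except TypeError as exc:
    print("raised:", exc)
```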
[GitHub] [arrow] maartenbreddels commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
maartenbreddels commented on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652254096 I like the prefixing by `string`. I'm a big fan of ordering 'words' in snake or camel casing for good tab completion and alphabetic ordering, so I agree with @wesm 's proposal. As a user, I first think of the type I work with ('string'), next, what I want to do with it (upper/lower casing). Tab-completion/ordering would then reveal 2 variants, very intuitive. To make this concrete: * string_lower_utf8 * string_lower_ascii * binary_contains_exact * string_contains_regex_ascii * string_contains_regex_utf8 (or s/contains/match 😄 ) There might be kernels that work on binary data, but do not work well with utf8, e.g. a `binary_slice`, which would slice each binary string on a byte basis. It could cut a multibyte-encoded codepoint and produce an invalid utf8 string. I'd say that's acceptable, and we could check for that depending on the type we get. Regarding case-insensitive variants, I think we should expose functions to do normalization (e.g. replace the single codepoint ë by the letter e and the diaeresis combining character (the dots above the ë)), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics.
[GitHub] [arrow] maartenbreddels edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
maartenbreddels edited a comment on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652254096 I like the prefixing by `string`. I'm a big fan of ordering 'words' in snake or camel casing for good tab completion and alphabetic ordering, so I agree with @wesm 's proposal. As a user, I first think of the type I work with ('string'), next, what I want to do with it (upper/lower casing). Tab-completion/ordering would then reveal 2 variants, very intuitive. To make this concrete: * string_lower_utf8 * string_lower_ascii * binary_contains_exact * string_contains_regex_ascii * string_contains_regex_utf8 (or s/contains/match 😄 ) There might be kernels that work on binary data, but do not work well with utf8, e.g. a `binary_slice`, which would slice each binary string on a byte basis. It could cut a multibyte-encoded codepoint and produce an invalid utf8 string. I'd say that's acceptable, and we could check for that depending on the type we get. Regarding case-insensitive variants, I think we should expose functions to do normalization (e.g. replace the single codepoint ë by the letter e and the diaeresis combining character (the dots above the ë)), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics.
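The normalization and combining-character removal described here can be sketched with Python's standard library `unicodedata` module: NFD decomposition splits a precomposed codepoint into its base letter plus combining marks, which can then be filtered out by category (`Mn`, nonspacing mark).

```python
import unicodedata

def strip_diacritics(s):
    # NFD splits 'ë' into 'e' + U+0308 (combining diaeresis); dropping
    # the combining characters leaves only the base letters, enabling
    # pattern matching without diacritics.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")

print(strip_diacritics("ëlan"))  # elan
```

Case folding (`str.casefold()` in Python) is a separate step and would typically be applied before or after this for case-insensitive matching.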
[GitHub] [arrow] pitrou commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
pitrou commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-652256345 Yes, we could. That may have other annoying implications, though (such as `__contains__` not working anymore). I've started an ML discussion.
[GitHub] [arrow] pitrou closed pull request #7594: ARROW-7654: [Python] Ability to set column_types to a Schema in csv.ConvertOptions is undocumented
pitrou closed pull request #7594: URL: https://github.com/apache/arrow/pull/7594
[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion
emkornfield commented on pull request #7604: URL: https://github.com/apache/arrow/pull/7604#issuecomment-652259406 CC @wesm @BryanCutler Not sure if doing this correctly will break Spark in unintentional ways.
[GitHub] [arrow] emkornfield opened a new pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion
emkornfield opened a new pull request #7604: URL: https://github.com/apache/arrow/pull/7604 - Ports string_to_timezone to C++ - Causes nested timestamp columns within structs to use conversion to object path. - Copy timezone on to_object path. Open to other suggestions on how to solve this.
[GitHub] [arrow] pitrou commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
pitrou commented on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652260499 You can't compute latency like that if the test is multithreaded.
[GitHub] [arrow] kou commented on a change in pull request #7589: ARROW-9276: [Release] Enforce CUDA device for updating the api documentations
kou commented on a change in pull request #7589: URL: https://github.com/apache/arrow/pull/7589#discussion_r448188490 ## File path: dev/release/post-09-docs.sh ## @@ -42,20 +47,20 @@ popd pushd "${ARROW_DIR}" git checkout "${release_tag}" Review comment: FYI: We don't need a CUDA machine if we don't need to run CUDA-enabled programs. We're building CUDA-enabled deb packages on a non-CUDA machine.
[GitHub] [arrow] github-actions[bot] commented on pull request #7604: ARROW-9223: [Python] Propagate Timzone information in pandas conversion
github-actions[bot] commented on pull request #7604: URL: https://github.com/apache/arrow/pull/7604#issuecomment-652262316 https://issues.apache.org/jira/browse/ARROW-9223
[GitHub] [arrow] maartenbreddels commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
maartenbreddels commented on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652263794 While at the difficult topic of naming, is there a convention (agreed or emerging) for naming the functors/ArrayKernelExec implementations? I see that in `scalar_string.cc` we use Transform, which is probably redundant (as I understand it, scalar kernels basically map/transform elementwise). But I think it would be useful to include the type name, e.g. * Utf8ToUtf8 (which is now Utf8Transform) (hence `Utf8Upper : Utf8ToUtf8>`) * Utf8ToBool / Utf8Predicate (which is now StringBoolTransform) * Utf8ToScalar * BinaryToBool
[GitHub] [arrow] cyb70289 commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
cyb70289 commented on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652289408 > You can't compute latency like that if the test is multithreaded. Assume two threads, one finishes `n1` IO in `t1` time, another finishes `n2` IO in `t2` time. So the average latency should be `(t1/n1 + t2/n2) / 2`, not `(t1+t2) / (n1+n2)`?
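The two candidate formulas genuinely differ whenever the threads complete different numbers of operations, which is the crux of the question. With made-up numbers:

```python
# Hypothetical figures: thread 1 does 100 ops in 1 s, thread 2 does 10 ops in 1 s.
t1, n1 = 1.0, 100
t2, n2 = 1.0, 10

# Mean of per-thread latencies: each thread weighted equally.
mean_of_per_thread = (t1 / n1 + t2 / n2) / 2   # (0.01 + 0.1) / 2 = 0.055 s

# Pooled ratio: each operation weighted equally, so the fast
# thread's many cheap ops dominate.
pooled = (t1 + t2) / (n1 + n2)                  # 2 / 110 ≈ 0.0182 s
```

Neither is "the" latency; the pooled ratio estimates the mean per-operation latency across all operations, while the per-thread mean gives slow threads proportionally more weight.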
[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
rymurr commented on pull request #7347: URL: https://github.com/apache/arrow/pull/7347#issuecomment-652291140 Thanks a lot @liyafan82, I have addressed your suggestions and rebased.
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448222516 ## File path: java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java ## @@ -73,6 +87,28 @@ public static void setBit(ArrowBuf validityBuffer, int index) { validityBuffer.setByte(byteIndex, currentByte); } + /** + * Set the bit at provided index to 1. + * + * @param validityBuffer validity buffer of the vector + * @param index index to be set + */ + public static void setBit(ArrowBuf validityBuffer, long index) { +// it can be observed that some logic is duplicate of the logic in setValidityBit. Review comment: done
[GitHub] [arrow] tobim commented on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow
tobim commented on pull request #7315: URL: https://github.com/apache/arrow/pull/7315#issuecomment-652292544 > CMake 3.9 is a bit problematic since we've tried to support CMake >= 3.2 and definitely >= 3.7 In that case the other approach should be pursued.
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448224180 ## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java ## @@ -0,0 +1,991 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.arrow.vector.complex; + +import static java.util.Collections.singletonList; +import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt; +import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt; +import static org.apache.arrow.util.Preconditions.checkNotNull; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.memory.util.ArrowBufPointer; +import org.apache.arrow.memory.util.ByteFunctionHelpers; +import org.apache.arrow.memory.util.CommonUtil; +import org.apache.arrow.memory.util.hash.ArrowBufHasher; +import org.apache.arrow.util.Preconditions; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.BaseFixedWidthVector; +import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.BaseVariableWidthVector; +import org.apache.arrow.vector.BitVectorHelper; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.DensityAwareVector; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullVector; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.compare.VectorVisitor; +import org.apache.arrow.vector.complex.impl.ComplexCopier; +import org.apache.arrow.vector.complex.impl.UnionLargeListReader; +import org.apache.arrow.vector.complex.impl.UnionLargeListWriter; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.ipc.message.ArrowFieldNode; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.OversizedAllocationException; +import org.apache.arrow.vector.util.SchemaChangeRuntimeException; +import org.apache.arrow.vector.util.TransferPair; + +/** + * A list vector contains lists of a specific type of elements. Its structure contains 3 elements. + * + * A validity buffer. + * An offset buffer, that denotes lists boundaries. + * A child data vector that contains the elements of lists. + * + * + * This is the LargeList variant of list, it has a 64-bit wide offset + * + * + * todo review checkedCastToInt usage in this class. + * Once int64 indexed vectors are supported these checks aren't needed. + * + */ +public class LargeListVector extends BaseValueVector implements RepeatedValueVector, BaseListVector, PromotableVector { + + public static LargeListVector empty(String name, BufferAllocator allocator) { +return new LargeListVector(name, allocator, FieldType.nullable(ArrowType.LargeList.INSTANCE), null); + } + + public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; + public static final String DATA_VECTOR_NAME = "$data$"; + + public static final byte OFFSET_WIDTH = 8; + protected ArrowBuf offsetBuffer; + protected FieldVector vector; + protected final CallBack callBack; + protected int valueCount; + protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * OFFSET_WIDTH; + private final String name; + + protected String defaultDataVectorName = DATA_VECTOR_NAME; + protected ArrowBuf validityBuffer; + protected UnionLargeListReader reader; + private final FieldType fieldType; + private int validityAllocationSizeInBytes; + + /** + * The maximum index that is actually set. + */ + private long lastSet; + + /** + * Constructs a new instance. + * + * @param name The name of
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448223616 ## File path: java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java ## @@ -37,6 +37,20 @@ private BitVectorHelper() {} + /** + * Get the index of byte corresponding to bit index in validity buffer. + */ + public static long byteIndex(long absoluteBitIndex) { +return absoluteBitIndex >> 3; + } + + /** + * Get the relative index of bit within the byte in validity buffer. + */ + public static long bitIndex(long absoluteBitIndex) { Review comment: done ## File path: java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java ## @@ -57,12 +71,12 @@ public static int bitIndex(int absoluteBitIndex) { * @param validityBuffer validity buffer of the vector * @param index index to be set */ - public static void setBit(ArrowBuf validityBuffer, int index) { // it can be observed that some logic is duplicate of the logic in setValidityBit. // this is because JIT cannot always remove the if branch in setValidityBit, // so we give a dedicated implementation for setting bits. -final int byteIndex = byteIndex(index); -final int bitIndex = bitIndex(index); + public static void setBit(ArrowBuf validityBuffer, long index) { +final long byteIndex = byteIndex(index); +final long bitIndex = bitIndex(index); Review comment: done
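The helpers in the quoted diff implement standard bit addressing into a validity buffer; the arithmetic can be sketched in Python (mirroring, not reproducing, the Java code, and assuming the least-significant-bit-first layout Arrow validity bitmaps use):

```python
def byte_index(absolute_bit_index):
    # Which byte of the validity buffer holds this bit (divide by 8).
    return absolute_bit_index >> 3

def bit_index(absolute_bit_index):
    # Position of the bit within that byte (remainder modulo 8).
    return absolute_bit_index & 7

def set_bit(buf, index):
    # Set bit `index` to 1 in a mutable byte buffer such as a bytearray.
    buf[byte_index(index)] |= 1 << bit_index(index)
```

Widening the parameters from `int` to `long` in the Java version is what allows these helpers to address bits beyond the 2^31 range needed by 64-bit-offset vectors like LargeListVector.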
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448225368

## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
## @@ -0,0 +1,991 @@

+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.complex;
+
+import static java.util.Collections.singletonList;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt;
+import static org.apache.arrow.util.Preconditions.checkNotNull;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.memory.util.ArrowBufPointer;
+import org.apache.arrow.memory.util.ByteFunctionHelpers;
+import org.apache.arrow.memory.util.CommonUtil;
+import org.apache.arrow.memory.util.hash.ArrowBufHasher;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.AddOrGetResult;
+import org.apache.arrow.vector.BaseFixedWidthVector;
+import org.apache.arrow.vector.BaseValueVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.BitVectorHelper;
+import org.apache.arrow.vector.BufferBacked;
+import org.apache.arrow.vector.DensityAwareVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.NullVector;
+import org.apache.arrow.vector.UInt4Vector;
+import org.apache.arrow.vector.ValueVector;
+import org.apache.arrow.vector.ZeroVector;
+import org.apache.arrow.vector.compare.VectorVisitor;
+import org.apache.arrow.vector.complex.impl.ComplexCopier;
+import org.apache.arrow.vector.complex.impl.UnionLargeListReader;
+import org.apache.arrow.vector.complex.impl.UnionLargeListWriter;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
+import org.apache.arrow.vector.types.Types.MinorType;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.Field;
+import org.apache.arrow.vector.types.pojo.FieldType;
+import org.apache.arrow.vector.util.CallBack;
+import org.apache.arrow.vector.util.JsonStringArrayList;
+import org.apache.arrow.vector.util.OversizedAllocationException;
+import org.apache.arrow.vector.util.SchemaChangeRuntimeException;
+import org.apache.arrow.vector.util.TransferPair;
+
+/**
+ * A list vector contains lists of a specific type of elements. Its structure contains three elements:
+ * a validity buffer, an offset buffer that denotes list boundaries, and a child data vector that
+ * contains the elements of the lists.
+ *
+ * This is the LargeList variant of list; it has a 64-bit wide offset.
+ *
+ * TODO: review checkedCastToInt usage in this class.
+ * Once int64-indexed vectors are supported these checks aren't needed.
+ */
+public class LargeListVector extends BaseValueVector implements RepeatedValueVector, BaseListVector, PromotableVector {
+
+  public static LargeListVector empty(String name, BufferAllocator allocator) {
+    return new LargeListVector(name, allocator, FieldType.nullable(ArrowType.LargeList.INSTANCE), null);
+  }
+
+  public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE;
+  public static final String DATA_VECTOR_NAME = "$data$";
+
+  public static final byte OFFSET_WIDTH = 8;
+  protected ArrowBuf offsetBuffer;
+  protected FieldVector vector;
+  protected final CallBack callBack;
+  protected int valueCount;
+  protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * OFFSET_WIDTH;
+  private final String name;
+
+  protected String defaultDataVectorName = DATA_VECTOR_NAME;
+  protected ArrowBuf validityBuffer;
+  protected UnionLargeListReader reader;
+  private final FieldType fieldType;
+  private int validityAllocationSizeInBytes;
+
+  /**
+   * The maximum index that is actually set.
+   */
+  private long lastSet;
+
+  /**
+   * Constructs a new instance.
+   *
+   * @param name The name of
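The field declarations above set OFFSET_WIDTH = 8 and size the offset buffer as INITIAL_VALUE_ALLOCATION * OFFSET_WIDTH, i.e. one 8-byte offset per value, twice the footprint of the non-large ListVector's 4-byte offsets. The sketch below shows only that sizing arithmetic, under the assumption that one terminal offset is stored per buffer; the real vector additionally rounds allocations, which this demo ignores.

```java
// Offset-buffer sizing sketch: lists need valueCount + 1 offsets, each
// OFFSET_WIDTH bytes wide. Names here are illustrative, not Arrow's API.
public class OffsetSizingDemo {
  public static final long LARGE_OFFSET_WIDTH = 8; // LargeListVector
  public static final long SMALL_OFFSET_WIDTH = 4; // ListVector

  // Bytes needed to hold offsets for `valueCount` lists plus the terminal offset.
  public static long offsetBufferSize(long valueCount, long offsetWidth) {
    return (valueCount + 1) * offsetWidth;
  }

  public static void main(String[] args) {
    long n = 1_000_000;
    System.out.println(offsetBufferSize(n, LARGE_OFFSET_WIDTH)); // prints "8000008"
    System.out.println(offsetBufferSize(n, SMALL_OFFSET_WIDTH)); // prints "4000004"
  }
}
```

The doubled offset cost is the price of letting a single vector address more than 2^31 - 1 child elements.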
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448225742

## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
## @@ -0,0 +1,991 @@ (quotes the same LargeListVector.java hunk as the comment above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448226183

## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
## @@ -0,0 +1,991 @@ (quotes the same LargeListVector.java hunk as the comment above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448226339

## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
## @@ -0,0 +1,991 @@ (quotes the same LargeListVector.java hunk as the comment above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448226970

## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
## @@ -0,0 +1,991 @@ (quotes the same LargeListVector.java hunk as the comment above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448226505

## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
## @@ -0,0 +1,991 @@ (quotes the same LargeListVector.java hunk as the comment above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448227452 ## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java ## @@ -0,0 +1,991 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.arrow.vector.complex; + +import static java.util.Collections.singletonList; +import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt; +import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt; +import static org.apache.arrow.util.Preconditions.checkNotNull; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.memory.util.ArrowBufPointer; +import org.apache.arrow.memory.util.ByteFunctionHelpers; +import org.apache.arrow.memory.util.CommonUtil; +import org.apache.arrow.memory.util.hash.ArrowBufHasher; +import org.apache.arrow.util.Preconditions; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.BaseFixedWidthVector; +import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.BaseVariableWidthVector; +import org.apache.arrow.vector.BitVectorHelper; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.DensityAwareVector; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullVector; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.compare.VectorVisitor; +import org.apache.arrow.vector.complex.impl.ComplexCopier; +import org.apache.arrow.vector.complex.impl.UnionLargeListReader; +import org.apache.arrow.vector.complex.impl.UnionLargeListWriter; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.ipc.message.ArrowFieldNode; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import 
org.apache.arrow.vector.types.pojo.FieldType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.OversizedAllocationException; +import org.apache.arrow.vector.util.SchemaChangeRuntimeException; +import org.apache.arrow.vector.util.TransferPair; + +/** + * A list vector contains lists of a specific type of elements. Its structure contains 3 elements. + * + * A validity buffer. + * An offset buffer, that denotes lists boundaries. + * A child data vector that contains the elements of lists. + * + * + * This is the LargeList variant of list, it has a 64-bit wide offset + * + * + * todo review checkedCastToInt usage in this class. + * Once int64 indexed vectors are supported these checks aren't needed. + * + */ +public class LargeListVector extends BaseValueVector implements RepeatedValueVector, BaseListVector, PromotableVector { + + public static LargeListVector empty(String name, BufferAllocator allocator) { +return new LargeListVector(name, allocator, FieldType.nullable(ArrowType.LargeList.INSTANCE), null); + } + + public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; + public static final String DATA_VECTOR_NAME = "$data$"; + + public static final byte OFFSET_WIDTH = 8; + protected ArrowBuf offsetBuffer; + protected FieldVector vector; + protected final CallBack callBack; + protected int valueCount; + protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * OFFSET_WIDTH; + private final String name; + + protected String defaultDataVectorName = DATA_VECTOR_NAME; + protected ArrowBuf validityBuffer; + protected UnionLargeListReader reader; + private final FieldType fieldType; + private int validityAllocationSizeInBytes; + + /** + * The maximum index that is actually set. + */ + private long lastSet; + + /** + * Constructs a new instance. + * + * @param name The name of
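The three-part layout described in the class Javadoc (validity buffer, offset buffer, child data vector) can be sketched in plain Python. This is an illustration with made-up data only, not Arrow's actual memory representation:

```python
# Logical value: [[1, 2], None, [3, 4, 5]]
validity = [1, 0, 1]      # one validity bit per list
offsets = [0, 2, 2, 5]    # list boundaries; 64-bit entries in LargeListVector
values = [1, 2, 3, 4, 5]  # child data vector holding all elements

def get_list(i):
    # A null list has its validity bit cleared; its offsets span zero elements.
    if not validity[i]:
        return None
    return values[offsets[i]:offsets[i + 1]]

print([get_list(i) for i in range(3)])  # [[1, 2], None, [3, 4, 5]]
```

In this picture the only difference between `ListVector` and `LargeListVector` is the width of each offset entry: 32-bit versus 64-bit, hence `OFFSET_WIDTH = 8` above.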
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448227944 ## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java ## (quotes the same LargeListVector.java hunk shown above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448227771 ## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java ## (quotes the same LargeListVector.java hunk shown above)
[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-652298958 I wrote a benchmark that measures the performance of converting a Tensor to a SparseTensor, ran it with `--repetitions=10`, and got the following results.
- Converting to SparseCOOTensor is 1.4x-1.8x slower than the original.
- Converting to SparseCSRMatrix and SparseCSCMatrix is 1.0x-1.2x faster than the original.
- Converting to SparseCSFTensor is 1.0x-1.3x slower than the original.
I don't think this result is acceptable, especially in the SparseCOOTensor case. Now I'm trying to resolve this performance regression. The full result is shown below:
Format IndexType ValueType Change % Baseline Contender
COO Int32Type Int8 56.426 2459.397090 3847.138775
COO Int16Type Int8 43.062 2619.848028 3748.014603
COO Int64Type Int8 40.137 2850.826136 3995.057821
COO Int8Type Int8 35.478 2776.716694 3761.850287
COO Int16Type Int16 66.812 2394.011928 3993.489611
COO Int32Type Int16 59.281 2728.577337 4346.096797
COO Int64Type Int16 52.043 2596.404005 3947.663313
COO Int8Type Int16 44.510 2768.567992 4000.853968
COO Int32Type Float 93.283 2287.343291 4421.039474
COO Int16Type Float 71.100 2577.689569 4410.437858
COO Int64Type Float 61.035 2821.300996 4543.277655
COO Int8Type Float 53.303 2666.801559 4088.291362
COO Int64Type Double 88.230 2616.000738 4924.096764
COO Int16Type Double 70.188 2698.335495 4592.253023
COO Int32Type Double 69.368 2579.753902 4369.267270
COO Int8Type Double 62.781 2747.179806 4471.895533
CSR Int8Type Int8 -1.660 4626.181791 4549.389657
CSR Int32Type Int8 -4.718 4708.044556 4485.938002
CSR Int64Type Int8 -4.905 4687.961132 4458.024004
CSR Int16Type Int8 -7.164 4851.803479 4504.201161
CSR Int32Type Int16 -1.481 4722.676795 4652.711030
CSR Int64Type Int16 -5.853 4573.124314 4305.462839
CSR Int8Type Int16 -11.890 4631.290389 4080.628990
CSR Int16Type Int16 -12.372 4737.320600 4151.197839
CSR Int64Type Float -4.999 4711.575565 4476.032868
CSR Int16Type Float -5.062 4811.457037 4567.909440
CSR Int32Type Float -7.651 4814.813084 4446.432341
CSR Int8Type Float -12.196 5158.868330 4529.676101
CSR Int32Type Double 0.384 4631.005506 4648.789489
CSR Int64Type Double -0.273 4816.669765 4803.522209
CSR Int8Type Double -3.453 4646.652774 4486.217718
CSR Int16Type Double -8.963 4952.032110 4508.161865
CSC Int8Type Int8 -4.207 4871.793338 4666.848429
CSC Int64Type Int8 -10.582 4571.251613 4087.509465
CSC Int16Type Int8 -13.189 4865.666295 4223.934008
CSC Int32Type Int8 -16.546 4999.577506 4172.360711
CSC Int32Type Int16 -0.619 4842.515715 4812.531287
CSC Int8Type Int16 -4.128 4689.737235 4496.135413
CSC Int16Type Int16 -8.110 4643.413441 4266.836346
CSC Int64Type Int16 -10.302 4971.879449 4459.696887
CSC Int8Type Float -0.547 4694.347970 4668.667817
CSC Int32Type Float -0.619 4718.207593 4688.994670
CSC Int64Type Float -3.360 4847.753893 4684.849545
CSC Int16Type Float -7.408 4956.795348 4589.607722
CSC Int32Type Double 0.491 4917.172760 4941.321519
CSC Int64Type Double -0.754 5069.589290 5031.376333
CSC Int16Type Double -2.023 4762.425910 4666.071367
CSC Int8Type Double -5.987 4934.016268 4638.636384
CSF Int16Type Int8 25.394 7885.986312 9888.571309
CSF Int64Type Int8 22.605 9083.444559 11136.716151
CSF Int32Type Int8 13.486 8228.377064 9338.031561
CSF Int8Type Int8 11.900 7687.347967 8602.168534
CSF Int64Type Int16 28.094 8247.255772 10564.280045
CSF Int8Type Int16 20.276 8172.411486 9829.454135
CSF Int32Type In
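Assuming the Change % column is the relative difference of the contender over the baseline (an interpretation that matches the first COO row, with positive values meaning slower), each entry can be reproduced as:

```python
def change_percent(baseline, contender):
    # Positive result: contender is slower (higher per-iteration cost) than baseline.
    return (contender - baseline) / baseline * 100.0

# First COO row above: baseline 2459.397090, contender 3847.138775
print(round(change_percent(2459.397090, 3847.138775), 3))  # 56.426
```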
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448232201 ## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java ## (quotes the same LargeListVector.java hunk shown above)
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448233329 ## File path: java/vector/src/main/java/org/apache/arrow/vector/ipc/JsonFileReader.java ## @@ -564,8 +564,10 @@ private ArrowBuf readIntoBuffer(BufferAllocator allocator, BufferType bufferType } else if (bufferType.equals(OFFSET)) { if (type == Types.MinorType.LARGEVARCHAR || type == Types.MinorType.LARGEVARBINARY) { Review comment: thanks, done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] cyb70289 edited a comment on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
cyb70289 edited a comment on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652289408 > You can't compute latency like that if the test is multithreaded. Assume two threads: one finishes `n1` I/O operations in time `t1`, another finishes `n2` I/O operations in time `t2`. Should the average latency then be `(t1/n1 + t2/n2) / 2` rather than `(t1+t2) / (n1+n2)`? Or something else?
[GitHub] [arrow] pitrou commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
pitrou commented on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652307944 IMHO it should be `(t1+t2) / (n1+n2)`. Is it what the code is doing?
[GitHub] [arrow] cyb70289 commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
cyb70289 commented on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652311392 > IMHO it should be `(t1+t2) / (n1+n2)`. Is it what the code is doing? Yes, this is what the code is doing.
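The distinction debated above can be checked with hypothetical numbers: the two formulas agree only when every thread completes the same number of operations.

```python
# Two hypothetical client threads with different throughput.
t1, n1 = 2.0, 100  # thread 1: 100 round trips in 2.0 s (20 ms each)
t2, n2 = 2.0, 400  # thread 2: 400 round trips in 2.0 s (5 ms each)

mean_of_thread_means = (t1 / n1 + t2 / n2) / 2  # weighs threads equally
pooled_mean = (t1 + t2) / (n1 + n2)             # weighs round trips equally

print(mean_of_thread_means)  # 0.0125 -> 12.5 ms
print(pooled_mean)           # 0.008  -> 8 ms
```

`(t1+t2) / (n1+n2)` is the per-operation average latency; it naturally gives faster threads more weight because they contribute more operations.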
[GitHub] [arrow] pitrou commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark
pitrou commented on pull request #7603: URL: https://github.com/apache/arrow/pull/7603#issuecomment-652316126 Ok, then I was mistaken. Sorry :-)
[GitHub] [arrow] liyafan82 commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on pull request #7347: URL: https://github.com/apache/arrow/pull/7347#issuecomment-652319908 > Thanks a lot @liyafan82 I have addressed your suggestions and rebased @rymurr Thanks for your work. Will merge when it turns green.
[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on a change in pull request #7275: URL: https://github.com/apache/arrow/pull/7275#discussion_r448255547 ## File path: java/vector/src/main/codegen/templates/UnionLargeListWriter.java ## @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import org.apache.arrow.memory.ArrowBuf; +import org.apache.arrow.vector.complex.writer.DecimalWriter; +import org.apache.arrow.vector.holders.DecimalHolder; + +import java.lang.UnsupportedOperationException; +import java.math.BigDecimal; + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionLargeListWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt; +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ + +@SuppressWarnings("unused") +public class UnionLargeListWriter extends AbstractFieldWriter { Review comment: agreed, I fixed this...was just being lazy ;-) This is an automated message from the Apache Git Service. 
[GitHub] [arrow] rymurr commented on pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++
rymurr commented on pull request #7275: URL: https://github.com/apache/arrow/pull/7275#issuecomment-652323175 > Thanks for working on this @rymurr ! Apologies for taking so long to review. It looks pretty good, but I saw what looked like inconsistencies in the `LargeListVector` APIs using ints vs longs to me, and otherwise only minor things to fix up. Thanks a lot for the thorough review. I have fixed up everything you mentioned. It appears some of the confusion was related to the recently merged changes for 64-bit allocations, and the rest was my ignorance! I have pushed a change with all your recommended fixes.
[GitHub] [arrow] pitrou opened a new pull request #7605: ARROW-9283: [Python] Expose build info
pitrou opened a new pull request #7605: URL: https://github.com/apache/arrow/pull/7605
[GitHub] [arrow] pitrou commented on pull request #7605: ARROW-9283: [Python] Expose build info
pitrou commented on pull request #7605: URL: https://github.com/apache/arrow/pull/7605#issuecomment-652330730 There's a problem where we already generate `__version__` and it ends up different, for example:
```python
>>> import pyarrow as pa
>>> pa.__version__
'0.17.1.dev572+gf25a014ab.d20200701'
>>> pa.version
'1.0.0'
>>> pa.version_info
VersionInfo(major=1, minor=0, patch=0)
>>> pa.build_info
BuildInfo(version='1.0.0', version_info=VersionInfo(major=1, minor=0, patch=0), so_version='100', full_so_version='100.0.0', git_id='5f1800a4dfaf38754362914faf2776d6c3ec9eb9', git_description='apache-arrow-0.17.0-568-g5f1800a4d', package_kind='')
```
@xhochy @kou What is your advice on this?
[GitHub] [arrow] github-actions[bot] commented on pull request #7605: ARROW-9283: [Python] Expose build info
github-actions[bot] commented on pull request #7605: URL: https://github.com/apache/arrow/pull/7605#issuecomment-652332966 https://issues.apache.org/jira/browse/ARROW-9283
[GitHub] [arrow] lidavidm commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes
lidavidm commented on pull request #7543: URL: https://github.com/apache/arrow/pull/7543#issuecomment-652379935 @liyafan82 The problem actually isn't with big-endian platforms! It's because Java's ByteBuffer [defaults to big-endian](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/ByteBuffer.html#wrap(byte%5B%5D)), and some libraries, like Protobuf, inherit that behavior. We discovered this when developing with Flight internally; there are Protobuf APIs that return a ByteBuffer that may be big-endian; when Flight tries to copy the buffer into an ArrowBuf, it corrupts the data.
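The failure mode lidavidm describes is a plain byte-order mismatch, which can be sketched outside Java. A minimal Python illustration (the `>` order below is what a Java `ByteBuffer` defaults to):

```python
import struct

value = 0x11223344
big_endian_bytes = struct.pack(">i", value)  # big-endian, as a default ByteBuffer would hold it

# If a consumer copies those bytes verbatim and then reads them back
# assuming little-endian order, the value is corrupted:
(misread,) = struct.unpack("<i", big_endian_bytes)
print(hex(misread))  # 0x44332211
```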
[GitHub] [arrow] pitrou commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
pitrou commented on pull request #7544: URL: https://github.com/apache/arrow/pull/7544#issuecomment-652383485 Thanks for the update. I will merge this PR once CI is green.
[GitHub] [arrow] pitrou commented on a change in pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
pitrou commented on a change in pull request #7593: URL: https://github.com/apache/arrow/pull/7593#discussion_r448326108 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -297,6 +297,116 @@ void AddAsciiLength(FunctionRegistry* registry) { DCHECK_OK(registry->AddFunction(std::move(func))); } +// -- +// exact pattern detection + +using StrToBoolTransformFunc = +std::function; + +// Apply `transform` to input character data- this function cannot change the +// length +template +void StringBoolTransform(KernelContext* ctx, const ExecBatch& batch, + StrToBoolTransformFunc transform, Datum* out) { + using offset_type = typename Type::offset_type; + + if (batch[0].kind() == Datum::ARRAY) { +const ArrayData& input = *batch[0].array(); +ArrayData* out_arr = out->mutable_array(); +if (input.length > 0) { + transform( + reinterpret_cast(input.buffers[1]->data()) + input.offset, + input.buffers[2]->data(), input.length, out_arr->offset, + out_arr->buffers[1]->mutable_data()); +} + } else { +const auto& input = checked_cast(*batch[0].scalar()); +auto result = checked_pointer_cast(MakeNullScalar(out->type())); +uint8_t result_value = 0; +if (input.is_valid) { + result->is_valid = true; + std::array offsets{0, + static_cast(input.value->size())}; + transform(offsets.data(), input.value->data(), 1, /*output_offset=*/0, +&result_value); + out->value = std::make_shared(result_value > 0); +} + } +} + +template +void TransformContainsExact(const uint8_t* pattern, int64_t pattern_length, +const offset_type* offsets, const uint8_t* data, +int64_t length, int64_t output_offset, uint8_t* output) { + // This is an implementation of the Knuth-Morris-Pratt algorithm + + // Phase 1: Build the prefix table + std::vector prefix_table(pattern_length + 1); + offset_type prefix_length = -1; + prefix_table[0] = -1; + for (offset_type pos = 0; pos < pattern_length; ++pos) { +// The prefix cannot be expanded, reset. 
+if (prefix_length >= 0 && pattern[pos] != pattern[prefix_length]) { + prefix_length = prefix_table[prefix_length]; +} +prefix_length++; +prefix_table[pos + 1] = prefix_length; + } + + // Phase 2: Find the prefix in the data + FirstTimeBitmapWriter bitmap_writer(output, output_offset, length); + for (int64_t i = 0; i < length; ++i) { +const uint8_t* current_data = data + offsets[i]; +int64_t current_length = offsets[i + 1] - offsets[i]; + +int64_t pattern_pos = 0; +for (int64_t k = 0; k < current_length; k++) { + if (pattern[pattern_pos] == current_data[k]) { +pattern_pos++; +if (pattern_pos == pattern_length) { + bitmap_writer.Set(); + break; +} + } else { +pattern_pos = std::max(0, prefix_table[pattern_pos]); + } +} +bitmap_writer.Next(); + } + bitmap_writer.Finish(); Review comment: Is it me, or are you never clearing the output bit when the pattern isn't found? ## File path: python/pyarrow/compute.py ## @@ -113,6 +113,24 @@ def func(left, right): multiply = _simple_binary_function('multiply') +def contains_exact(array, pattern): +""" +Check whether a pattern occurs as part of the values of the array. Review comment: That's rather vague. Does `contains_exact(pa.array([1,2,3]), pa.array([1,2]))` return true? I don't think so. ## File path: cpp/src/arrow/compute/api_scalar.h ## @@ -259,6 +259,15 @@ Result IsValid(const Datum& values, ExecContext* ctx = NULLPTR); ARROW_EXPORT Result IsNull(const Datum& values, ExecContext* ctx = NULLPTR); +// -- +// String functions + +struct ARROW_EXPORT ContainsExactOptions : public FunctionOptions { + explicit ContainsExactOptions(std::string pattern = "") : pattern(pattern) {} Review comment: I don't think defaulting to the empty string makes sense here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
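The kernel quoted in the review above is a Knuth-Morris-Pratt search. A standalone Python sketch of the same two phases (prefix-table construction, then a scan of each value) may help readers follow the diff; the function names here are illustrative, not part of the Arrow API:

```python
def build_prefix_table(pattern: bytes) -> list:
    # Phase 1: table[i] is the length of the longest proper prefix of
    # pattern[:i] that is also its suffix; -1 is a sentinel at index 0.
    table = [-1] * (len(pattern) + 1)
    prefix_length = -1
    for pos in range(len(pattern)):
        # The prefix cannot be extended by pattern[pos]: fall back.
        while prefix_length >= 0 and pattern[pos] != pattern[prefix_length]:
            prefix_length = table[prefix_length]
        prefix_length += 1
        table[pos + 1] = prefix_length
    return table


def contains_exact(value: bytes, pattern: bytes) -> bool:
    # Phase 2: scan `value`; on mismatch, reuse the already-matched
    # prefix length instead of restarting, keeping the scan linear.
    if not pattern:
        return True  # the empty pattern trivially matches
    table = build_prefix_table(pattern)
    pattern_pos = 0
    for byte in value:
        while pattern_pos >= 0 and byte != pattern[pattern_pos]:
            pattern_pos = table[pattern_pos]
        pattern_pos += 1
        if pattern_pos == len(pattern):
            return True
    return False
```

The empty-pattern guard also illustrates why the reviewer objects to `ContainsExactOptions` defaulting to `""`: an empty pattern matches every value, which is rarely what a caller intends.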
[GitHub] [arrow] liyafan82 commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
liyafan82 commented on pull request #7544: URL: https://github.com/apache/arrow/pull/7544#issuecomment-652395253 @pitrou Thanks a lot for your effort.
[GitHub] [arrow] liyafan82 commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes
liyafan82 commented on pull request #7543: URL: https://github.com/apache/arrow/pull/7543#issuecomment-652395634 > @liyafan82 The problem actually isn't with big-endian platforms! It's because Java's ByteBuffer [defaults to big-endian](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/ByteBuffer.html#wrap(byte%5B%5D)), and some libraries, like Protobuf, inherit that behavior. We discovered this when developing with Flight internally; there are Protobuf APIs that return a ByteBuffer that may be big-endian; when Flight tries to copy the buffer into an ArrowBuf, it corrupts the data. @lidavidm Thanks a lot for your clarification. This is really a problem that needs to be fixed.
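The corruption lidavidm describes can be reproduced outside Java. A minimal Python sketch of the same byte-order mismatch, using the standard `struct` module (the value is illustrative):

```python
import struct

value = 0x01020304
big = struct.pack(">i", value)     # Java ByteBuffer's default layout
little = struct.pack("<i", value)  # layout a little-endian Arrow buffer expects

# Copying the big-endian bytes verbatim and reading them back as
# little-endian swaps the bytes and corrupts the value:
(corrupted,) = struct.unpack("<i", big)
(ok,) = struct.unpack("<i", little)

assert corrupted == 0x04030201
assert ok == value
```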
[GitHub] [arrow] liyafan82 commented on a change in pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes
liyafan82 commented on a change in pull request #7543: URL: https://github.com/apache/arrow/pull/7543#discussion_r448338460 ## File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java ## @@ -872,6 +874,7 @@ public void setBytes(long index, ByteBuffer src) { --length; ++dstAddress; } +src.order(originalByteOrder); Review comment: should we place this in a finally block?
[GitHub] [arrow] lidavidm commented on a change in pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes
lidavidm commented on a change in pull request #7543: URL: https://github.com/apache/arrow/pull/7543#discussion_r448345297 ## File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java ## @@ -872,6 +874,7 @@ public void setBytes(long index, ByteBuffer src) { --length; ++dstAddress; } +src.order(originalByteOrder); Review comment: Good idea, I've updated the PR.
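The fix being agreed on is the classic save/restore-in-finally pattern: the source's byte-order state must be put back even if the copy throws. A minimal Python sketch under assumed names (`set_bytes` and `order_state` stand in for `ArrowBuf.setBytes` and the ByteBuffer's order; they are not real APIs):

```python
def set_bytes(dst, index, src, order_state):
    # Flip the source's byte-order state for the duration of the copy,
    # restoring it in `finally` so the caller's view is unchanged even
    # if the copy raises.
    original_order = order_state["order"]
    order_state["order"] = "little"
    try:
        dst[index:index + len(src)] = src
    finally:
        # Runs on both success and exception, mirroring the Java fix.
        order_state["order"] = original_order
```

Even when `dst` is invalid and the slice assignment raises, the caller still observes the original order afterwards.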
[GitHub] [arrow] pitrou opened a new pull request #7606: ARROW-8434: [C++] Avoid multiple schema deserializations in RecordBatchFileReader
pitrou opened a new pull request #7606: URL: https://github.com/apache/arrow/pull/7606 This doesn't seem to make a difference in the included benchmark.
[GitHub] [arrow] jianxind opened a new pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean
jianxind opened a new pull request #7607: URL: https://github.com/apache/arrow/pull/7607
1. Add AVX2/AVX512 build versions of the aggregate sum/mean functions. Use set_source_files_properties to append the SIMD build options. Register the SIMD path at runtime based on the detected CPU features. Move the sum/mean implementation to a header file so the SIMD versions can reuse it. Also rework the implementation based on bit_block_counter.
2. Introduce an ARROW_USER_SIMD_LEVEL env variable for test scoping, e.g. "ARROW_USER_SIMD_LEVEL=sse4_2 make unittest" on an AVX2-capable device.
[GitHub] [arrow] jianxind commented on pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean
jianxind commented on pull request #7607: URL: https://github.com/apache/arrow/pull/7607#issuecomment-652425725 @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark
[GitHub] [arrow] jianxind commented on pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean
jianxind commented on pull request #7607: URL: https://github.com/apache/arrow/pull/7607#issuecomment-652431931 @emkornfield This is the new version of the sum aggregate without intrinsics; could you help to review? The dense part gets nearly the same scores as the intrinsic version for AVX2 on clang; the gcc result is a little lower. Below are the benchmark results (null_percent 0 and 0.01%) on an AVX2 (i7-8700) device.
Before
```
SumKernelFloat/1048576/1    97.5 us   97.3 us    7150  bytes_per_second=10.0336G/s null_percent=0.01 size=1048.58k
SumKernelFloat/1048576/0    62.1 us   62.0 us   11292  bytes_per_second=15.7443G/s null_percent=0 size=1048.58k
SumKernelDouble/1048576/1   35.4 us   35.4 us   19781  bytes_per_second=27.5977G/s null_percent=0.01 size=1048.58k
SumKernelDouble/1048576/0   32.5 us   32.5 us   21534  bytes_per_second=30.0657G/s null_percent=0 size=1048.58k
SumKernelInt8/1048576/1      183 us    183 us    3832  bytes_per_second=5.34627G/s null_percent=0.01 size=1048.58k
SumKernelInt8/1048576/0      133 us    132 us    5285  bytes_per_second=7.37317G/s null_percent=0 size=1048.58k
SumKernelInt16/1048576/1    93.3 us   93.2 us    7505  bytes_per_second=10.4762G/s null_percent=0.01 size=1048.58k
SumKernelInt16/1048576/0    68.4 us   68.3 us   10249  bytes_per_second=14.2887G/s null_percent=0 size=1048.58k
SumKernelInt32/1048576/1    49.2 us   49.1 us   14255  bytes_per_second=19.874G/s null_percent=0.01 size=1048.58k
SumKernelInt32/1048576/0    40.3 us   40.2 us   17654  bytes_per_second=24.2688G/s null_percent=0 size=1048.58k
SumKernelInt64/1048576/1    35.3 us   35.3 us   19870  bytes_per_second=27.6826G/s null_percent=0.01 size=1048.58k
SumKernelInt64/1048576/0    32.4 us   32.3 us   21628  bytes_per_second=30.1902G/s null_percent=0 size=1048.58k
```
After
```
SumKernelFloat/1048576/1    41.1 us   41.0 us   17004  bytes_per_second=23.7947G/s null_percent=0.01 size=1048.58k
SumKernelFloat/1048576/0    25.1 us   25.0 us   27884  bytes_per_second=38.9922G/s null_percent=0 size=1048.58k
SumKernelDouble/1048576/1   24.6 us   24.5 us   28423  bytes_per_second=39.8205G/s null_percent=0.01 size=1048.58k
SumKernelDouble/1048576/0   17.1 us   17.1 us   40881  bytes_per_second=57.1186G/s null_percent=0 size=1048.58k
SumKernelInt8/1048576/1      116 us    115 us    6073  bytes_per_second=8.46685G/s null_percent=0.01 size=1048.58k
SumKernelInt8/1048576/0     61.0 us   60.9 us   11501  bytes_per_second=16.0293G/s null_percent=0 size=1048.58k
SumKernelInt16/1048576/1    62.2 us   62.2 us   11250  bytes_per_second=15.7108G/s null_percent=0.01 size=1048.58k
SumKernelInt16/1048576/0    37.0 us   37.0 us   19883  bytes_per_second=26.4204G/s null_percent=0 size=1048.58k
SumKernelInt32/1048576/1    38.4 us   38.4 us   18217  bytes_per_second=25.4367G/s null_percent=0.01 size=1048.58k
SumKernelInt32/1048576/0    24.6 us   24.5 us   28531  bytes_per_second=39.8216G/s null_percent=0 size=1048.58k
SumKernelInt64/1048576/1    24.1 us   24.1 us   29069  bytes_per_second=40.5531G/s null_percent=0.01 size=1048.58k
SumKernelInt64/1048576/0    16.7 us   16.7 us   41887  bytes_per_second=58.3943G/s null_percent=0 size=1048.58k
```
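The kernel being benchmarked sums values subject to an optional validity bitmap, which is why each case is run at null_percent 0 (the dense fast path) and 0.01. A scalar Python sketch of that contract (names are illustrative; the real C++ code walks the bitmap in blocks via bit_block_counter and in SIMD lanes):

```python
def bitmap_sum(values, validity_bitmap=None):
    """Sum `values`, skipping slots whose validity bit is 0.

    `validity_bitmap` is a bytes object with one bit per value
    (least-significant-bit numbering, as in Arrow); None means all
    values are valid, which is the dense path the AVX code targets.
    """
    if validity_bitmap is None:
        return sum(values)  # dense path: no per-element branching
    total = 0
    for i, v in enumerate(values):
        if validity_bitmap[i // 8] & (1 << (i % 8)):
            total += v
    return total
```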
[GitHub] [arrow] github-actions[bot] commented on pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean
github-actions[bot] commented on pull request #7607: URL: https://github.com/apache/arrow/pull/7607#issuecomment-652433109 https://issues.apache.org/jira/browse/ARROW-8996
[GitHub] [arrow] github-actions[bot] commented on pull request #7606: ARROW-8434: [C++] Avoid multiple schema deserializations in RecordBatchFileReader
github-actions[bot] commented on pull request #7606: URL: https://github.com/apache/arrow/pull/7606#issuecomment-652433108 https://issues.apache.org/jira/browse/ARROW-8434
[GitHub] [arrow] wesm commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
wesm commented on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652450464 > Could you elaborate? Why is this not a problem with the lower/upper kernels? The data preallocation is only done for fixed-size outputs (e.g. booleans, integers, floating point).
[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion
emkornfield commented on a change in pull request #7604: URL: https://github.com/apache/arrow/pull/7604#discussion_r448413419 ## File path: cpp/src/arrow/python/datetime.cc ## @@ -262,6 +265,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) { PyDateTime_GET_DAY(pydate)); } +// GIL must be held when calling this function. +Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) { + static const std::regex kFixedOffset("([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$"); + OwnedRef pytz; + RETURN_NOT_OK(internal::ImportModule("pytz", &pytz)); + + std::cmatch match; + if (std::regex_match(tz.c_str(), match, kFixedOffset)) { Review comment: Should avoid c_str here
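The `kFixedOffset` expression in the diff accepts offsets from -23:59 through +23:59 and rejects named zones, which `StringToTzinfo` then resolves via pytz instead. The same pattern behaves identically in Python's `re` (the helper name is illustrative):

```python
import re

# Same pattern as the C++ kFixedOffset regex quoted above.
FIXED_OFFSET = re.compile(r"([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$")


def parse_fixed_offset(tz):
    # Return (sign, hours, minutes) for a fixed-offset string, else None.
    match = FIXED_OFFSET.match(tz)
    if match is None:
        return None
    sign, hours, minutes = match.groups()
    return sign, int(hours), int(minutes)
```

For example, "+05:30" parses while "Europe/Paris" and out-of-range offsets like "+24:00" do not.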
[GitHub] [arrow] jorisvandenbossche opened a new pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartitioning
jorisvandenbossche opened a new pull request #7608: URL: https://github.com/apache/arrow/pull/7608
[GitHub] [arrow] github-actions[bot] commented on pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartitioning
github-actions[bot] commented on pull request #7608: URL: https://github.com/apache/arrow/pull/7608#issuecomment-652480774 https://issues.apache.org/jira/browse/ARROW-9288
[GitHub] [arrow] kiszk commented on pull request #7507: ARROW-8797: [C++] Create test to receive RecordBatch for different endian
kiszk commented on pull request #7507: URL: https://github.com/apache/arrow/pull/7507#issuecomment-652482871 @wesm Thank you for your suggestion. I will pursue the approach that you suggested. I will check the integration test command line tool and the integration test with the JSON_TO_ARROW mode.
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartitioning
jorisvandenbossche commented on a change in pull request #7608: URL: https://github.com/apache/arrow/pull/7608#discussion_r448438477 ## File path: cpp/src/arrow/dataset/partition.cc ## @@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory { } } -return impl.Finish(&dictionaries_); +auto schema_result = impl.Finish(&dictionaries_); +field_names_ = impl.FieldNames(); +return schema_result; } Result> Finish( const std::shared_ptr& schema) const override { -return std::shared_ptr(new HivePartitioning(schema, dictionaries_)); +for (FieldRef ref : field_names_) { Review comment: I should probably guard here against the case that `field_names_` was not yet updated (if `Finish` is called without `Inspect` being called), with an empty vector?
[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion
emkornfield commented on a change in pull request #7604: URL: https://github.com/apache/arrow/pull/7604#discussion_r448445229 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -951,8 +951,21 @@ struct ObjectWriterVisitor { template enable_if_timestamp Visit(const Type& type) { const TimeUnit::type unit = type.unit(); -auto WrapValue = [unit](typename Type::c_type value, PyObject** out) { +PyObject* tzinfo = nullptr; +if (!type.timezone().empty()) { + RETURN_NOT_OK(internal::StringToTzinfo(type.timezone(), &tzinfo)); +} +auto WrapValue = [unit, tzinfo](typename Type::c_type value, PyObject** out) { + OwnedRef tz(tzinfo); RETURN_NOT_OK(internal::PyDateTime_from_int(value, unit, out)); + RETURN_IF_PYERROR(); + if (tzinfo != nullptr) { +PyObject* with_tz = PyObject_CallMethod(*out, "astimezone", "O", tzinfo); Review comment: should always set UTC here first.
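The point of the review comment is that a timestamp decoded from Arrow's epoch representation is wall-clock UTC, but `PyDateTime_from_int` yields a *naive* datetime, so it must be anchored to UTC before `astimezone`. Demonstrated with the standard library (the date and target offset are illustrative):

```python
from datetime import datetime, timedelta, timezone

# An epoch-derived value is wall-clock UTC, but arrives as a naive datetime.
naive = datetime(2020, 1, 1, 12, 0)
target = timezone(timedelta(hours=5, minutes=30))

# Right: declare the naive value as UTC first, then convert.
correct = naive.replace(tzinfo=timezone.utc).astimezone(target)
assert (correct.hour, correct.minute) == (17, 30)

# Wrong: naive.astimezone(target) would first interpret the value in the
# machine's *local* timezone, making the result machine-dependent.
```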
[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type
jorisvandenbossche commented on pull request #7536: URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993 @bkietz thanks for the update ensuring all uniques as dictionary values! Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608. Further, a usability issue: this now creates partition expressions that use a dictionary type. Which means that doing something like `dataset.to_table(filter=ds.field("part") == "A")` to filter on the partition field with a plain string expression doesn't work, limiting the usability of this option (and even with the new Python scalar stuff, it would not be easy to construct the correct expression):
```
In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
In [11]: fragment = list(dataset.get_fragments())[0]
In [12]: fragment.partition_expression
Out[12]: )>
In [13]: dataset.to_table(filter=ds.field("part") == "A")
...
ArrowNotImplementedError: cast from string
```
It might also be an option to have the `partition_expression` use the dictionary *value type* instead of the dictionary type?
[GitHub] [arrow] jorisvandenbossche commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values
jorisvandenbossche commented on pull request #7546: URL: https://github.com/apache/arrow/pull/7546#issuecomment-652497706 @rjzamora `num_rows` is already available on the RowGroupInfo object (https://github.com/apache/arrow/blob/cd3ed605857994575326c072bbfcf995541fa80e/python/pyarrow/_dataset.pyx#L845-L847) For the `total_byte_size`, can you open a JIRA for this? (it should be similar to `num_rows` to get / cache from the parquet row group, I think)
[GitHub] [arrow] nealrichardson opened a new pull request #7609: ARROW-9289: [R] Remove deprecated functions
nealrichardson opened a new pull request #7609: URL: https://github.com/apache/arrow/pull/7609
[GitHub] [arrow] nealrichardson closed pull request #7602: ARROW-9083: [R] collect int64, uint32, uint64 as R integer type if not out of bounds
nealrichardson closed pull request #7602: URL: https://github.com/apache/arrow/pull/7602
[GitHub] [arrow] jorisvandenbossche commented on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
jorisvandenbossche commented on pull request #7478: URL: https://github.com/apache/arrow/pull/7478#issuecomment-652504934 > We will have to resolve the sum([]) -> null/0 by introducing a "minimum valid values" option. Do we already have a JIRA to track this?
[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type
jorisvandenbossche edited a comment on pull request #7478: URL: https://github.com/apache/arrow/pull/7478#issuecomment-652504934 > We will have to resolve the sum([]) -> null/0 by introducing a "minimum valid values" option. Do we already have a JIRA to track this? EDIT -> it is mentioned in ARROW-9054
[GitHub] [arrow] kiszk commented on pull request #7596: ARROW-9163: [C++] Validate UTF8 contents of a StringArray
kiszk commented on pull request #7596: URL: https://github.com/apache/arrow/pull/7596#issuecomment-652505301 Looks good
[GitHub] [arrow] github-actions[bot] commented on pull request #7609: ARROW-9289: [R] Remove deprecated functions
github-actions[bot] commented on pull request #7609: URL: https://github.com/apache/arrow/pull/7609#issuecomment-652507951 https://issues.apache.org/jira/browse/ARROW-9289
[GitHub] [arrow] rymurr commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
rymurr commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-652516529 This has been modified to incorporate the changes to Unions as proposed on the mailing list
[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
nealrichardson commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r448483516 ## File path: r/src/array_from_vector.cpp ## @@ -918,6 +923,97 @@ class Time64Converter : public TimeConverter { } }; +template +class BinaryVectorConverter : public VectorConverter { + public: + ~BinaryVectorConverter() {} + + Status Init(ArrayBuilder* builder) { +typed_builder_ = checked_cast(builder); +return Status::OK(); + } + + Status Ingest(SEXP obj) { +ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list")); +R_xlen_t n = XLENGTH(obj); + +// Reserve enough space before appending +int64_t size = 0; +for (R_xlen_t i = 0; i < n; i++) { + SEXP obj_i = VECTOR_ELT(obj, i); + if (!Rf_isNull(obj_i)) { +ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP, +Status::RError("Expecting a raw vector")); +size += XLENGTH(obj_i); + } +} +RETURN_NOT_OK(typed_builder_->Reserve(size)); + +// append +for (R_xlen_t i = 0; i < n; i++) { + SEXP obj_i = VECTOR_ELT(obj, i); + if (Rf_isNull(obj_i)) { +RETURN_NOT_OK(typed_builder_->AppendNull()); + } else { +RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i))); + } +} +return Status::OK(); + } + + Status GetResult(std::shared_ptr* result) { +return typed_builder_->Finish(result); + } + + private: + Builder* typed_builder_; +}; + +template +class StringVectorConverter : public VectorConverter { + public: + ~StringVectorConverter() {} + + Status Init(ArrayBuilder* builder) { +typed_builder_ = checked_cast(builder); +return Status::OK(); + } + + Status Ingest(SEXP obj) { +ARROW_RETURN_IF(TYPEOF(obj) != STRSXP, +Status::RError("Expecting a character vector")); +R_xlen_t n = XLENGTH(obj); + +// Reserve enough space before appending +int64_t size = 0; +for (R_xlen_t i = 0; i < n; i++) { + SEXP string_i = STRING_ELT(obj, i); + if (string_i != NA_STRING) { +size += XLENGTH(string_i); + } +} +RETURN_NOT_OK(typed_builder_->Reserve(size)); + +// append +for (R_xlen_t i = 0; i < n; i++) { + SEXP 
string_i = STRING_ELT(obj, i); + if (string_i == NA_STRING) { +RETURN_NOT_OK(typed_builder_->AppendNull()); + } else { +RETURN_NOT_OK(typed_builder_->Append(CHAR(string_i), XLENGTH(string_i))); Review comment: I fixed this; will deal with efficiency/redundancy in a followup.
[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
nealrichardson commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r448490350 ## File path: r/src/array_to_vector.cpp ## @@ -693,6 +741,9 @@ std::shared_ptr Converter::Make(const std::shared_ptr& type case Type::BOOL: return std::make_shared(std::move(arrays)); +case Type::BINARY: Review comment: Fixed types: https://issues.apache.org/jira/browse/ARROW-9291
[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
nealrichardson commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r448491760 ## File path: r/src/array_from_vector.cpp ## @@ -918,6 +923,97 @@ class Time64Converter : public TimeConverter { } }; +template +class BinaryVectorConverter : public VectorConverter { + public: + ~BinaryVectorConverter() {} + + Status Init(ArrayBuilder* builder) { +typed_builder_ = checked_cast(builder); +return Status::OK(); + } + + Status Ingest(SEXP obj) { +ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list")); +R_xlen_t n = XLENGTH(obj); + +// Reserve enough space before appending +int64_t size = 0; +for (R_xlen_t i = 0; i < n; i++) { + SEXP obj_i = VECTOR_ELT(obj, i); + if (!Rf_isNull(obj_i)) { +ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP, +Status::RError("Expecting a raw vector")); +size += XLENGTH(obj_i); + } +} +RETURN_NOT_OK(typed_builder_->Reserve(size)); + +// append +for (R_xlen_t i = 0; i < n; i++) { + SEXP obj_i = VECTOR_ELT(obj, i); + if (Rf_isNull(obj_i)) { +RETURN_NOT_OK(typed_builder_->AppendNull()); + } else { +RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i))); + } +} +return Status::OK(); + } + + Status GetResult(std::shared_ptr* result) { +return typed_builder_->Finish(result); + } + + private: + Builder* typed_builder_; +}; + +template +class StringVectorConverter : public VectorConverter { + public: + ~StringVectorConverter() {} + + Status Init(ArrayBuilder* builder) { +typed_builder_ = checked_cast(builder); +return Status::OK(); + } + + Status Ingest(SEXP obj) { +ARROW_RETURN_IF(TYPEOF(obj) != STRSXP, +Status::RError("Expecting a character vector")); +R_xlen_t n = XLENGTH(obj); + +// Reserve enough space before appending +int64_t size = 0; +for (R_xlen_t i = 0; i < n; i++) { + SEXP string_i = STRING_ELT(obj, i); + if (string_i != NA_STRING) { +size += XLENGTH(string_i); + } +} +RETURN_NOT_OK(typed_builder_->Reserve(size)); + +// append +for (R_xlen_t i = 0; i < n; i++) { + SEXP 
string_i = STRING_ELT(obj, i); + if (string_i == NA_STRING) { +RETURN_NOT_OK(typed_builder_->AppendNull()); + } else { +RETURN_NOT_OK(typed_builder_->Append(CHAR(string_i), XLENGTH(string_i))); Review comment: Added a note to https://issues.apache.org/jira/browse/ARROW-7798 about the redundancy. I checked code coverage and it appears that both paths are being called, so that should be addressed (my guess is that factor levels are still going through the old path).
[GitHub] [arrow] BryanCutler commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release
BryanCutler commented on pull request #6316: URL: https://github.com/apache/arrow/pull/6316#issuecomment-652541712 @kszucs the nightly build against Spark master has been passing. Do you think you could update this to just add the test against branch-3.0 and remove branch-2.4 for now? I'm not sure if the bot will pick it up from this PR, but I'll try to kick off a test run.
[GitHub] [arrow] nealrichardson closed pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
nealrichardson closed pull request #7514: URL: https://github.com/apache/arrow/pull/7514
[GitHub] [arrow] BryanCutler commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release
BryanCutler commented on pull request #6316: URL: https://github.com/apache/arrow/pull/6316#issuecomment-652542479 @github-actions crossbow submit test-conda-python-3.7-spark-branch-3.0
[GitHub] [arrow] BryanCutler commented on a change in pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release
BryanCutler commented on a change in pull request #6316: URL: https://github.com/apache/arrow/pull/6316#discussion_r448503228

## File path: dev/tasks/tasks.yml ##

@@ -1833,12 +1833,32 @@ tasks:
       HDFS: 2.9.2
     run: conda-python-hdfs

-  test-conda-python-3.7-spark-master:
+  test-conda-python-3.6-spark-2.4.6:
     ci: github
     template: docker-tests/github.linux.yml
     params:
       env:
-        PYTHON: 3.7
+        PYTHON: 3.6
+        SPARK: "v2.4.6"
+    # use the master branch of spark, so prevent reusing any layers
+    run: --no-leaf-cache --env ARROW_PRE_0_15_IPC_FORMAT=1 conda-python-spark
+
+  test-conda-python-3.7-spark-branch-3.0:

Review comment: looks like this should be named python-3.8 also
[GitHub] [arrow] nealrichardson closed pull request #7609: ARROW-9289: [R] Remove deprecated functions
nealrichardson closed pull request #7609: URL: https://github.com/apache/arrow/pull/7609
[GitHub] [arrow] github-actions[bot] commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release
github-actions[bot] commented on pull request #6316: URL: https://github.com/apache/arrow/pull/6316#issuecomment-652544533

Revision: a914eea4f3ab16e359adee2f37a4fb30a1eba86c

Submitted crossbow builds: [ursa-labs/crossbow @ actions-371](https://github.com/ursa-labs/crossbow/branches/all?query=actions-371)

| Task | Status |
|------|--------|
| test-conda-python-3.7-spark-branch-3.0 | [![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-371-github-test-conda-python-3.7-spark-branch-3.0)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-371-github-test-conda-python-3.7-spark-branch-3.0) |
[GitHub] [arrow] kiszk commented on a change in pull request #7593: ARROW-9160: [C++] Implement contains for exact matches
kiszk commented on a change in pull request #7593: URL: https://github.com/apache/arrow/pull/7593#discussion_r448506425

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ##

@@ -297,6 +297,116 @@ void AddAsciiLength(FunctionRegistry* registry) {
   DCHECK_OK(registry->AddFunction(std::move(func)));
 }

+// ----------------------------------------------------------------------
+// exact pattern detection
+
+using StrToBoolTransformFunc =
+    std::function<void(const void*, const uint8_t*, int64_t, int64_t, uint8_t*)>;
+
+// Apply `transform` to input character data; this function cannot change the
+// length
+template <typename Type>
+void StringBoolTransform(KernelContext* ctx, const ExecBatch& batch,
+                         StrToBoolTransformFunc transform, Datum* out) {
+  using offset_type = typename Type::offset_type;
+
+  if (batch[0].kind() == Datum::ARRAY) {
+    const ArrayData& input = *batch[0].array();
+    ArrayData* out_arr = out->mutable_array();
+    if (input.length > 0) {
+      transform(
+          reinterpret_cast<const offset_type*>(input.buffers[1]->data()) + input.offset,
+          input.buffers[2]->data(), input.length, out_arr->offset,
+          out_arr->buffers[1]->mutable_data());
+    }
+  } else {
+    const auto& input = checked_cast<const BaseBinaryScalar&>(*batch[0].scalar());
+    auto result = checked_pointer_cast<BooleanScalar>(MakeNullScalar(out->type()));

Review comment: We can move this line into the then clause before line 327.
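The kernel quoted above produces, for each input string, a boolean saying whether it contains an exact (literal, non-regex) pattern, with nulls propagated through. A rough Python model of those semantics (the function name and list-based representation are illustrative, not the C++ API):

```python
def contains_exact(values, pattern):
    """Model of an exact-match 'contains' kernel: for each element,
    emit True/False for literal substring containment, None for null."""
    return [None if v is None else (pattern in v) for v in values]


print(contains_exact(["arrow", "parquet", None, "row"], "row"))
# → [True, False, None, True]
```

The real kernel works over the offsets and data buffers directly (the `transform` callback receives raw buffer pointers), but the element-wise result is the same boolean mask.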
[GitHub] [arrow] saethlin opened a new pull request #7610: ARROW-9290: [Rust] [Parquet] Add features to allow opting out of dependencies
saethlin opened a new pull request #7610: URL: https://github.com/apache/arrow/pull/7610
[GitHub] [arrow] github-actions[bot] commented on pull request #7610: ARROW-9290: [Rust] [Parquet] Add features to allow opting out of dependencies
github-actions[bot] commented on pull request #7610: URL: https://github.com/apache/arrow/pull/7610#issuecomment-652553010 https://issues.apache.org/jira/browse/ARROW-9290
[GitHub] [arrow] nealrichardson opened a new pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type
nealrichardson opened a new pull request #7611: URL: https://github.com/apache/arrow/pull/7611
[GitHub] [arrow] github-actions[bot] commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type
github-actions[bot] commented on pull request #7611: URL: https://github.com/apache/arrow/pull/7611#issuecomment-652580102 https://issues.apache.org/jira/browse/ARROW-3308
[GitHub] [arrow] pitrou closed pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec
pitrou closed pull request #7544: URL: https://github.com/apache/arrow/pull/7544
[GitHub] [arrow] pitrou opened a new pull request #7612: ARROW-7011: [C++] Implement casts from float/double to decimal
pitrou opened a new pull request #7612: URL: https://github.com/apache/arrow/pull/7612 Also naturally available in Python using the Array.cast() method.
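The float/double-to-decimal conversion this PR implements can be sketched with Python's stdlib decimal module. Everything here is illustrative: the function name, default precision/scale, and the rounding mode are assumptions for the sketch, not the behavior of the C++ kernel.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def float_to_decimal(x, precision=5, scale=2):
    """Convert a float to a fixed-scale decimal, checking that the
    result fits in the requested precision (total number of digits).
    Rounding mode here is banker's rounding, chosen for illustration."""
    quantum = Decimal(1).scaleb(-scale)  # e.g. Decimal('0.01') for scale=2
    d = Decimal(x).quantize(quantum, rounding=ROUND_HALF_EVEN)
    if len(d.as_tuple().digits) > precision:
        raise ValueError(f"{x} does not fit decimal({precision}, {scale})")
    return d


print(float_to_decimal(1.5))    # → 1.50
print(float_to_decimal(2.675))  # 2.675 is not exact in binary, so it rounds
```

Note that `Decimal(x)` sees the exact binary value of the float (2.675 is really 2.67499…), which is why a float-to-decimal cast has to pick a rounding policy at all.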
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r448550619

## File path: python/pyarrow/scalar.pxi ##

@@ -16,1198 +16,748 @@
 # under the License.

-_NULL = NA = None
+import collections

 cdef class Scalar:
     """
-    The base class for all array elements.
-    """
-
-
-cdef class NullType(Scalar):
-    """
-    Singleton for null array elements.
+    The base class for scalars.
     """
-    # TODO rename this NullValue?
-
-    def __cinit__(self):
-        global NA
-        if NA is not None:
-            raise Exception('Cannot create multiple NAType instances')
-
-        self.type = null()
-
-    def __repr__(self):
-        return 'NULL'

-    def as_py(self):
-        """
-        Return None
-        """
-        return None
+    def __init__(self):
+        raise TypeError("Do not call {}'s constructor directly, use "
+                        "pa.scalar() instead.".format(self.__class__.__name__))

-    def __eq__(self, other):
-        return NA
+    cdef void init(self, const shared_ptr[CScalar]& wrapped):
+        self.wrapped = wrapped

+    @staticmethod
+    cdef wrap(const shared_ptr[CScalar]& wrapped):
+        cdef:
+            Scalar self
+            Type type_id = wrapped.get().type.get().id()

-_NULL = NA = NullType()
+        if type_id == _Type_NA:
+            return _NULL

+        typ = _scalar_classes[type_id]
+        self = typ.__new__(typ)
+        self.init(wrapped)

-cdef class ArrayValue(Scalar):
-    """
-    The base class for non-null array elements.
-    """
+        return self

-    def __init__(self):
-        raise TypeError("Do not call {}'s constructor directly, use array "
-                        "subscription instead."
-                        .format(self.__class__.__name__))
+    cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+        return self.wrapped

-    cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-                   int64_t index):
-        self.type = type
-        self.index = index
-        self._set_array(sp_array)
+    @property
+    def type(self):
+        """
+        Data type of the Scalar object.
+        """
+        return pyarrow_wrap_data_type(self.wrapped.get().type)

-    cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-        self.sp_array = sp_array
+    @property
+    def is_valid(self):
+        """
+        Holds a valid (non-null) value.
+        """
+        return self.wrapped.get().is_valid

     def __repr__(self):
-        if hasattr(self, 'as_py'):
-            return repr(self.as_py())
-        else:
-            return super(Scalar, self).__repr__()
+        return '<pyarrow.{}: {!r}>'.format(
+            self.__class__.__name__, self.as_py()
+        )

     def __str__(self):
-        if hasattr(self, 'as_py'):
-            return str(self.as_py())
-        else:
-            return super(Scalar, self).__str__()
+        return str(self.as_py())
+
+    def equals(self, Scalar other):
+        return self.wrapped.get().Equals(other.unwrap().get()[0])

     def __eq__(self, other):
-        if hasattr(self, 'as_py'):
-            if isinstance(other, ArrayValue):
-                other = other.as_py()
-            return self.as_py() == other
-        else:
-            raise NotImplementedError(
-                "Cannot compare Arrow values that don't support as_py()")
+        try:
+            if not isinstance(other, Scalar):
+                other = scalar(other, type=self.type)
+            return self.equals(other)
+        except (TypeError, ValueError, ArrowInvalid):
+            return NotImplemented

     def __hash__(self):
-        return hash(self.as_py())
+        cdef CScalarHash hasher
+        return hasher(self.wrapped)

+    def as_py(self):
+        raise NotImplementedError()

-cdef class BooleanValue(ArrayValue):
-    """
-    Concrete class for boolean array elements.
-    """

-    def as_py(self):
-        """
-        Return this value as a Python bool.
-        """
-        cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-        return ap.Value(self.index)
+_NULL = NA = None

-cdef class Int8Value(ArrayValue):
+cdef class NullScalar(Scalar):
     """
-    Concrete class for int8 array elements.
+    Concrete class for null scalars.
     """

-    def as_py(self):
-        """
-        Return this value as a Python int.
-        """
-        cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-        return ap.Value(self.index)
+    def __cinit__(self):
+        global NA
+        if NA is not None:
+            raise RuntimeError('Cannot create multiple NullScalar instances')
+        self.init(shared_ptr[CScalar](new CNullScalar()))

+    def __init__(self):
+        pass

-cdef class UInt8Value(ArrayValue):
-    """
-    Concrete class for uint8 array elements.
-    """
+    def __eq__(self, other):
+        return NA
+
+    def __hash__(self):
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r448550822

## File path: python/pyarrow/scalar.pxi ##

@@ -16,1198 +16,748 @@
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r448551033

## File path: python/pyarrow/tests/test_scalars.py ##

@@ -16,427 +16,443 @@
 # under the License.

 import datetime
+import decimal

 import pytest
-import unittest

 import numpy as np

 import pyarrow as pa


-class TestScalars(unittest.TestCase):
-
-    def test_null_singleton(self):
-        with pytest.raises(Exception):
-            pa.NAType()
+@pytest.mark.parametrize(['value', 'ty', 'klass', 'deprecated'], [
+    (False, None, pa.BooleanScalar, pa.BooleanValue),
+    (True, None, pa.BooleanScalar, pa.BooleanValue),
+    (1, None, pa.Int64Scalar, pa.Int64Value),
+    (-1, None, pa.Int64Scalar, pa.Int64Value),
+    (1, pa.int8(), pa.Int8Scalar, pa.Int8Value),
+    (1, pa.uint8(), pa.UInt8Scalar, pa.UInt8Value),
+    (1, pa.int16(), pa.Int16Scalar, pa.Int16Value),
+    (1, pa.uint16(), pa.UInt16Scalar, pa.UInt16Value),
+    (1, pa.int32(), pa.Int32Scalar, pa.Int32Value),
+    (1, pa.uint32(), pa.UInt32Scalar, pa.UInt32Value),
+    (1, pa.int64(), pa.Int64Scalar, pa.Int64Value),
+    (1, pa.uint64(), pa.UInt64Scalar, pa.UInt64Value),
+    (1.0, None, pa.DoubleScalar, pa.DoubleValue),
+    (np.float16(1.0), pa.float16(), pa.HalfFloatScalar, pa.HalfFloatValue),
+    (1.0, pa.float32(), pa.FloatScalar, pa.FloatValue),
+    ("string", None, pa.StringScalar, pa.StringValue),
+    (b"bytes", None, pa.BinaryScalar, pa.BinaryValue),
+    ([1, 2, 3], None, pa.ListScalar, pa.ListValue),
+    ([1, 2, 3, 4], pa.large_list(pa.int8()), pa.LargeListScalar,
+     pa.LargeListValue),
+    (datetime.date.today(), None, pa.Date32Scalar, pa.Date64Value),
+    (datetime.datetime.now(), None, pa.TimestampScalar, pa.TimestampValue),
+    ({'a': 1, 'b': [1, 2]}, None, pa.StructScalar, pa.StructValue)
+])
+def test_basics(value, ty, klass, deprecated):
+    s = pa.scalar(value, type=ty)
+    assert isinstance(s, klass)
+    assert s == value
+    assert s == s
+    assert s != "else"
+    assert hash(s) == hash(s)
+    assert s.is_valid is True
+    with pytest.warns(FutureWarning):
+        isinstance(s, deprecated)
+
+    s = pa.scalar(None, type=s.type)
+    assert s.is_valid is False
+    assert s.as_py() is None
+
+
+def test_null_singleton():
+    with pytest.raises(Exception):
+        pa.NullScalar()
+
+
+def test_nulls():
+    null = pa.scalar(None)
+    assert null is pa.NA
+    assert null.as_py() is None
+    assert (null == "something") is pa.NA

Review comment: Added.
[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r448551482

## File path: python/pyarrow/scalar.pxi ##

@@ -16,1198 +16,745 @@
 # under the License.

-_NULL = NA = None
+import collections

 cdef class Scalar:
     """
-    The base class for all array elements.
+    The base class for scalars.
     """

+    def __init__(self):
+        raise TypeError("Do not call {}'s constructor directly, use "
+                        "pa.scalar() instead.".format(self.__class__.__name__))

-cdef class NullType(Scalar):
-    """
-    Singleton for null array elements.
-    """
-    # TODO rename this NullValue?
+    cdef void init(self, const shared_ptr[CScalar]& wrapped):
+        self.wrapped = wrapped

-    def __cinit__(self):
-        global NA
-        if NA is not None:
-            raise Exception('Cannot create multiple NAType instances')
+    @staticmethod
+    cdef wrap(const shared_ptr[CScalar]& wrapped):
+        cdef:
+            Scalar self
+            Type type_id = wrapped.get().type.get().id()
+
+        if type_id == _Type_NA:
+            return _NULL
+
+        typ = _scalar_classes[type_id]
+        self = typ.__new__(typ)
+        self.init(wrapped)
+
+        return self
+
+    cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+        return self.wrapped
+
+    @property
+    def type(self):
+        return pyarrow_wrap_data_type(self.wrapped.get().type)

-        self.type = null()
+    @property
+    def is_valid(self):
+        return self.wrapped.get().is_valid

     def __repr__(self):
-        return 'NULL'
+        return '<pyarrow.{}: {!r}>'.format(
+            self.__class__.__name__, self.as_py()
+        )

-    def as_py(self):
-        """
-        Return None
-        """
-        return None
+    def __str__(self):
+        return str(self.as_py())
+
+    def equals(self, Scalar other):
+        return self.wrapped.get().Equals(other.unwrap().get()[0])

     def __eq__(self, other):
-        return NA
+        try:
+            if not isinstance(other, Scalar):
+                other = scalar(other, type=self.type)
+            return self.equals(other)
+        except (TypeError, ValueError, ArrowInvalid):
+            return NotImplemented
+
+    def __hash__(self):
+        cdef CScalarHash hasher
+        return hasher(self.wrapped)
+
+    def as_py(self):
+        raise NotImplementedError()

-_NULL = NA = NullType()
+_NULL = NA = None

-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
     """
-    The base class for non-null array elements.
+    Concrete class for null scalars.
     """

-    def __init__(self):
-        raise TypeError("Do not call {}'s constructor directly, use array "
-                        "subscription instead."
-                        .format(self.__class__.__name__))
+    def __cinit__(self):
+        global NA
+        if NA is not None:
+            raise Exception('Cannot create multiple NAType instances')
+        self.init(shared_ptr[CScalar](new CNullScalar()))

-    cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-                   int64_t index):
-        self.type = type
-        self.index = index
-        self._set_array(sp_array)
+    def __init__(self):
+        pass

-    cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-        self.sp_array = sp_array
+    def __eq__(self, other):
+        return NA

Review comment: Updated to return False.
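The review exchange above concerns two things: the `__cinit__` guard that makes the null scalar a singleton, and `__eq__` returning the NA sentinel versus a plain `False`. A minimal pure-Python model of both (illustrative only, not the Cython implementation):

```python
class NullScalar:
    """Singleton null scalar: only one instance may ever be created."""
    _instance = None

    def __new__(cls):
        # Mirrors the __cinit__ guard in the quoted diff.
        if cls._instance is not None:
            raise RuntimeError("Cannot create multiple NullScalar instances")
        cls._instance = super().__new__(cls)
        return cls._instance

    def __eq__(self, other):
        # Per the review outcome: return a plain False rather than
        # propagating a null sentinel out of the comparison.
        return False

    def __hash__(self):
        return 0

    def as_py(self):
        return None


NA = NullScalar()
print(NA == "something")  # → False
```

Returning `False` keeps Python's equality protocol well-behaved (a bool, usable in `if` tests), whereas returning the NA sentinel from `__eq__` leaks null semantics into ordinary Python comparisons.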