[GitHub] [arrow] cyb70289 commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


cyb70289 commented on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652237692


   The CI failure reproduces [[Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build](https://issues.apache.org/jira/browse/ARROW-8999)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448166294



##
File path: cpp/src/arrow/ipc/reader.cc
##
@@ -684,7 +685,19 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo,
     return Status::Invalid("Dictionary record batch must only contain one field");
   }
   auto dictionary = batch->column(0);
-  return dictionary_memo->AddDictionary(id, dictionary);
+  if (dictionary_batch->isDelta()) {
+    std::shared_ptr<Array> originalDict, combinedDict;
+    RETURN_NOT_OK(dictionary_memo->GetDictionary(id, &originalDict));
+    ArrayVector dictsToCombine{originalDict, dictionary};
+    ARROW_ASSIGN_OR_RAISE(combinedDict, Concatenate(dictsToCombine, options.memory_pool));

Review comment:
   Good suggestion. I have extracted it into the `AddDictionaryDelta` method.









[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448166650



##
File path: cpp/src/arrow/ipc/reader.cc
##
@@ -684,7 +685,19 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo,
     return Status::Invalid("Dictionary record batch must only contain one field");
   }
   auto dictionary = batch->column(0);
-  return dictionary_memo->AddDictionary(id, dictionary);
+  if (dictionary_batch->isDelta()) {
+    std::shared_ptr<Array> originalDict, combinedDict;
+    RETURN_NOT_OK(dictionary_memo->GetDictionary(id, &originalDict));
+    ArrayVector dictsToCombine{originalDict, dictionary};
+    ARROW_ASSIGN_OR_RAISE(combinedDict, Concatenate(dictsToCombine, options.memory_pool));
+    return dictionary_memo->UpdateDictionary(id, combinedDict);
+  }
+
+  if (dictionary_memo->HasDictionary(id)) {

Review comment:
   Thanks for the good suggestion. I have added the AddOrReplaceDictionary 
method. 
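
   The review thread above concerns how the IPC reader should merge delta dictionaries and replace existing ones. A rough pure-Python model of the `AddDictionaryDelta` / `AddOrReplaceDictionary` semantics (the method names come from the comments above; this toy container is illustrative, not Arrow's actual implementation):

```python
class DictionaryMemo:
    """Toy model of the IPC reader's per-id dictionary bookkeeping."""

    def __init__(self):
        self._dicts = {}  # dictionary id -> list of dictionary values

    def add_or_replace(self, dict_id, values):
        # A non-delta dictionary batch replaces any prior dictionary.
        self._dicts[dict_id] = list(values)

    def add_delta(self, dict_id, delta):
        # A delta batch appends to the existing dictionary for this id.
        self._dicts[dict_id] = self._dicts[dict_id] + list(delta)

    def get(self, dict_id):
        return self._dicts[dict_id]


memo = DictionaryMemo()
memo.add_or_replace(0, ["foo", "bar"])
memo.add_delta(0, ["baz"])        # delta: combined dictionary grows
print(memo.get(0))                # ['foo', 'bar', 'baz']
memo.add_or_replace(0, ["qux"])   # replacement: old entries discarded
print(memo.get(0))                # ['qux']
```

   In the C++ under review the append step is the `Concatenate` call quoted in the diff; here it is plain list concatenation.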









[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448167615



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -1228,6 +1228,152 @@ TEST_P(TestFileFormat, RoundTrip) {
   TestZeroLengthRoundTrip(*GetParam(), options);
 }
 
+Status MakeDictionaryBatch(std::shared_ptr<RecordBatch>* out) {
+  const int64_t length = 6;
+
+  std::vector<bool> is_valid = {true, true, false, true, true, true};
+
+  auto dict_ty = utf8();
+
+  auto dict = ArrayFromJSON(dict_ty, "[\"foo\", \"bar\", \"baz\"]");
+
+  auto f0_type = arrow::dictionary(arrow::int32(), dict_ty);
+  auto f1_type = arrow::dictionary(arrow::int8(), dict_ty);
+
+  std::shared_ptr<Array> indices0, indices1;
+  std::vector<int32_t> indices0_values = {1, 2, -1, 0, 2, 0};
+  std::vector<int8_t> indices1_values = {0, 0, 2, 2, 1, 1};
+
+  ArrayFromVector<Int32Type, int32_t>(is_valid, indices0_values, &indices0);
+  ArrayFromVector<Int8Type, int8_t>(is_valid, indices1_values, &indices1);
+
+  auto a0 = std::make_shared<DictionaryArray>(f0_type, indices0, dict);
+  auto a1 = std::make_shared<DictionaryArray>(f1_type, indices1, dict);
+
+  // construct batch
+  auto schema = ::arrow::schema({field("dict1", f0_type), field("dict2", f1_type)});
+
+  *out = RecordBatch::Make(schema, length, {a0, a1});
+  return Status::OK();
+}
+
+// A record batch writer implementation that supports manually specifying dictionaries.
+class TestRecordBatchWriter : public RecordBatchWriter {

Review comment:
   You are right. This is not a RecordBatchWriter, as it performs both reads and writes.
   I have renamed it to DictionaryBatchHelper and made it no longer inherit from RecordBatchWriter.
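
   The quoted test builds dictionary-encoded columns: int32/int8 index arrays into a `["foo", "bar", "baz"]` dictionary, plus a validity mask. The encoding itself can be modeled in a few lines of plain Python (an illustrative sketch, not Arrow code):

```python
def decode_dictionary(indices, is_valid, dictionary):
    """Resolve dictionary indices to values; slots with is_valid=False are null."""
    return [dictionary[i] if valid else None
            for i, valid in zip(indices, is_valid)]


dictionary = ["foo", "bar", "baz"]
is_valid = [True, True, False, True, True, True]
indices0 = [1, 2, -1, 0, 2, 0]  # the -1 sits behind a null slot, so it is never read

print(decode_dictionary(indices0, is_valid, dictionary))
# ['bar', 'baz', None, 'foo', 'baz', 'foo']
```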









[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448168519



##
File path: cpp/src/arrow/ipc/metadata_internal.h
##
@@ -198,7 +198,7 @@ Status WriteDictionaryMessage(
 const int64_t id, const int64_t length, const int64_t body_length,
 const std::shared_ptr<const KeyValueMetadata>& custom_metadata,
 const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers,
-const IpcWriteOptions& options, std::shared_ptr<Buffer>* out);
+const IpcWriteOptions& options, std::shared_ptr<Buffer>* out, bool isDelta);

Review comment:
   Sounds reasonable. I have revised the code to place it after the `id` argument. (I have also revised the other APIs accordingly, to keep them consistent.)









[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448168067



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -1228,6 +1228,152 @@ TEST_P(TestFileFormat, RoundTrip) {
   TestZeroLengthRoundTrip(*GetParam(), options);
 }
 
+Status MakeDictionaryBatch(std::shared_ptr<RecordBatch>* out) {
+  const int64_t length = 6;
+
+  std::vector<bool> is_valid = {true, true, false, true, true, true};
+
+  auto dict_ty = utf8();
+
+  auto dict = ArrayFromJSON(dict_ty, "[\"foo\", \"bar\", \"baz\"]");
+
+  auto f0_type = arrow::dictionary(arrow::int32(), dict_ty);
+  auto f1_type = arrow::dictionary(arrow::int8(), dict_ty);
+
+  std::shared_ptr<Array> indices0, indices1;
+  std::vector<int32_t> indices0_values = {1, 2, -1, 0, 2, 0};
+  std::vector<int8_t> indices1_values = {0, 0, 2, 2, 1, 1};
+
+  ArrayFromVector<Int32Type, int32_t>(is_valid, indices0_values, &indices0);
+  ArrayFromVector<Int8Type, int8_t>(is_valid, indices1_values, &indices1);

Review comment:
   Thanks for the good suggestion.
   I have revised the code to use ArrayFromJSON, and it looks simpler now.









[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448168805



##
File path: cpp/src/arrow/ipc/writer.h
##
@@ -341,6 +343,29 @@ class ARROW_EXPORT IpcPayloadWriter {
   virtual Status Close() = 0;
 };
 
+/// Create a new IPC payload stream writer from stream sink. User is
+/// responsible for closing the actual OutputStream.
+///
+/// \param[in] sink output stream to write to
+/// \param[in] options options for serialization
+/// \return Result<std::unique_ptr<IpcPayloadWriter>>
+ARROW_EXPORT
+Result<std::unique_ptr<IpcPayloadWriter>> NewPayloadStreamWriter(
+    io::OutputStream* sink, const IpcWriteOptions& options = IpcWriteOptions::Defaults());
+
+/// Create a new IPC payload file writer from stream sink.
+///
+/// \param[in] sink output stream to write to
+/// \param[in] schema the schema of the record batches to be written
+/// \param[in] options options for serialization, optional
+/// \param[in] metadata custom metadata for File Footer, optional
+/// \return Result<std::unique_ptr<IpcPayloadWriter>>
+ARROW_EXPORT
+Result<std::unique_ptr<IpcPayloadWriter>> NewPayloadFileWriter(

Review comment:
   Revised. Thank you. 









[GitHub] [arrow] liyafan82 commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r448168706



##
File path: cpp/src/arrow/ipc/writer.h
##
@@ -341,6 +343,29 @@ class ARROW_EXPORT IpcPayloadWriter {
   virtual Status Close() = 0;
 };
 
+/// Create a new IPC payload stream writer from stream sink. User is
+/// responsible for closing the actual OutputStream.
+///
+/// \param[in] sink output stream to write to
+/// \param[in] options options for serialization
+/// \return Result<std::unique_ptr<IpcPayloadWriter>>
+ARROW_EXPORT
+Result<std::unique_ptr<IpcPayloadWriter>> NewPayloadStreamWriter(

Review comment:
   Revised. Thank you. 









[GitHub] [arrow] xhochy commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


xhochy commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652244376


   > You match a regex, you don't contain it.
   
   That is one of the name clashes; we already have a `match` kernel.







[GitHub] [arrow] pitrou commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


pitrou commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652244689


   `match_regex` then? :-)







[GitHub] [arrow] jorisvandenbossche commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


jorisvandenbossche commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-652245303


   > the null equality tests look like a nuisance for regular Python usage (`__eq__` should return a boolean)
   
   The reason that those scalars return null on an equality check, and not True/False, is to ensure consistent behaviour between arrays and scalars (so that `(arr1 == arr2)[0]` and `arr1[0] == arr2[0]` give the same answer).
   Our element-wise equality propagates null, so IMO the scalar should too.
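
   The consistency argument can be modeled in plain Python, with `None` standing in for Arrow's null (a sketch of the semantics only, not pyarrow's implementation):

```python
def elementwise_eq(arr1, arr2):
    """Element-wise equality that propagates null, like Arrow's equal kernel."""
    return [None if a is None or b is None else a == b
            for a, b in zip(arr1, arr2)]


def scalar_eq(a, b):
    """Scalar equality with the same null-propagation rule."""
    return None if a is None or b is None else a == b


arr1 = [1, None, 3]
arr2 = [1, 2, 4]
# (arr1 == arr2)[i] and arr1[i] == arr2[i] agree for every i:
assert all(elementwise_eq(arr1, arr2)[i] == scalar_eq(arr1[i], arr2[i])
           for i in range(3))
print(elementwise_eq(arr1, arr2))  # [True, None, False]
```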







[GitHub] [arrow] pitrou commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


pitrou commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-652247947


   The problem is that it breaks Python semantics in potentially annoying 
places:
   ```python
   >>> import pyarrow as pa
   >>> na = pa.scalar(None)
   >>> na in [5]
   True
   ```
   







[GitHub] [arrow] jorisvandenbossche commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


jorisvandenbossche commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-652249133


   We could "fix" that one by raising in `__bool__` (meaning it will at least give an error instead of silently returning a wrong answer)
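
   Both the containment surprise and the proposed `__bool__` fix can be reproduced with a minimal stand-in for a null scalar (an illustrative class, not pyarrow's):

```python
class NullScalar:
    """Minimal model of a scalar whose __eq__ propagates null."""

    def __eq__(self, other):
        return NullScalar()  # null == anything -> null, never plain False

    def __bool__(self):
        # `na in [5]` evaluates bool(na == 5); without this method the
        # returned NullScalar instance is truthy by default, so containment
        # silently answers True. Raising turns that into an explicit error.
        raise TypeError("null scalar has no truth value")


na = NullScalar()
try:
    na in [5]  # membership testing calls __eq__, then bool(...)
except TypeError as exc:
    print("raised:", exc)
```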







[GitHub] [arrow] maartenbreddels commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


maartenbreddels commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652254096


   I like the prefixing by `string`. I'm a big fan of ordering 'words' in snake 
or camel casing for good tab completion and alphabetic ordering, so I agree 
with @wesm 's proposal. As a user, I first think of the type I work with 
('string'), next, what I want to do with it (upper/lower casing). 
Tab-completion/ordering would then reveal 2 variants, very intuitive.
   To make this concrete:
* string_lower_utf8
* string_lower_ascii
* binary_contains_exact
* string_contains_regex_ascii
* string_contains_regex_utf8
   
   (or s/contains/match 😄 )
   
   There might be kernels that work on binary data but do not work well with utf8, e.g. a `binary_slice`, which would slice each binary string on a byte basis. It could cut a multibyte-encoded codepoint, resulting in an invalid utf8 string. I'd say that's acceptable; we could check for that depending on the type we get.
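   
   The byte-slicing hazard is easy to demonstrate in plain Python (the `binary_slice` kernel above is hypothetical; this just shows what a byte-wise cut does to one value):

```python
text = "é"                   # a single codepoint, two bytes in UTF-8
data = text.encode("utf-8")  # b'\xc3\xa9'
sliced = data[:1]            # byte-wise slice cuts the codepoint in half
try:
    sliced.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid utf8:", exc)
```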
   
   Regarding case-insensitive variants, I think we should expose functions to do normalization (e.g. replace the single codepoint ë by the letter e and the diaeresis combining character, the dots above the ë), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics.
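   
   The normalization step described above can be sketched with the standard library's `unicodedata` module (plain-Python NFD decomposition plus combining-mark removal, not an Arrow kernel):

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop combining characters (e.g. the diaeresis)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("ë"))      # e
print(strip_diacritics("crème"))  # creme
```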
   










[GitHub] [arrow] pitrou commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


pitrou commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-652256345


   Yes, we could. That may have other annoying implications, though (such as 
`__contains__` not working anymore). I've started a ML discussion.







[GitHub] [arrow] pitrou closed pull request #7594: ARROW-7654: [Python] Ability to set column_types to a Schema in csv.ConvertOptions is undocumented

2020-07-01 Thread GitBox


pitrou closed pull request #7594:
URL: https://github.com/apache/arrow/pull/7594


   







[GitHub] [arrow] emkornfield commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

2020-07-01 Thread GitBox


emkornfield commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652259406


   CC @wesm @BryanCutler 
   
   Not sure if doing this correctly will break Spark in unintentional ways.







[GitHub] [arrow] emkornfield opened a new pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

2020-07-01 Thread GitBox


emkornfield opened a new pull request #7604:
URL: https://github.com/apache/arrow/pull/7604


   - Ports string_to_timezone to C++
   - Causes nested timestamp columns within structs to use the conversion-to-object path.
   - Copies the timezone on the to_object path.
   
   Open to other suggestions on how to solve this.
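
   For illustration, parsing a fixed-offset timezone string of the form `+HH:MM` / `-HH:MM`, one of the forms such a helper has to handle (this sketch is not the C++ port itself, and it ignores named zones like `America/New_York`):

```python
from datetime import timedelta, timezone


def parse_fixed_offset(tz_string):
    """Parse '+HH:MM' / '-HH:MM' into a datetime.timezone."""
    sign = -1 if tz_string.startswith("-") else 1
    hours, minutes = tz_string[1:].split(":")
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    return timezone(sign * offset)


print(parse_fixed_offset("+08:00"))  # UTC+08:00
```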







[GitHub] [arrow] pitrou commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


pitrou commented on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652260499


   You can't compute latency like that if the test is multithreaded.







[GitHub] [arrow] kou commented on a change in pull request #7589: ARROW-9276: [Release] Enforce CUDA device for updating the api documentations

2020-07-01 Thread GitBox


kou commented on a change in pull request #7589:
URL: https://github.com/apache/arrow/pull/7589#discussion_r448188490



##
File path: dev/release/post-09-docs.sh
##
@@ -42,20 +47,20 @@ popd
 pushd "${ARROW_DIR}"
 git checkout "${release_tag}"

Review comment:
   FYI: We don't need a CUDA machine if we don't need to run CUDA-enabled programs. We're building CUDA-enabled deb packages on a non-CUDA machine.









[GitHub] [arrow] github-actions[bot] commented on pull request #7604: ARROW-9223: [Python] Propagate Timezone information in pandas conversion

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#issuecomment-652262316


   https://issues.apache.org/jira/browse/ARROW-9223







[GitHub] [arrow] maartenbreddels commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


maartenbreddels commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652263794


   While we're at the difficult topic of naming, is there a convention (agreed or emerging) for naming the functors/ArrayKernelExec implementations? I see in `scalar_string.cc` we use Transform, which is probably redundant (as I understand it, scalar kernels basically map/transform elementwise). But I think it would be useful to include the type name, e.g.
* Utf8ToUtf8 (which is now Utf8Transform) (hence `Utf8Upper : Utf8ToUtf8>`)
* Utf8ToBool / Utf8Predicate (which is now StringBoolTransform)
* Utf8ToScalar
* BinaryToBool







[GitHub] [arrow] cyb70289 commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


cyb70289 commented on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652289408


   > You can't compute latency like that if the test is multithreaded.
   
   Assume two threads: one finishes `n1` IOs in `t1` time, another finishes `n2` IOs in `t2` time. So the average latency should be `(t1/n1 + t2/n2) / 2`, not `(t1+t2) / (n1+n2)`?
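
   The arithmetic in question, with concrete numbers (the thread timings are hypothetical; the point is that the two formulas disagree as soon as per-thread rates differ):

```python
# Two threads: thread 1 finishes n1 requests in t1 seconds, thread 2 n2 in t2.
t1, n1 = 1.0, 100  # fast thread: 10 ms per request
t2, n2 = 1.0, 10   # slow thread: 100 ms per request

per_thread_mean = (t1 / n1 + t2 / n2) / 2  # average of per-thread latencies
aggregate = (t1 + t2) / (n1 + n2)          # total time / total requests

print(per_thread_mean)  # ~0.055 s
print(aggregate)        # ~0.018 s -- weighted toward the fast thread's many requests
```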







[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-07-01 Thread GitBox


rymurr commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-652291140


   Thanks a lot @liyafan82 I have addressed your suggestions and rebased







[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448222516



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java
##
@@ -73,6 +87,28 @@ public static void setBit(ArrowBuf validityBuffer, int 
index) {
 validityBuffer.setByte(byteIndex, currentByte);
   }
 
+  /**
+   * Set the bit at provided index to 1.
+   *
+   * @param validityBuffer validity buffer of the vector
+   * @param index index to be set
+   */
+  public static void setBit(ArrowBuf validityBuffer, long index) {
+// it can be observed that some logic is duplicate of the logic in 
setValidityBit.

Review comment:
   done









[GitHub] [arrow] tobim commented on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow

2020-07-01 Thread GitBox


tobim commented on pull request #7315:
URL: https://github.com/apache/arrow/pull/7315#issuecomment-652292544


   > CMake 3.9 is a bit problematic since we've tried to support CMake >= 3.2 
and definitely >= 3.7
   
   In that case the other approach should be pursued.







[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448224180



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.complex;
+
+import static java.util.Collections.singletonList;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt;
+import static org.apache.arrow.util.Preconditions.checkNotNull;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.memory.util.ArrowBufPointer;
+import org.apache.arrow.memory.util.ByteFunctionHelpers;
+import org.apache.arrow.memory.util.CommonUtil;
+import org.apache.arrow.memory.util.hash.ArrowBufHasher;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.AddOrGetResult;
+import org.apache.arrow.vector.BaseFixedWidthVector;
+import org.apache.arrow.vector.BaseValueVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.BitVectorHelper;
+import org.apache.arrow.vector.BufferBacked;
+import org.apache.arrow.vector.DensityAwareVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.NullVector;
+import org.apache.arrow.vector.UInt4Vector;
+import org.apache.arrow.vector.ValueVector;
+import org.apache.arrow.vector.ZeroVector;
+import org.apache.arrow.vector.compare.VectorVisitor;
+import org.apache.arrow.vector.complex.impl.ComplexCopier;
+import org.apache.arrow.vector.complex.impl.UnionLargeListReader;
+import org.apache.arrow.vector.complex.impl.UnionLargeListWriter;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
+import org.apache.arrow.vector.types.Types.MinorType;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.Field;
+import org.apache.arrow.vector.types.pojo.FieldType;
+import org.apache.arrow.vector.util.CallBack;
+import org.apache.arrow.vector.util.JsonStringArrayList;
+import org.apache.arrow.vector.util.OversizedAllocationException;
+import org.apache.arrow.vector.util.SchemaChangeRuntimeException;
+import org.apache.arrow.vector.util.TransferPair;
+
+/**
+ * A list vector contains lists of a specific type of elements.  Its structure contains 3 elements.
+ * <ul>
+ *   <li>A validity buffer.</li>
+ *   <li>An offset buffer, that denotes lists boundaries.</li>
+ *   <li>A child data vector that contains the elements of lists.</li>
+ * </ul>
+ *
+ * This is the LargeList variant of list, it has a 64-bit wide offset.
+ *
+ * <p>
+ *   todo review checkedCastToInt usage in this class.
+ *   Once int64 indexed vectors are supported these checks aren't needed.
+ * </p>
+ */
+public class LargeListVector extends BaseValueVector implements 
RepeatedValueVector, BaseListVector, PromotableVector {
+
+  public static LargeListVector empty(String name, BufferAllocator allocator) {
+return new LargeListVector(name, allocator, 
FieldType.nullable(ArrowType.LargeList.INSTANCE), null);
+  }
+
+  public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE;
+  public static final String DATA_VECTOR_NAME = "$data$";
+
+  public static final byte OFFSET_WIDTH = 8;
+  protected ArrowBuf offsetBuffer;
+  protected FieldVector vector;
+  protected final CallBack callBack;
+  protected int valueCount;
+  protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * 
OFFSET_WIDTH;
+  private final String name;
+
+  protected String defaultDataVectorName = DATA_VECTOR_NAME;
+  protected ArrowBuf validityBuffer;
+  protected UnionLargeListReader reader;
+  private final FieldType fieldType;
+  private int validityAllocationSizeInBytes;
+
+  /**
+   * The maximum index that is actually set.
+   */
+  private long lastSet;
+
+  /**
+   * Constructs a new instance.
+   *
+   * @param name The name of 

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448223616



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java
##
@@ -37,6 +37,20 @@
 
   private BitVectorHelper() {}
 
+  /**
+   * Get the index of byte corresponding to bit index in validity buffer.
+   */
+  public static long byteIndex(long absoluteBitIndex) {
+return absoluteBitIndex >> 3;
+  }
+
+  /**
+   * Get the relative index of bit within the byte in validity buffer.
+   */
+  public static long bitIndex(long absoluteBitIndex) {

Review comment:
   done

##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java
##
@@ -57,12 +71,12 @@ public static int bitIndex(int absoluteBitIndex) {
* @param validityBuffer validity buffer of the vector
* @param index index to be set
*/
-  public static void setBit(ArrowBuf validityBuffer, int index) {
+  public static void setBit(ArrowBuf validityBuffer, long index) {
 // it can be observed that some logic is duplicate of the logic in 
setValidityBit.
 // this is because JIT cannot always remove the if branch in 
setValidityBit,
 // so we give a dedicated implementation for setting bits.
-final int byteIndex = byteIndex(index);
-final int bitIndex = bitIndex(index);
+final long byteIndex = byteIndex(index);
+final long bitIndex = bitIndex(index);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
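For context, the `byteIndex`/`bitIndex` helpers under review compute a bit's byte position (`index >> 3`) and its offset within that byte (`index & 7`). A minimal standalone sketch of that arithmetic, assuming a plain `byte[]` as a stand-in for Arrow's `ArrowBuf` (the `ValidityBits` class and its method bodies are illustrative, not the library's code):

```java
// Standalone sketch of validity-bit arithmetic: byteIndex locates the byte
// holding a bit, bitIndex gives the bit's position within that byte.
public class ValidityBits {
  // byte containing absolute bit index i (equivalent to i / 8)
  static long byteIndex(long absoluteBitIndex) {
    return absoluteBitIndex >> 3;
  }

  // position of bit i within its byte (equivalent to i % 8)
  static long bitIndex(long absoluteBitIndex) {
    return absoluteBitIndex & 7;
  }

  // set bit i in a plain byte[] standing in for the validity buffer
  static void setBit(byte[] validity, long index) {
    validity[(int) byteIndex(index)] |= (byte) (1 << bitIndex(index));
  }

  static boolean isSet(byte[] validity, long index) {
    return (validity[(int) byteIndex(index)] >> bitIndex(index) & 1) == 1;
  }

  public static void main(String[] args) {
    byte[] validity = new byte[2]; // room for 16 validity bits
    setBit(validity, 13);          // lands in byte 1, bit 5
    if (byteIndex(13) != 1 || bitIndex(13) != 5) throw new AssertionError();
    if (!isSet(validity, 13) || isSet(validity, 12)) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The long-typed indices mirror the change in this PR: widening `int` to `long` lets the same helpers address validity buffers larger than 2 GiB.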




[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448225368



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.complex;
+
+import static java.util.Collections.singletonList;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt;
+import static org.apache.arrow.util.Preconditions.checkNotNull;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.memory.util.ArrowBufPointer;
+import org.apache.arrow.memory.util.ByteFunctionHelpers;
+import org.apache.arrow.memory.util.CommonUtil;
+import org.apache.arrow.memory.util.hash.ArrowBufHasher;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.AddOrGetResult;
+import org.apache.arrow.vector.BaseFixedWidthVector;
+import org.apache.arrow.vector.BaseValueVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.BitVectorHelper;
+import org.apache.arrow.vector.BufferBacked;
+import org.apache.arrow.vector.DensityAwareVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.NullVector;
+import org.apache.arrow.vector.UInt4Vector;
+import org.apache.arrow.vector.ValueVector;
+import org.apache.arrow.vector.ZeroVector;
+import org.apache.arrow.vector.compare.VectorVisitor;
+import org.apache.arrow.vector.complex.impl.ComplexCopier;
+import org.apache.arrow.vector.complex.impl.UnionLargeListReader;
+import org.apache.arrow.vector.complex.impl.UnionLargeListWriter;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
+import org.apache.arrow.vector.types.Types.MinorType;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.Field;
+import org.apache.arrow.vector.types.pojo.FieldType;
+import org.apache.arrow.vector.util.CallBack;
+import org.apache.arrow.vector.util.JsonStringArrayList;
+import org.apache.arrow.vector.util.OversizedAllocationException;
+import org.apache.arrow.vector.util.SchemaChangeRuntimeException;
+import org.apache.arrow.vector.util.TransferPair;
+
+/**
+ * A list vector contains lists of a specific type of elements.  Its structure contains 3 elements.
+ * <ol>
+ * <li>A validity buffer.</li>
+ * <li>An offset buffer, that denotes list boundaries.</li>
+ * <li>A child data vector that contains the elements of lists.</li>
+ * </ol>
+ *
+ * <p>This is the LargeList variant of list; it has a 64-bit wide offset.</p>
+ *
+ * <p>
+ *   TODO: review checkedCastToInt usage in this class.
+ *   Once int64-indexed vectors are supported these checks aren't needed.
+ * </p>
+ */
+public class LargeListVector extends BaseValueVector implements 
RepeatedValueVector, BaseListVector, PromotableVector {
+
+  public static LargeListVector empty(String name, BufferAllocator allocator) {
+return new LargeListVector(name, allocator, 
FieldType.nullable(ArrowType.LargeList.INSTANCE), null);
+  }
+
+  public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE;
+  public static final String DATA_VECTOR_NAME = "$data$";
+
+  public static final byte OFFSET_WIDTH = 8;
+  protected ArrowBuf offsetBuffer;
+  protected FieldVector vector;
+  protected final CallBack callBack;
+  protected int valueCount;
+  protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * 
OFFSET_WIDTH;
+  private final String name;
+
+  protected String defaultDataVectorName = DATA_VECTOR_NAME;
+  protected ArrowBuf validityBuffer;
+  protected UnionLargeListReader reader;
+  private final FieldType fieldType;
+  private int validityAllocationSizeInBytes;
+
+  /**
+   * The maximum index that is actually set.
+   */
+  private long lastSet;
+
+  /**
+   * Constructs a new instance.
+   *
+   * @param name The name of 
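The quoted javadoc describes the three-buffer layout of a large list vector: a validity buffer, a 64-bit offset buffer, and a flattened child data vector. A plain-array sketch of how that layout resolves list `i` — `LargeListModel`, its fields, and `readList` are hypothetical names for illustration, not part of the Arrow API:

```java
import java.util.Arrays;

// Plain-array model of the LargeList layout: list i is valid when its
// validity bit is set, and its elements are child[offsets[i] .. offsets[i+1]).
public class LargeListModel {
  final byte[] validity;  // 1 bit per list
  final long[] offsets;   // length = listCount + 1, 64-bit as in LargeList
  final int[] child;      // flattened element values

  LargeListModel(byte[] validity, long[] offsets, int[] child) {
    this.validity = validity;
    this.offsets = offsets;
    this.child = child;
  }

  boolean isNull(int index) {
    return (validity[index >> 3] >> (index & 7) & 1) == 0;
  }

  // returns null for a null list, else a copy of the list's child values
  int[] readList(int index) {
    if (isNull(index)) {
      return null;
    }
    return Arrays.copyOfRange(child, (int) offsets[index], (int) offsets[index + 1]);
  }

  public static void main(String[] args) {
    // three lists: [1, 2], null, [3]
    byte[] validity = {0b101};      // bits 0 and 2 set
    long[] offsets = {0, 2, 2, 3};  // the null list occupies no child slots
    int[] child = {1, 2, 3};
    LargeListModel v = new LargeListModel(validity, offsets, child);
    System.out.println(Arrays.toString(v.readList(0)) + " " + v.readList(1)
        + " " + Arrays.toString(v.readList(2)));  // [1, 2] null [3]
  }
}
```

Note how a null list still consumes an offset slot (`offsets[1] == offsets[2]`), which is why the offset buffer has `listCount + 1` entries; the real vector reads these values from `ArrowBuf` rather than a Java array.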

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448225742



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448226183



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448226339



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448226970



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448226505



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.complex;
+
+import static java.util.Collections.singletonList;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt;
+import static org.apache.arrow.util.Preconditions.checkNotNull;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.memory.util.ArrowBufPointer;
+import org.apache.arrow.memory.util.ByteFunctionHelpers;
+import org.apache.arrow.memory.util.CommonUtil;
+import org.apache.arrow.memory.util.hash.ArrowBufHasher;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.AddOrGetResult;
+import org.apache.arrow.vector.BaseFixedWidthVector;
+import org.apache.arrow.vector.BaseValueVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.BitVectorHelper;
+import org.apache.arrow.vector.BufferBacked;
+import org.apache.arrow.vector.DensityAwareVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.NullVector;
+import org.apache.arrow.vector.UInt4Vector;
+import org.apache.arrow.vector.ValueVector;
+import org.apache.arrow.vector.ZeroVector;
+import org.apache.arrow.vector.compare.VectorVisitor;
+import org.apache.arrow.vector.complex.impl.ComplexCopier;
+import org.apache.arrow.vector.complex.impl.UnionLargeListReader;
+import org.apache.arrow.vector.complex.impl.UnionLargeListWriter;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
+import org.apache.arrow.vector.types.Types.MinorType;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.Field;
+import org.apache.arrow.vector.types.pojo.FieldType;
+import org.apache.arrow.vector.util.CallBack;
+import org.apache.arrow.vector.util.JsonStringArrayList;
+import org.apache.arrow.vector.util.OversizedAllocationException;
+import org.apache.arrow.vector.util.SchemaChangeRuntimeException;
+import org.apache.arrow.vector.util.TransferPair;
+
+/**
+ * A list vector contains lists of a specific type of elements.  Its structure 
contains 3 elements.
+ * 
+ * A validity buffer.
+ *  An offset buffer, that denotes lists boundaries. 
+ *  A child data vector that contains the elements of lists. 
+ * 
+ *
+ * This is the LargeList variant of list, it has a 64-bit wide offset
+ *
+ * 
+ *   todo review checkedCastToInt usage in this class.
+ *   Once int64 indexed vectors are supported these checks aren't needed.
+ * 
+ */
+public class LargeListVector extends BaseValueVector implements 
RepeatedValueVector, BaseListVector, PromotableVector {
+
+  public static LargeListVector empty(String name, BufferAllocator allocator) {
+return new LargeListVector(name, allocator, 
FieldType.nullable(ArrowType.LargeList.INSTANCE), null);
+  }
+
+  public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE;
+  public static final String DATA_VECTOR_NAME = "$data$";
+
+  public static final byte OFFSET_WIDTH = 8;
+  protected ArrowBuf offsetBuffer;
+  protected FieldVector vector;
+  protected final CallBack callBack;
+  protected int valueCount;
+  protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * 
OFFSET_WIDTH;
+  private final String name;
+
+  protected String defaultDataVectorName = DATA_VECTOR_NAME;
+  protected ArrowBuf validityBuffer;
+  protected UnionLargeListReader reader;
+  private final FieldType fieldType;
+  private int validityAllocationSizeInBytes;
+
+  /**
+   * The maximum index that is actually set.
+   */
+  private long lastSet;
+
+  /**
+   * Constructs a new instance.
+   *
+   * @param name The name of 
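(Not part of the patch.) A minimal Python sketch, with made-up data, of the large-list layout the javadoc above describes: a validity buffer, a 64-bit offset buffer marking list boundaries, and a child data vector holding the elements.

```python
# Illustrative only, not Arrow's API. List i spans child elements
# [offsets[i], offsets[i + 1]); a cleared validity bit marks list i as null.
# In the LargeList variant each offset is a 64-bit integer (OFFSET_WIDTH = 8).
validity = [1, 0, 1]            # list 1 is null
offsets = [0, 3, 3, 5]          # null list 1 is zero-length: offsets 3..3
child = [10, 20, 30, 40, 50]    # flattened elements of all lists

def get_list(i):
    """Return list i, or None if its validity bit is cleared."""
    if not validity[i]:
        return None
    return child[offsets[i]:offsets[i + 1]]

print([get_list(i) for i in range(3)])  # [[10, 20, 30], None, [40, 50]]
```

The same scheme is what makes a 64-bit offset width necessary once the flattened child data can exceed 2^31 elements.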

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448227452



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448227944



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448227771



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-07-01 Thread GitBox


mrkn commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-652298958


   I wrote a benchmark that measures the performance of conversion from Tensor to SparseTensor, and ran it with `--repetitions=10`. The results:
   
   - Converting to SparseCOOTensor is 1.4x-1.8x slower than the original.
   - Converting to SparseCSRMatrix and SparseCSCMatrix is 1.0x-1.2x faster than the original.
   - Converting to SparseCSFTensor is 1.0x-1.3x slower than the original.
   
   I don't think this result, especially the SparseCOOTensor case, is acceptable. Now I'm trying to resolve this performance regression.
   
   
   The full result is shown below:
   
   | Format | IndexType | ValueType | Change % | Baseline | Contender |
   | ------ | --------- | --------- | -------- | -------- | --------- |
   | COO | Int32Type | Int8 | 56.426 | 2459.397090 | 3847.138775 |
   | COO | Int16Type | Int8 | 43.062 | 2619.848028 | 3748.014603 |
   | COO | Int64Type | Int8 | 40.137 | 2850.826136 | 3995.057821 |
   | COO | Int8Type | Int8 | 35.478 | 2776.716694 | 3761.850287 |
   | COO | Int16Type | Int16 | 66.812 | 2394.011928 | 3993.489611 |
   | COO | Int32Type | Int16 | 59.281 | 2728.577337 | 4346.096797 |
   | COO | Int64Type | Int16 | 52.043 | 2596.404005 | 3947.663313 |
   | COO | Int8Type | Int16 | 44.510 | 2768.567992 | 4000.853968 |
   | COO | Int32Type | Float | 93.283 | 2287.343291 | 4421.039474 |
   | COO | Int16Type | Float | 71.100 | 2577.689569 | 4410.437858 |
   | COO | Int64Type | Float | 61.035 | 2821.300996 | 4543.277655 |
   | COO | Int8Type | Float | 53.303 | 2666.801559 | 4088.291362 |
   | COO | Int64Type | Double | 88.230 | 2616.000738 | 4924.096764 |
   | COO | Int16Type | Double | 70.188 | 2698.335495 | 4592.253023 |
   | COO | Int32Type | Double | 69.368 | 2579.753902 | 4369.267270 |
   | COO | Int8Type | Double | 62.781 | 2747.179806 | 4471.895533 |
   | CSR | Int8Type | Int8 | -1.660 | 4626.181791 | 4549.389657 |
   | CSR | Int32Type | Int8 | -4.718 | 4708.044556 | 4485.938002 |
   | CSR | Int64Type | Int8 | -4.905 | 4687.961132 | 4458.024004 |
   | CSR | Int16Type | Int8 | -7.164 | 4851.803479 | 4504.201161 |
   | CSR | Int32Type | Int16 | -1.481 | 4722.676795 | 4652.711030 |
   | CSR | Int64Type | Int16 | -5.853 | 4573.124314 | 4305.462839 |
   | CSR | Int8Type | Int16 | -11.890 | 4631.290389 | 4080.628990 |
   | CSR | Int16Type | Int16 | -12.372 | 4737.320600 | 4151.197839 |
   | CSR | Int64Type | Float | -4.999 | 4711.575565 | 4476.032868 |
   | CSR | Int16Type | Float | -5.062 | 4811.457037 | 4567.909440 |
   | CSR | Int32Type | Float | -7.651 | 4814.813084 | 4446.432341 |
   | CSR | Int8Type | Float | -12.196 | 5158.868330 | 4529.676101 |
   | CSR | Int32Type | Double | 0.384 | 4631.005506 | 4648.789489 |
   | CSR | Int64Type | Double | -0.273 | 4816.669765 | 4803.522209 |
   | CSR | Int8Type | Double | -3.453 | 4646.652774 | 4486.217718 |
   | CSR | Int16Type | Double | -8.963 | 4952.032110 | 4508.161865 |
   | CSC | Int8Type | Int8 | -4.207 | 4871.793338 | 4666.848429 |
   | CSC | Int64Type | Int8 | -10.582 | 4571.251613 | 4087.509465 |
   | CSC | Int16Type | Int8 | -13.189 | 4865.666295 | 4223.934008 |
   | CSC | Int32Type | Int8 | -16.546 | 4999.577506 | 4172.360711 |
   | CSC | Int32Type | Int16 | -0.619 | 4842.515715 | 4812.531287 |
   | CSC | Int8Type | Int16 | -4.128 | 4689.737235 | 4496.135413 |
   | CSC | Int16Type | Int16 | -8.110 | 4643.413441 | 4266.836346 |
   | CSC | Int64Type | Int16 | -10.302 | 4971.879449 | 4459.696887 |
   | CSC | Int8Type | Float | -0.547 | 4694.347970 | 4668.667817 |
   | CSC | Int32Type | Float | -0.619 | 4718.207593 | 4688.994670 |
   | CSC | Int64Type | Float | -3.360 | 4847.753893 | 4684.849545 |
   | CSC | Int16Type | Float | -7.408 | 4956.795348 | 4589.607722 |
   | CSC | Int32Type | Double | 0.491 | 4917.172760 | 4941.321519 |
   | CSC | Int64Type | Double | -0.754 | 5069.589290 | 5031.376333 |
   | CSC | Int16Type | Double | -2.023 | 4762.425910 | 4666.071367 |
   | CSC | Int8Type | Double | -5.987 | 4934.016268 | 4638.636384 |
   | CSF | Int16Type | Int8 | 25.394 | 7885.986312 | 9888.571309 |
   | CSF | Int64Type | Int8 | 22.605 | 9083.444559 | 11136.716151 |
   | CSF | Int32Type | Int8 | 13.486 | 8228.377064 | 9338.031561 |
   | CSF | Int8Type | Int8 | 11.900 | 7687.347967 | 8602.168534 |
   | CSF | Int64Type | Int16 | 28.094 | 8247.255772 | 10564.280045 |
   | CSF | Int8Type | Int16 | 20.276 | 8172.411486 | 9829.454135 |
   | CSF | Int32Type | In
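Reading the table: the Change % column appears to be (contender - baseline) / baseline * 100. That formula is an assumption on my part, not stated in the comment; a quick check against the first COO row:

```python
# Assumed formula (inferred, not stated in the benchmark output above):
#   change_pct = (contender - baseline) / baseline * 100
baseline = 2459.397090    # COO / Int32Type / Int8, baseline time
contender = 3847.138775   # COO / Int32Type / Int8, contender time

change_pct = (contender - baseline) / baseline * 100
print(round(change_pct, 3))  # 56.426, matching the row's "Change %" value
```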

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448232201



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@

[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448233329



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/JsonFileReader.java
##
@@ -564,8 +564,10 @@ private ArrowBuf readIntoBuffer(BufferAllocator allocator, 
BufferType bufferType
 } else if (bufferType.equals(OFFSET)) {
   if (type == Types.MinorType.LARGEVARCHAR || type == 
Types.MinorType.LARGEVARBINARY) {

Review comment:
   thanks, done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 edited a comment on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


cyb70289 edited a comment on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652289408


   > You can't compute latency like that if the test is multithreaded.
   
   Assume two threads, one finishes `n1` IO in `t1` time, another finishes `n2` 
IO in `t2` time. So the average latency should be `(t1/n1 + t2/n2) / 2`, not 
`(t1+t2) / (n1+n2)`?
   Or something else?







[GitHub] [arrow] pitrou commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


pitrou commented on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652307944


   IMHO it should be `(t1+t2) / (n1+n2)`. Is it what the code is doing?







[GitHub] [arrow] cyb70289 commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


cyb70289 commented on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652311392


   > IMHO it should be `(t1+t2) / (n1+n2)`. Is it what the code is doing?
   
   Yes, this is what the code is doing.
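A quick numeric sketch (illustrative values, not the benchmark's code) of why the two averaging schemes discussed above differ:

```python
# Illustrative values only: thread i spends t_i seconds completing n_i requests.
t1, n1 = 10.0, 1000   # thread 1: many fast requests
t2, n2 = 10.0, 100    # thread 2: few slow requests

# Mean of per-thread latencies: (t1/n1 + t2/n2) / 2
per_thread_mean = (t1 / n1 + t2 / n2) / 2

# Pooled mean, weighting every completed request equally: (t1+t2) / (n1+n2)
pooled_mean = (t1 + t2) / (n1 + n2)

print(per_thread_mean)  # 0.055
print(pooled_mean)      # ~0.0182; the many fast requests dominate
```

The two agree only when every thread sees the same per-request latency; with uneven threads, the pooled form gives each completed call equal weight rather than each thread.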







[GitHub] [arrow] pitrou commented on pull request #7603: ARROW-9206: [C++][Flight] Add latency benchmark

2020-07-01 Thread GitBox


pitrou commented on pull request #7603:
URL: https://github.com/apache/arrow/pull/7603#issuecomment-652316126


   Ok, then I was mistaken. Sorry :-)







[GitHub] [arrow] liyafan82 commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-07-01 Thread GitBox


liyafan82 commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-652319908


   > Thanks a lot @liyafan82 I have addressed your suggestions and rebased
   
   @rymurr Thanks for your work. Will merge when it turns green. 







[GitHub] [arrow] rymurr commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448255547



##
File path: java/vector/src/main/codegen/templates/UnionLargeListWriter.java
##
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.vector.complex.writer.DecimalWriter;
+import org.apache.arrow.vector.holders.DecimalHolder;
+
+import java.lang.UnsupportedOperationException;
+import java.math.BigDecimal;
+
+<@pp.dropOutputFile />
+<@pp.changeOutputFile 
name="/org/apache/arrow/vector/complex/impl/UnionLargeListWriter.java" />
+
+
+<#include "/@includes/license.ftl" />
+
+package org.apache.arrow.vector.complex.impl;
+
+import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt;
+<#include "/@includes/vv_imports.ftl" />
+
+/*
+ * This class is generated using freemarker and the ${.template_name} template.
+ */
+
+@SuppressWarnings("unused")
+public class UnionLargeListWriter extends AbstractFieldWriter {

Review comment:
   agreed, I fixed this...was just being lazy ;-)









[GitHub] [arrow] rymurr commented on pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-07-01 Thread GitBox


rymurr commented on pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#issuecomment-652323175


   > Thanks for working on this @rymurr ! Apologies for taking so long to 
review.. It looks pretty good, but I saw what looked like inconsistencies in 
the `LargeListVector` APIs using ints vs longs to me, and otherwise only minor 
things to fix up.
   
   Thanks a lot for the thorough review. I have fixed up everything you 
mentioned. It appears some of the confusion was related to changes for 64-bit 
allocations that were recently merged and the rest was my ignorance!
   
   I have pushed a change with all your recommended fixes.







[GitHub] [arrow] pitrou opened a new pull request #7605: ARROW-9283: [Python] Expose build info

2020-07-01 Thread GitBox


pitrou opened a new pull request #7605:
URL: https://github.com/apache/arrow/pull/7605


   







[GitHub] [arrow] pitrou commented on pull request #7605: ARROW-9283: [Python] Expose build info

2020-07-01 Thread GitBox


pitrou commented on pull request #7605:
URL: https://github.com/apache/arrow/pull/7605#issuecomment-652330730


   There's a problem where we already generate `__version__` and it ends up 
different, for example:
   ```python
   >>> import pyarrow as pa
   >>> pa.__version__
   '0.17.1.dev572+gf25a014ab.d20200701'
   >>> pa.version
   '1.0.0'
   >>> pa.version_info
   VersionInfo(major=1, minor=0, patch=0)
   >>> pa.build_info
   BuildInfo(version='1.0.0', version_info=VersionInfo(major=1, minor=0, patch=0), so_version='100', full_so_version='100.0.0', git_id='5f1800a4dfaf38754362914faf2776d6c3ec9eb9', git_description='apache-arrow-0.17.0-568-g5f1800a4d', package_kind='')
   ```
   
   @xhochy @kou What is your advice on this?







[GitHub] [arrow] github-actions[bot] commented on pull request #7605: ARROW-9283: [Python] Expose build info

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7605:
URL: https://github.com/apache/arrow/pull/7605#issuecomment-652332966


   https://issues.apache.org/jira/browse/ARROW-9283







[GitHub] [arrow] lidavidm commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-07-01 Thread GitBox


lidavidm commented on pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#issuecomment-652379935


   @liyafan82 The problem actually isn't with big-endian platforms! It's 
because Java's ByteBuffer [defaults to 
big-endian](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/ByteBuffer.html#wrap(byte%5B%5D)),
 and some libraries, like Protobuf, inherit that behavior. We discovered this 
when developing with Flight internally; there are Protobuf APIs that return a 
ByteBuffer that may be big-endian; when Flight tries to copy the buffer into an 
ArrowBuf, it corrupts the data.
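   The failure mode described above can be illustrated outside Java. The sketch below uses Python's `struct` module only as a stand-in for byte-order handling; the variable names are illustrative, not part of any Arrow API:

   ```python
   import struct

   # Java's ByteBuffer.wrap() defaults to big-endian, while Arrow buffers
   # are little-endian; copying raw bytes without normalizing the order
   # corrupts multi-byte values.
   value = 0x01020304
   big = struct.pack(">i", value)      # what a default-order ByteBuffer holds
   little = struct.pack("<i", value)   # what an ArrowBuf expects
   # Reading big-endian bytes as if they were little-endian scrambles the value:
   corrupted = struct.unpack("<i", big)[0]
   ```

   This is why `ArrowBuf.setBytes` must account for the source buffer's byte order instead of copying bytes verbatim.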







[GitHub] [arrow] pitrou commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


pitrou commented on pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#issuecomment-652383485


   Thanks for the update. I will merge this PR once CI is green.







[GitHub] [arrow] pitrou commented on a change in pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


pitrou commented on a change in pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#discussion_r448326108



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -297,6 +297,116 @@ void AddAsciiLength(FunctionRegistry* registry) {
   DCHECK_OK(registry->AddFunction(std::move(func)));
 }
 
+// --
+// exact pattern detection
+
+using StrToBoolTransformFunc =
+std::function<void(const void*, const uint8_t*, int64_t, int64_t, uint8_t*)>;
+
+// Apply `transform` to input character data- this function cannot change the
+// length
+template <typename Type>
+void StringBoolTransform(KernelContext* ctx, const ExecBatch& batch,
+ StrToBoolTransformFunc transform, Datum* out) {
+  using offset_type = typename Type::offset_type;
+
+  if (batch[0].kind() == Datum::ARRAY) {
+const ArrayData& input = *batch[0].array();
+ArrayData* out_arr = out->mutable_array();
+if (input.length > 0) {
+  transform(
+  reinterpret_cast<const offset_type*>(input.buffers[1]->data()) + input.offset,
+  input.buffers[2]->data(), input.length, out_arr->offset,
+  out_arr->buffers[1]->mutable_data());
+}
+  } else {
+const auto& input = checked_cast(*batch[0].scalar());
+auto result = checked_pointer_cast<BooleanScalar>(MakeNullScalar(out->type()));
+uint8_t result_value = 0;
+if (input.is_valid) {
+  result->is_valid = true;
+  std::array<offset_type, 2> offsets{0, static_cast<offset_type>(input.value->size())};
+  transform(offsets.data(), input.value->data(), 1, /*output_offset=*/0,
+&result_value);
+  out->value = std::make_shared<BooleanScalar>(result_value > 0);
+}
+  }
+}
+
+template <typename offset_type>
+void TransformContainsExact(const uint8_t* pattern, int64_t pattern_length,
+const offset_type* offsets, const uint8_t* data,
+int64_t length, int64_t output_offset, uint8_t* output) {
+  // This is an implementation of the Knuth-Morris-Pratt algorithm
+
+  // Phase 1: Build the prefix table
+  std::vector<offset_type> prefix_table(pattern_length + 1);
+  offset_type prefix_length = -1;
+  prefix_table[0] = -1;
+  for (offset_type pos = 0; pos < pattern_length; ++pos) {
+// The prefix cannot be expanded, reset.
+if (prefix_length >= 0 && pattern[pos] != pattern[prefix_length]) {
+  prefix_length = prefix_table[prefix_length];
+}
+prefix_length++;
+prefix_table[pos + 1] = prefix_length;
+  }
+
+  // Phase 2: Find the prefix in the data
+  FirstTimeBitmapWriter bitmap_writer(output, output_offset, length);
+  for (int64_t i = 0; i < length; ++i) {
+const uint8_t* current_data = data + offsets[i];
+int64_t current_length = offsets[i + 1] - offsets[i];
+
+int64_t pattern_pos = 0;
+for (int64_t k = 0; k < current_length; k++) {
+  if (pattern[pattern_pos] == current_data[k]) {
+pattern_pos++;
+if (pattern_pos == pattern_length) {
+  bitmap_writer.Set();
+  break;
+}
+  } else {
+pattern_pos = std::max<int64_t>(0, prefix_table[pattern_pos]);
+  }
+}
+bitmap_writer.Next();
+  }
+  bitmap_writer.Finish();

Review comment:
   Is it me, or are you never clearing the output bit when the pattern isn't 
found?
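
   For reference, the kernel above follows the Knuth-Morris-Pratt scheme. A compact Python sketch of the standard construction (using the usual while-loop fallback during table building; the function names here are illustrative, not Arrow APIs):

   ```python
   def build_prefix_table(pattern: bytes) -> list:
       """KMP failure function: table[i] is the length of the longest proper
       prefix of pattern[:i] that is also a suffix of it."""
       table = [-1] * (len(pattern) + 1)
       prefix_len = -1
       for pos, byte in enumerate(pattern):
           # Fall back until the prefix can be extended (or is empty).
           while prefix_len >= 0 and byte != pattern[prefix_len]:
               prefix_len = table[prefix_len]
           prefix_len += 1
           table[pos + 1] = prefix_len
       return table

   def kmp_contains(value: bytes, pattern: bytes) -> bool:
       """Return True if `pattern` occurs as a substring of `value`."""
       table = build_prefix_table(pattern)
       pattern_pos = 0
       for byte in value:
           while pattern_pos >= 0 and byte != pattern[pattern_pos]:
               pattern_pos = table[pattern_pos]
           pattern_pos += 1
           if pattern_pos == len(pattern):
               return True
       return False
   ```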

##
File path: python/pyarrow/compute.py
##
@@ -113,6 +113,24 @@ def func(left, right):
 multiply = _simple_binary_function('multiply')
 
 
+def contains_exact(array, pattern):
+"""
+Check whether a pattern occurs as part of the values of the array.

Review comment:
   That's rather vague. Does `contains_exact(pa.array([1,2,3]), 
pa.array([1,2]))` return true? I don't think so.
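
   For clarity, the intended semantics appear to be per-element literal substring matching, not array containment. A plain-Python sketch of that behavior (a hypothetical helper, not the pyarrow API):

   ```python
   def contains_exact(values, pattern):
       """Per-element check: does `pattern` occur as a literal substring
       of each string element? Nulls propagate as None."""
       return [None if v is None else pattern in v for v in values]
   ```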

##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -259,6 +259,15 @@ Result<Datum> IsValid(const Datum& values, ExecContext* ctx = NULLPTR);
 ARROW_EXPORT
 Result<Datum> IsNull(const Datum& values, ExecContext* ctx = NULLPTR);
 
+// --
+// String functions
+
+struct ARROW_EXPORT ContainsExactOptions : public FunctionOptions {
+  explicit ContainsExactOptions(std::string pattern = "") : pattern(pattern) {}

Review comment:
   I don't think defaulting to the empty string makes sense here.









[GitHub] [arrow] liyafan82 commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


liyafan82 commented on pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#issuecomment-652395253


   @pitrou Thanks a lot for your effort. 







[GitHub] [arrow] liyafan82 commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-07-01 Thread GitBox


liyafan82 commented on pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#issuecomment-652395634


   > @liyafan82 The problem actually isn't with big-endian platforms! It's 
because Java's ByteBuffer [defaults to 
big-endian](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/ByteBuffer.html#wrap(byte%5B%5D)),
 and some libraries, like Protobuf, inherit that behavior. We discovered this 
when developing with Flight internally; there are Protobuf APIs that return a 
ByteBuffer that may be big-endian; when Flight tries to copy the buffer into an 
ArrowBuf, it corrupts the data.
   
   @lidavidm Thanks a lot for your clarification. This is really a problem that 
needs to be fixed. 







[GitHub] [arrow] liyafan82 commented on a change in pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-07-01 Thread GitBox


liyafan82 commented on a change in pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#discussion_r448338460



##
File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java
##
@@ -872,6 +874,7 @@ public void setBytes(long index, ByteBuffer src) {
   --length;
   ++dstAddress;
 }
+src.order(originalByteOrder);

Review comment:
   should we place this in a finally block?
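
   The suggested try/finally pattern, sketched in Python with a hypothetical buffer type (the real code restores a ByteBuffer's byte order in Java):

   ```python
   class Buffer:
       """Stand-in for a buffer whose byte order is caller-visible state."""
       def __init__(self):
           self.order = "big"

   def copy_little_endian(src):
       # Restore the original order in a finally block so the caller's
       # state survives even if the copy raises.
       original = src.order
       try:
           src.order = "little"
           # ... copy bytes while the buffer is in little-endian order ...
       finally:
           src.order = original
   ```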









[GitHub] [arrow] lidavidm commented on a change in pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-07-01 Thread GitBox


lidavidm commented on a change in pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#discussion_r448345297



##
File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java
##
@@ -872,6 +874,7 @@ public void setBytes(long index, ByteBuffer src) {
   --length;
   ++dstAddress;
 }
+src.order(originalByteOrder);

Review comment:
   Good idea, I've updated the PR.









[GitHub] [arrow] pitrou opened a new pull request #7606: ARROW-8434: [C++] Avoid multiple schema deserializations in RecordBatchFileReader

2020-07-01 Thread GitBox


pitrou opened a new pull request #7606:
URL: https://github.com/apache/arrow/pull/7606


   This doesn't seem to make a difference in the included benchmark.







[GitHub] [arrow] jianxind opened a new pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean

2020-07-01 Thread GitBox


jianxind opened a new pull request #7607:
URL: https://github.com/apache/arrow/pull/7607


   1. Add AVX2/AVX512 build versions of the aggregate sum/mean function.
  - Use set_source_files_properties to append the SIMD build option.
  - Register the SIMD path at runtime by CPU feature.
  - Move the sum/mean implementation to a header file so the SIMD versions can reuse it.
  - Also rework the implementation based on bit_block_counter.
   2. Introduce an ARROW_USER_SIMD_LEVEL env var for test scope, e.g. 
"ARROW_USER_SIMD_LEVEL=sse4_2 make unittest" on an AVX2-capable device.
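
   The runtime registration in item 1 can be sketched in Python (all names here are illustrative; the real dispatch lives in Arrow's C++ kernel registration, keyed on detected CPU features):

   ```python
   # Pick the best available kernel once, based on detected CPU features,
   # falling back to the scalar implementation.
   def sum_scalar(values):
       return sum(values)

   def sum_avx2(values):  # stand-in for the AVX2 build of the same kernel
       return sum(values)

   # Ordered from most to least specialized; "scalar" always applies.
   KERNELS = [("avx2", sum_avx2), ("scalar", sum_scalar)]

   def select_kernel(features):
       for name, fn in KERNELS:
           if name == "scalar" or name in features:
               return fn
   ```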







[GitHub] [arrow] jianxind commented on pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean

2020-07-01 Thread GitBox


jianxind commented on pull request #7607:
URL: https://github.com/apache/arrow/pull/7607#issuecomment-652425725


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark







[GitHub] [arrow] jianxind commented on pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean

2020-07-01 Thread GitBox


jianxind commented on pull request #7607:
URL: https://github.com/apache/arrow/pull/7607#issuecomment-652431931


   @emkornfield This is the new version of the sum aggregate without intrinsics; 
could you help review it?
   
   The dense part gets nearly the same scores as the intrinsic version for AVX2 
on clang; the gcc result is a little lower.
   
   Below are the benchmark results (null_percent 0 and 0.01%) on an AVX2 
(i7-8700) device.
   Before
   ```
   SumKernelFloat/1048576/1     97.5 us   97.3 us    7150  bytes_per_second=10.0336G/s null_percent=0.01 size=1048.58k
   SumKernelFloat/1048576/0     62.1 us   62.0 us   11292  bytes_per_second=15.7443G/s null_percent=0 size=1048.58k
   SumKernelDouble/1048576/1    35.4 us   35.4 us   19781  bytes_per_second=27.5977G/s null_percent=0.01 size=1048.58k
   SumKernelDouble/1048576/0    32.5 us   32.5 us   21534  bytes_per_second=30.0657G/s null_percent=0 size=1048.58k
   SumKernelInt8/1048576/1       183 us    183 us    3832  bytes_per_second=5.34627G/s null_percent=0.01 size=1048.58k
   SumKernelInt8/1048576/0       133 us    132 us    5285  bytes_per_second=7.37317G/s null_percent=0 size=1048.58k
   SumKernelInt16/1048576/1     93.3 us   93.2 us    7505  bytes_per_second=10.4762G/s null_percent=0.01 size=1048.58k
   SumKernelInt16/1048576/0     68.4 us   68.3 us   10249  bytes_per_second=14.2887G/s null_percent=0 size=1048.58k
   SumKernelInt32/1048576/1     49.2 us   49.1 us   14255  bytes_per_second=19.874G/s null_percent=0.01 size=1048.58k
   SumKernelInt32/1048576/0     40.3 us   40.2 us   17654  bytes_per_second=24.2688G/s null_percent=0 size=1048.58k
   SumKernelInt64/1048576/1     35.3 us   35.3 us   19870  bytes_per_second=27.6826G/s null_percent=0.01 size=1048.58k
   SumKernelInt64/1048576/0     32.4 us   32.3 us   21628  bytes_per_second=30.1902G/s null_percent=0 size=1048.58k
   ```
   
   After
   ```
   SumKernelFloat/1048576/1     41.1 us   41.0 us   17004  bytes_per_second=23.7947G/s null_percent=0.01 size=1048.58k
   SumKernelFloat/1048576/0     25.1 us   25.0 us   27884  bytes_per_second=38.9922G/s null_percent=0 size=1048.58k
   SumKernelDouble/1048576/1    24.6 us   24.5 us   28423  bytes_per_second=39.8205G/s null_percent=0.01 size=1048.58k
   SumKernelDouble/1048576/0    17.1 us   17.1 us   40881  bytes_per_second=57.1186G/s null_percent=0 size=1048.58k
   SumKernelInt8/1048576/1       116 us    115 us    6073  bytes_per_second=8.46685G/s null_percent=0.01 size=1048.58k
   SumKernelInt8/1048576/0      61.0 us   60.9 us   11501  bytes_per_second=16.0293G/s null_percent=0 size=1048.58k
   SumKernelInt16/1048576/1     62.2 us   62.2 us   11250  bytes_per_second=15.7108G/s null_percent=0.01 size=1048.58k
   SumKernelInt16/1048576/0     37.0 us   37.0 us   19883  bytes_per_second=26.4204G/s null_percent=0 size=1048.58k
   SumKernelInt32/1048576/1     38.4 us   38.4 us   18217  bytes_per_second=25.4367G/s null_percent=0.01 size=1048.58k
   SumKernelInt32/1048576/0     24.6 us   24.5 us   28531  bytes_per_second=39.8216G/s null_percent=0 size=1048.58k
   SumKernelInt64/1048576/1     24.1 us   24.1 us   29069  bytes_per_second=40.5531G/s null_percent=0.01 size=1048.58k
   SumKernelInt64/1048576/0     16.7 us   16.7 us   41887  bytes_per_second=58.3943G/s null_percent=0 size=1048.58k
   ```







[GitHub] [arrow] github-actions[bot] commented on pull request #7607: ARROW-8996: [C++] Add AVX version for aggregate sum/mean

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7607:
URL: https://github.com/apache/arrow/pull/7607#issuecomment-652433109


   https://issues.apache.org/jira/browse/ARROW-8996







[GitHub] [arrow] github-actions[bot] commented on pull request #7606: ARROW-8434: [C++] Avoid multiple schema deserializations in RecordBatchFileReader

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7606:
URL: https://github.com/apache/arrow/pull/7606#issuecomment-652433108


   https://issues.apache.org/jira/browse/ARROW-8434







[GitHub] [arrow] wesm commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


wesm commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652450464


   > Could you elaborate? Why is this not a problem with the lower/upper 
kernels?
   
   The data preallocation is only done for fixed-size outputs (e.g. boolean, 
integers, floating point).







[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timzone information in pandas conversion

2020-07-01 Thread GitBox


emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r448413419



##
File path: cpp/src/arrow/python/datetime.cc
##
@@ -262,6 +265,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
 PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {
+  static const std::regex kFixedOffset("([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$");
+  OwnedRef pytz;
+  RETURN_NOT_OK(internal::ImportModule("pytz", &pytz));
+
+  std::cmatch match;
+  if (std::regex_match(tz.c_str(), match, kFixedOffset)) {

Review comment:
   Should avoid c_str here
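
   The fixed-offset parsing in the quoted diff can be sketched in Python using the same regex, building a `datetime.timezone` directly instead of going through pytz (hypothetical helper name):

   ```python
   import re
   from datetime import timedelta, timezone

   # Same pattern as the kFixedOffset regex in the diff above.
   FIXED_OFFSET = re.compile(r"([+-])(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9])$")

   def string_to_tzinfo(tz: str) -> timezone:
       """Parse a fixed-offset timezone string such as '+05:30' or '-08:00'."""
       match = FIXED_OFFSET.match(tz)
       if match is None:
           raise ValueError(f"not a fixed offset: {tz!r}")
       sign = 1 if match.group(1) == "+" else -1
       delta = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
       return timezone(sign * delta)
   ```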









[GitHub] [arrow] jorisvandenbossche opened a new pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

2020-07-01 Thread GitBox


jorisvandenbossche opened a new pull request #7608:
URL: https://github.com/apache/arrow/pull/7608


   







[GitHub] [arrow] github-actions[bot] commented on pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#issuecomment-652480774


   https://issues.apache.org/jira/browse/ARROW-9288







[GitHub] [arrow] kiszk commented on pull request #7507: ARROW-8797: [C++] Create test to receive RecordBatch for different endian

2020-07-01 Thread GitBox


kiszk commented on pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#issuecomment-652482871


   @wesm Thank you for your suggestion. I will pursue the approach that you 
suggested. I will check the integration test command line tool and the 
integration test with the JSON_TO_ARROW mode.
   







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7608: ARROW-9288: [C++][Dataset] Fix PartitioningFactory with dictionary encoding for HivePartioning

2020-07-01 Thread GitBox


jorisvandenbossche commented on a change in pull request #7608:
URL: https://github.com/apache/arrow/pull/7608#discussion_r448438477



##
File path: cpp/src/arrow/dataset/partition.cc
##
@@ -646,15 +657,26 @@ class HivePartitioningFactory : public PartitioningFactory {
   }
 }
 
-return impl.Finish(&dictionaries_);
+auto schema_result = impl.Finish(&dictionaries_);
+field_names_ = impl.FieldNames();
+return schema_result;
   }
 
   Result<std::shared_ptr<Partitioning>> Finish(
   const std::shared_ptr<Schema>& schema) const override {
-return std::shared_ptr<HivePartitioning>(new HivePartitioning(schema, dictionaries_));
+for (FieldRef ref : field_names_) {

Review comment:
   I should probably guard here against the case that `field_names_` was 
not yet updated (if `Finish` is called without `Inspect` being called), with 
empty vector?









[GitHub] [arrow] emkornfield commented on a change in pull request #7604: ARROW-9223: [Python] Propagate Timzone information in pandas conversion

2020-07-01 Thread GitBox


emkornfield commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r448445229



##
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##
@@ -951,8 +951,21 @@ struct ObjectWriterVisitor {
   template <typename Type>
   enable_if_timestamp<Type, Status> Visit(const Type& type) {
 const TimeUnit::type unit = type.unit();
-auto WrapValue = [unit](typename Type::c_type value, PyObject** out) {
+PyObject* tzinfo = nullptr;
+if (!type.timezone().empty()) {
+  RETURN_NOT_OK(internal::StringToTzinfo(type.timezone(), &tzinfo));
+}
+auto WrapValue = [unit, tzinfo](typename Type::c_type value, PyObject** out) {
+  OwnedRef tz(tzinfo);
   RETURN_NOT_OK(internal::PyDateTime_from_int(value, unit, out));
+  RETURN_IF_PYERROR();
+  if (tzinfo != nullptr) {
+PyObject* with_tz = PyObject_CallMethod(*out, "astimezone", "O", tzinfo);

Review comment:
   should always set UTC here first.
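
   The point of the review comment, sketched in Python: a naive datetime built from an epoch value represents UTC, so UTC must be attached explicitly before converting, otherwise `astimezone()` interprets the value as local time. (Values here are illustrative.)

   ```python
   from datetime import datetime, timedelta, timezone

   # Naive datetime decoded from an epoch timestamp: implicitly UTC.
   naive_utc = datetime(2020, 7, 1, 12, 0, 0)
   tz = timezone(timedelta(hours=2))
   # Attach UTC first, then convert to the target timezone.
   converted = naive_utc.replace(tzinfo=timezone.utc).astimezone(tz)
   ```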









[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-07-01 Thread GitBox


jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993


   @bkietz thanks for the update ensuring all uniques as dictionary values!
   
   Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / 
#7608
   
   Further, a usability issue: this now creates partition expressions that use 
a dictionary type. That means that doing something like 
`dataset.to_table(filter=ds.field("part") == "A")` to filter on the partition 
field with a plain string expression doesn't work, limiting the usability of 
this option (and even with the new Python scalar support, constructing the 
correct expression would not be easy):
   
   ```
   In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)

   In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)

   In [11]: fragment = list(dataset.get_fragments())[0]

   In [12]: fragment.partition_expression
   Out[12]: 
   )>

   In [13]: dataset.to_table(filter=ds.field("part") == "A")
   ...
   ArrowNotImplementedError: cast from string
   ```
   
   It might also be an option to keep the `partition_expression` use the 
dictionary *value type* instead of dictionary type?
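
   The underlying mismatch can be shown in plain Python (illustrative data, not pyarrow API): a dictionary-encoded column stores indices into a dictionary, so comparing it against a plain string needs to happen on the decoded values (or on the dictionary's value type), not on the indices.

   ```python
   dictionary = ["A", "B"]          # unique partition values
   indices = [0, 0, 1]              # dictionary-encoded partition column
   # Decode through the dictionary, then compare against the plain string.
   decoded = [dictionary[i] for i in indices]
   mask = [v == "A" for v in decoded]
   ```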







[GitHub] [arrow] jorisvandenbossche commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-07-01 Thread GitBox


jorisvandenbossche commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-652497706


   @rjzamora `num_rows` is already available on the RowGroupInfo object 
(https://github.com/apache/arrow/blob/cd3ed605857994575326c072bbfcf995541fa80e/python/pyarrow/_dataset.pyx#L845-L847)
   
   For the `total_byte_size`, can you open a JIRA for this? (it should be 
similar as `num_rows` to get / cache from the parquet row group, I think)







[GitHub] [arrow] nealrichardson opened a new pull request #7609: ARROW-9289: [R] Remove deprecated functions

2020-07-01 Thread GitBox


nealrichardson opened a new pull request #7609:
URL: https://github.com/apache/arrow/pull/7609


   







[GitHub] [arrow] nealrichardson closed pull request #7602: ARROW-9083: [R] collect int64, uint32, uint64 as R integer type if not out of bounds

2020-07-01 Thread GitBox


nealrichardson closed pull request #7602:
URL: https://github.com/apache/arrow/pull/7602


   







[GitHub] [arrow] jorisvandenbossche commented on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-07-01 Thread GitBox


jorisvandenbossche commented on pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#issuecomment-652504934


   > We will have to resolve the sum([]) -> null/0 by introducing a "minimum 
valid values" option.
   
   Do we already have a JIRA to track this?







[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-07-01 Thread GitBox


jorisvandenbossche edited a comment on pull request #7478:
URL: https://github.com/apache/arrow/pull/7478#issuecomment-652504934


   > We will have to resolve the sum([]) -> null/0 by introducing a "minimum 
valid values" option.
   
   Do we already have a JIRA to track this?
   
   EDIT -> it is mentioned in ARROW-9054







[GitHub] [arrow] kiszk commented on pull request #7596: ARROW-9163: [C++] Validate UTF8 contents of a StringArray

2020-07-01 Thread GitBox


kiszk commented on pull request #7596:
URL: https://github.com/apache/arrow/pull/7596#issuecomment-652505301


   Looks good







[GitHub] [arrow] github-actions[bot] commented on pull request #7609: ARROW-9289: [R] Remove deprecated functions

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7609:
URL: https://github.com/apache/arrow/pull/7609#issuecomment-652507951


   https://issues.apache.org/jira/browse/ARROW-9289







[GitHub] [arrow] rymurr commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-07-01 Thread GitBox


rymurr commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-652516529


   This has been modified to incorporate the changes to Unions as proposed on 
the mailing list







[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-07-01 Thread GitBox


nealrichardson commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r448483516



##
File path: r/src/array_from_vector.cpp
##
@@ -918,6 +923,97 @@ class Time64Converter : public TimeConverter {
   }
 };
 
+template <typename Builder>
+class BinaryVectorConverter : public VectorConverter {
+ public:
+  ~BinaryVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast<Builder*>(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list"));
+R_xlen_t n = XLENGTH(obj);
+
+// Reserve enough space before appending
+int64_t size = 0;
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (!Rf_isNull(obj_i)) {
+ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP,
+Status::RError("Expecting a raw vector"));
+size += XLENGTH(obj_i);
+  }
+}
+RETURN_NOT_OK(typed_builder_->Reserve(size));
+
+// append
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (Rf_isNull(obj_i)) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i)));
+  }
+}
+return Status::OK();
+  }
+
+  Status GetResult(std::shared_ptr<Array>* result) {
+return typed_builder_->Finish(result);
+  }
+
+ private:
+  Builder* typed_builder_;
+};
+
+template <typename Builder>
+class StringVectorConverter : public VectorConverter {
+ public:
+  ~StringVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast<Builder*>(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != STRSXP,
+Status::RError("Expecting a character vector"));
+R_xlen_t n = XLENGTH(obj);
+
+// Reserve enough space before appending
+int64_t size = 0;
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP string_i = STRING_ELT(obj, i);
+  if (string_i != NA_STRING) {
+size += XLENGTH(string_i);
+  }
+}
+RETURN_NOT_OK(typed_builder_->Reserve(size));
+
+// append
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP string_i = STRING_ELT(obj, i);
+  if (string_i == NA_STRING) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+RETURN_NOT_OK(typed_builder_->Append(CHAR(string_i), XLENGTH(string_i)));

Review comment:
   I fixed this; will deal with efficiency/redundancy in a followup.
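   The converter above uses a two-pass scheme: the first loop computes the
total byte size so the builder can reserve once, and the second loop appends
values and nulls. A rough plain-Python analogue of that idea (the builder class
here is a toy, not an Arrow API):

   ```python
   class BinaryBuilder:
       """Toy builder: data buffer plus offsets and validity, filled in order."""
       def __init__(self):
           self.data = bytearray()
           self.offsets = [0]
           self.valid = []

       def reserve(self, nbytes):
           # mirrors typed_builder_->Reserve(size): a real builder would
           # grow its capacity to nbytes here, in one allocation
           pass

       def append(self, value):
           self.valid.append(True)
           self.data.extend(value)
           self.offsets.append(len(self.data))

       def append_null(self):
           self.valid.append(False)
           self.offsets.append(len(self.data))

   def ingest(values):
       builder = BinaryBuilder()
       # pass 1: compute the total byte size before appending
       builder.reserve(sum(len(v) for v in values if v is not None))
       # pass 2: append values and nulls in order
       for v in values:
           builder.append_null() if v is None else builder.append(v)
       return builder
   ```

   Sizing up front avoids repeated reallocation while the second loop runs.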









[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-07-01 Thread GitBox


nealrichardson commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r448490350



##
File path: r/src/array_to_vector.cpp
##
@@ -693,6 +741,9 @@ std::shared_ptr<Converter> Converter::Make(const std::shared_ptr<DataType>& type
 case Type::BOOL:
   return std::make_shared(std::move(arrays));
 
+case Type::BINARY:

Review comment:
   Fixed types: https://issues.apache.org/jira/browse/ARROW-9291









[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-07-01 Thread GitBox


nealrichardson commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r448491760



##
File path: r/src/array_from_vector.cpp
##
@@ -918,6 +923,97 @@ class Time64Converter : public TimeConverter {
   }
 };
 
+template <typename Builder>
+class BinaryVectorConverter : public VectorConverter {
+ public:
+  ~BinaryVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast<Builder*>(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list"));
+R_xlen_t n = XLENGTH(obj);
+
+// Reserve enough space before appending
+int64_t size = 0;
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (!Rf_isNull(obj_i)) {
+ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP,
+Status::RError("Expecting a raw vector"));
+size += XLENGTH(obj_i);
+  }
+}
+RETURN_NOT_OK(typed_builder_->Reserve(size));
+
+// append
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (Rf_isNull(obj_i)) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i)));
+  }
+}
+return Status::OK();
+  }
+
+  Status GetResult(std::shared_ptr<Array>* result) {
+return typed_builder_->Finish(result);
+  }
+
+ private:
+  Builder* typed_builder_;
+};
+
+template <typename Builder>
+class StringVectorConverter : public VectorConverter {
+ public:
+  ~StringVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast<Builder*>(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != STRSXP,
+Status::RError("Expecting a character vector"));
+R_xlen_t n = XLENGTH(obj);
+
+// Reserve enough space before appending
+int64_t size = 0;
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP string_i = STRING_ELT(obj, i);
+  if (string_i != NA_STRING) {
+size += XLENGTH(string_i);
+  }
+}
+RETURN_NOT_OK(typed_builder_->Reserve(size));
+
+// append
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP string_i = STRING_ELT(obj, i);
+  if (string_i == NA_STRING) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+RETURN_NOT_OK(typed_builder_->Append(CHAR(string_i), XLENGTH(string_i)));

Review comment:
   Added a note to https://issues.apache.org/jira/browse/ARROW-7798 about 
the redundancy. I checked code coverage and it appears that both paths are 
being called, so that should be addressed (my guess is that factor levels are 
still going through the old path).









[GitHub] [arrow] BryanCutler commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-07-01 Thread GitBox


BryanCutler commented on pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#issuecomment-652541712


   @kszucs the nightly builds against Spark master have been passing. Do you 
think you could update this to just add the test against branch-3.0 and remove 
branch-2.4 for now? I'm not sure if the bot will pick it up from this PR, but 
I'll try to kick off a test run.







[GitHub] [arrow] nealrichardson closed pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-07-01 Thread GitBox


nealrichardson closed pull request #7514:
URL: https://github.com/apache/arrow/pull/7514


   







[GitHub] [arrow] BryanCutler commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-07-01 Thread GitBox


BryanCutler commented on pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#issuecomment-652542479


   @github-actions crossbow submit test-conda-python-3.7-spark-branch-3.0







[GitHub] [arrow] BryanCutler commented on a change in pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-07-01 Thread GitBox


BryanCutler commented on a change in pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#discussion_r448503228



##
File path: dev/tasks/tasks.yml
##
@@ -1833,12 +1833,32 @@ tasks:
 HDFS: 2.9.2
   run: conda-python-hdfs
 
-  test-conda-python-3.7-spark-master:
+  test-conda-python-3.6-spark-2.4.6:
 ci: github
 template: docker-tests/github.linux.yml
 params:
   env:
-PYTHON: 3.7
+PYTHON: 3.6
+SPARK: "v2.4.6"
+  # use the master branch of spark, so prevent reusing any layers
+  run: --no-leaf-cache --env ARROW_PRE_0_15_IPC_FORMAT=1 conda-python-spark
+
+  test-conda-python-3.7-spark-branch-3.0:

Review comment:
   looks like this should be named python-3.8 also









[GitHub] [arrow] nealrichardson closed pull request #7609: ARROW-9289: [R] Remove deprecated functions

2020-07-01 Thread GitBox


nealrichardson closed pull request #7609:
URL: https://github.com/apache/arrow/pull/7609


   







[GitHub] [arrow] github-actions[bot] commented on pull request #6316: ARROW-7717: [CI] Have nightly integration test for Spark's latest release

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #6316:
URL: https://github.com/apache/arrow/pull/6316#issuecomment-652544533


   Revision: a914eea4f3ab16e359adee2f37a4fb30a1eba86c
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-371](https://github.com/ursa-labs/crossbow/branches/all?query=actions-371)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.7-spark-branch-3.0|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-371-github-test-conda-python-3.7-spark-branch-3.0)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-371-github-test-conda-python-3.7-spark-branch-3.0)|







[GitHub] [arrow] kiszk commented on a change in pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-07-01 Thread GitBox


kiszk commented on a change in pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#discussion_r448506425



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -297,6 +297,116 @@ void AddAsciiLength(FunctionRegistry* registry) {
   DCHECK_OK(registry->AddFunction(std::move(func)));
 }
 
+// ----------------------------------------------------------------------
+// exact pattern detection
+
+using StrToBoolTransformFunc =
+    std::function<void(const void*, const uint8_t*, int64_t, int64_t, uint8_t*)>;
+
+// Apply `transform` to input character data; this function cannot change
+// the length
+template <typename Type>
+void StringBoolTransform(KernelContext* ctx, const ExecBatch& batch,
+                         StrToBoolTransformFunc transform, Datum* out) {
+  using offset_type = typename Type::offset_type;
+
+  if (batch[0].kind() == Datum::ARRAY) {
+const ArrayData& input = *batch[0].array();
+ArrayData* out_arr = out->mutable_array();
+if (input.length > 0) {
+  transform(
+      reinterpret_cast<const offset_type*>(input.buffers[1]->data()) + input.offset,
+  input.buffers[2]->data(), input.length, out_arr->offset,
+  out_arr->buffers[1]->mutable_data());
+}
+  } else {
+const auto& input = checked_cast(*batch[0].scalar());
+auto result = 
checked_pointer_cast(MakeNullScalar(out->type()));

Review comment:
   We can move this line into the then clause before line 327.
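   For reference, the semantics this kernel implements (element-wise
exact-substring containment with null propagation) can be sketched in plain
Python; this is an illustration of the behavior, not the Arrow API:

   ```python
   def contains_exact(values, pattern):
       """Element-wise exact substring test; None propagates as None."""
       return [None if v is None else pattern in v for v in values]

   print(contains_exact(["abc", "bcd", None], "ab"))  # [True, False, None]
   ```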









[GitHub] [arrow] saethlin opened a new pull request #7610: ARROW-9290: [Rust] [Parquet] Add features to allow opting out of dependencies

2020-07-01 Thread GitBox


saethlin opened a new pull request #7610:
URL: https://github.com/apache/arrow/pull/7610


   







[GitHub] [arrow] github-actions[bot] commented on pull request #7610: ARROW-9290: [Rust] [Parquet] Add features to allow opting out of dependencies

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7610:
URL: https://github.com/apache/arrow/pull/7610#issuecomment-652553010


   https://issues.apache.org/jira/browse/ARROW-9290







[GitHub] [arrow] nealrichardson opened a new pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

2020-07-01 Thread GitBox


nealrichardson opened a new pull request #7611:
URL: https://github.com/apache/arrow/pull/7611


   







[GitHub] [arrow] github-actions[bot] commented on pull request #7611: ARROW-3308: [R] Convert R character vector with data exceeding 2GB to Large type

2020-07-01 Thread GitBox


github-actions[bot] commented on pull request #7611:
URL: https://github.com/apache/arrow/pull/7611#issuecomment-652580102


   https://issues.apache.org/jira/browse/ARROW-3308







[GitHub] [arrow] pitrou closed pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-07-01 Thread GitBox


pitrou closed pull request #7544:
URL: https://github.com/apache/arrow/pull/7544


   







[GitHub] [arrow] pitrou opened a new pull request #7612: ARROW-7011: [C++] Implement casts from float/double to decimal

2020-07-01 Thread GitBox


pitrou opened a new pull request #7612:
URL: https://github.com/apache/arrow/pull/7612


   Also naturally available in Python using the Array.cast() method.







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r448550619



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,748 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
-"""
-
-
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
+The base class for scalars.
 """
-# TODO rename this NullValue?
-
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
-
-self.type = null()
-
-def __repr__(self):
-return 'NULL'
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-def __eq__(self, other):
-return NA
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
 
-_NULL = NA = NullType()
+if type_id == _Type_NA:
+return _NULL
 
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
 
-cdef class ArrayValue(Scalar):
-"""
-The base class for non-null array elements.
-"""
+return self
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+@property
+def type(self):
+"""
+Data type of the Scalar object.
+"""
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+@property
+def is_valid(self):
+"""
+Holds a valid (non-null) value.
+"""
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
 def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
 
 def __hash__(self):
-return hash(self.as_py())
+cdef CScalarHash hasher
+return hasher(self.wrapped)
 
+def as_py(self):
+raise NotImplementedError()
 
-cdef class BooleanValue(ArrayValue):
-"""
-Concrete class for boolean array elements.
-"""
 
-def as_py(self):
-"""
-Return this value as a Python bool.
-"""
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+_NULL = NA = None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class NullScalar(Scalar):
 """
-Concrete class for int8 array elements.
+Concrete class for null scalars.
 """
 
-def as_py(self):
-"""
-Return this value as a Python int.
-"""
-cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-return ap.Value(self.index)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise RuntimeError('Cannot create multiple NullScalar instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
+def __init__(self):
+pass
 
-cdef class UInt8Value(ArrayValue):
-"""
-Concrete class for uint8 array elements.
-"""
+def __eq__(self, other):
+return NA
+
+def __hash__(self):
+   

[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r448550822



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,748 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
-"""
-
-
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
+The base class for scalars.
 """
-# TODO rename this NullValue?
-
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
-
-self.type = null()
-
-def __repr__(self):
-return 'NULL'
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-def __eq__(self, other):
-return NA
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
 
-_NULL = NA = NullType()
+if type_id == _Type_NA:
+return _NULL
 
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
 
-cdef class ArrayValue(Scalar):
-"""
-The base class for non-null array elements.
-"""
+return self
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+@property
+def type(self):
+"""
+Data type of the Scalar object.
+"""
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+@property
+def is_valid(self):
+"""
+Holds a valid (non-null) value.
+"""
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
 def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
 
 def __hash__(self):
-return hash(self.as_py())
+cdef CScalarHash hasher
+return hasher(self.wrapped)
 
+def as_py(self):
+raise NotImplementedError()
 
-cdef class BooleanValue(ArrayValue):
-"""
-Concrete class for boolean array elements.
-"""
 
-def as_py(self):
-"""
-Return this value as a Python bool.
-"""
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+_NULL = NA = None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class NullScalar(Scalar):
 """
-Concrete class for int8 array elements.
+Concrete class for null scalars.
 """
 
-def as_py(self):
-"""
-Return this value as a Python int.
-"""
-cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-return ap.Value(self.index)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise RuntimeError('Cannot create multiple NullScalar instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
+def __init__(self):
+pass
 
-cdef class UInt8Value(ArrayValue):
-"""
-Concrete class for uint8 array elements.
-"""
+def __eq__(self, other):
+return NA
+
+def __hash__(self):
+   

[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r448551033



##
File path: python/pyarrow/tests/test_scalars.py
##
@@ -16,427 +16,443 @@
 # under the License.
 
 import datetime
+import decimal
 import pytest
-import unittest
 
 import numpy as np
 
 import pyarrow as pa
 
 
-class TestScalars(unittest.TestCase):
-
-def test_null_singleton(self):
-with pytest.raises(Exception):
-pa.NAType()
+@pytest.mark.parametrize(['value', 'ty', 'klass', 'deprecated'], [
+(False, None, pa.BooleanScalar, pa.BooleanValue),
+(True, None, pa.BooleanScalar, pa.BooleanValue),
+(1, None, pa.Int64Scalar, pa.Int64Value),
+(-1, None, pa.Int64Scalar, pa.Int64Value),
+(1, pa.int8(), pa.Int8Scalar, pa.Int8Value),
+(1, pa.uint8(), pa.UInt8Scalar, pa.UInt8Value),
+(1, pa.int16(), pa.Int16Scalar, pa.Int16Value),
+(1, pa.uint16(), pa.UInt16Scalar, pa.UInt16Value),
+(1, pa.int32(), pa.Int32Scalar, pa.Int32Value),
+(1, pa.uint32(), pa.UInt32Scalar, pa.UInt32Value),
+(1, pa.int64(), pa.Int64Scalar, pa.Int64Value),
+(1, pa.uint64(), pa.UInt64Scalar, pa.UInt64Value),
+(1.0, None, pa.DoubleScalar, pa.DoubleValue),
+(np.float16(1.0), pa.float16(), pa.HalfFloatScalar, pa.HalfFloatValue),
+(1.0, pa.float32(), pa.FloatScalar, pa.FloatValue),
+("string", None, pa.StringScalar, pa.StringValue),
+(b"bytes", None, pa.BinaryScalar, pa.BinaryValue),
+([1, 2, 3], None, pa.ListScalar, pa.ListValue),
+([1, 2, 3, 4], pa.large_list(pa.int8()), pa.LargeListScalar,
+ pa.LargeListValue),
+(datetime.date.today(), None, pa.Date32Scalar, pa.Date64Value),
+(datetime.datetime.now(), None, pa.TimestampScalar, pa.TimestampValue),
+({'a': 1, 'b': [1, 2]}, None, pa.StructScalar, pa.StructValue)
+])
+def test_basics(value, ty, klass, deprecated):
+s = pa.scalar(value, type=ty)
+assert isinstance(s, klass)
+assert s == value
+assert s == s
+assert s != "else"
+assert hash(s) == hash(s)
+assert s.is_valid is True
+with pytest.warns(FutureWarning):
+isinstance(s, deprecated)
+
+s = pa.scalar(None, type=s.type)
+assert s.is_valid is False
+assert s.as_py() is None
+
+
+def test_null_singleton():
+with pytest.raises(Exception):
+pa.NullScalar()
+
+
+def test_nulls():
+null = pa.scalar(None)
+assert null is pa.NA
+assert null.as_py() is None
+assert (null == "something") is pa.NA

Review comment:
   Added.
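   The construction path these tests exercise goes through a wrap factory
that bypasses `__init__` via `typ.__new__(typ)`. The same pattern in plain
Python (class and function names here are illustrative, not the pyarrow API):

   ```python
   class Scalar:
       def __init__(self):
           raise TypeError("Do not call {}'s constructor directly, use "
                           "scalar() instead.".format(type(self).__name__))

       @classmethod
       def wrap(cls, wrapped):
           # cls.__new__(cls) skips __init__, so the guard above never fires
           self = cls.__new__(cls)
           self.wrapped = wrapped
           return self

       def as_py(self):
           return self.wrapped

   class Int64Scalar(Scalar):
       pass

   def scalar(value):
       # a real factory would dispatch on the inferred type first
       return Int64Scalar.wrap(value)
   ```

   Direct instantiation raises, while the factory builds a usable instance.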









[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-07-01 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r448551482



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
+
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-self.type = null()
+@property
+def is_valid(self):
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-return NA
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
+
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
+
+def as_py(self):
+raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __init__(self):
+pass
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __eq__(self, other):
+return NA

Review comment:
   Updated to return False.
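   The singleton guard plus the changed equality semantics discussed above,
sketched in plain Python (the Cython version guards in `__cinit__`; this is
only an analogue):

   ```python
   class NullScalar:
       _instance = None

       def __new__(cls):
           # enforce the singleton: a second construction attempt raises
           if cls._instance is not None:
               raise RuntimeError("Cannot create multiple NullScalar instances")
           cls._instance = super().__new__(cls)
           return cls._instance

       def __eq__(self, other):
           # after the review change: comparing null with anything else
           # returns False rather than NA; null still equals itself
           return self is other

       def __hash__(self):
           return 0

   NA = NullScalar()
   ```

   Returning a plain bool keeps `__eq__` usable in ordinary Python contexts
such as `if` tests and container membership.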








