[GitHub] [arrow] cyb70289 commented on issue #9735: [Golang] Create ipc format for ipc Reader

2021-03-17 Thread GitBox


cyb70289 commented on issue #9735:
URL: https://github.com/apache/arrow/issues/9735#issuecomment-801634282


   > ` subscribe first to send an e-mail`
   > Sorry, this is my first time using GitHub. Please show me how!
   
   @hunght3101, the mailing list is the preferred way to discuss issues. You can 
simply click the `subscribe` link on the page https://arrow.apache.org/community/ 
to subscribe to the user@ mailing list, then send an email to the community 
about your issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] projjal commented on a change in pull request #9707: ARROW-11984: [C++] Implement SHA128 and SHA256 functions in Gandiva module - WIP

2021-03-17 Thread GitBox


projjal commented on a change in pull request #9707:
URL: https://github.com/apache/arrow/pull/9707#discussion_r596529641



##
File path: cpp/src/gandiva/gdv_function_stubs.cc
##
@@ -122,6 +123,133 @@ int32_t gdv_fn_populate_varlen_vector(int64_t context_ptr, int8_t* data_ptr,
   return 0;
 }
 
 
+#define SHA128_HASH_FUNCTION(TYPE)                                       \
+  GANDIVA_EXPORT                                                         \
+  const char *gdv_fn_sha128_##TYPE(int64_t context, gdv_##TYPE value,    \
+                                   bool validity, int32_t *out_length) { \
+    if (!validity) {                                                     \
+      *out_length = 0;                                                   \
+      return "";                                                         \

Review comment:
   Return a hash for null values too.

##
File path: cpp/src/gandiva/tests/hash_test.cc
##
@@ -147,4 +153,252 @@ TEST_F(TestHash, TestBuf) {
   }
 }
 
+TEST_F(TestHash, TestSha256Simple) {

Review comment:
   I think you should also add a unit test for HashUtils::GetHash() with 
expected SHA values.
   You can also add a Java test in java/gandiva/ProjectorTest and match by 
generating the SHA using the MessageDigest class.
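
   A minimal sketch of what such a Java-side reference helper could look like 
(the class and helper names here are illustrative, not the actual ProjectorTest 
code; only `MessageDigest` itself is the real JDK API):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative helper: compute a lowercase hex SHA-256 digest with the JDK's
// MessageDigest, to produce expected values for comparing against Gandiva's
// hash output in a unit test.
public class Sha256Reference {
  public static String sha256Hex(byte[] input) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(input);
      // Pad with leading zeros to the full 64 hex characters.
      return String.format("%064x", new BigInteger(1, digest));
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-256 is always available in the JDK
    }
  }

  public static void main(String[] args) {
    // Well-known SHA-256 test vector for the empty input.
    System.out.println(sha256Hex("".getBytes(StandardCharsets.UTF_8)));
    // -> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  }
}
```

   The hex strings produced this way can then be asserted against the values 
the Gandiva SHA functions return.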

##
File path: cpp/src/gandiva/hash_utils.cc
##
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include "openssl/evp.h"
+#include "gandiva/hash_utils.h"
+#include "gandiva/execution_context.h"
+#include "gandiva/gdv_function_stubs.h"
+
+namespace gandiva {
+  const char* HashUtils::HashUsingSha256(int64_t context,
+                                         const void* message,
+                                         size_t message_length,
+                                         int32_t *out_length) {
+    // The buffer size is the hash size + null character
+    int sha256_result_length = 65;
+    return HashUtils::GetHash(context, message, message_length, EVP_sha256(),
+                              sha256_result_length, out_length);
+  }
+
+  const char* HashUtils::HashUsingSha128(int64_t context,
+                                         const void* message,
+                                         size_t message_length,
+                                         int32_t *out_length) {
+    // The buffer size is the hash size + null character
+    int sha128_result_length = 41;
+    return HashUtils::GetHash(context, message, message_length, EVP_sha1(),
+                              sha128_result_length, out_length);
+  }
+
+  const char* HashUtils::GetHash(int64_t context,
+                                 const void* message,
+                                 size_t message_length,
+                                 const EVP_MD *hash_type,
+                                 int result_buf_size,
+                                 int32_t *out_length) {
+    EVP_MD_CTX *md_ctx = EVP_MD_CTX_new();
+
+    if (md_ctx == nullptr) {
+      HashUtils::ErrorMessage(context, "Could not allocate memory "
+                                       "for SHA processing.");
+      return "";
+    }
+
+    int evp_success_status = 1;
+
+    if (EVP_DigestInit_ex(md_ctx, hash_type, nullptr) != evp_success_status) {
+      HashUtils::ErrorMessage(context, "Could not obtain the hash "
+                                       "for the defined value.");
+      EVP_MD_CTX_free(md_ctx);
+      return "";
+    }
+
+    if (EVP_DigestUpdate(md_ctx, message, message_length) != evp_success_status) {
+      HashUtils::ErrorMessage(context, "Could not obtain the hash for "
+                                       "the defined value.");
+      EVP_MD_CTX_free(md_ctx);
+      return "";
+    }
+
+    int hash_size = EVP_MD_size(hash_type);
+    auto* result = static_cast<unsigned char*>(OPENSSL_malloc(hash_size));
+
+    if (result == nullptr) {
+      HashUtils::ErrorMessage(context, "Could not allocate memory "

[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-801600759


   > If you've already started 
[ARROW-11899](https://issues.apache.org/jira/browse/ARROW-11899) then I'll let 
you finish it up; hopefully it isn't too much work. We are discussing on the ML 
the path forward for LZ4 in general; once that is cleared up we can figure out 
how to do the work, including whether @HedgehogCode is interested in contributing.
   
   Sounds good. Hopefully, I will prepare a PR in a few days. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9745: ARROW-11703: [R] Implement dplyr::arrange() [WIP]

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9745:
URL: https://github.com/apache/arrow/pull/9745#issuecomment-801599144


   https://issues.apache.org/jira/browse/ARROW-11703



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ianmcook opened a new pull request #9745: ARROW-11703: [R] Implement dplyr::arrange() [WIP]

2021-03-17 Thread GitBox


ianmcook opened a new pull request #9745:
URL: https://github.com/apache/arrow/pull/9745


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #9147: ARROW-11177: [Java] ArrowMessage failed to parse compressed grpc stream

2021-03-17 Thread GitBox


emkornfield commented on a change in pull request #9147:
URL: https://github.com/apache/arrow/pull/9147#discussion_r596528147



##
File path: java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java
##
@@ -332,6 +333,23 @@ private static ArrowMessage frame(BufferAllocator allocator, final InputStream s
 
   }
 
+  /**
+   * Gets the first byte with an EOF check; this is especially needed when using gRPC
+   * compression. InflaterInputStream needs another read to update reachEOF after all
+   * bytes have been read.
+   *
+   * @param is InputStream
+   * @return -1 if the stream is exhausted; otherwise the actual value of the first byte.
+   * @throws IOException if reading the first byte fails.
+   */
+  private static int readRawVarint32WithEOFCheck(InputStream is) throws IOException {
+    int firstByte = is.read();

Review comment:
   I agree with @lidavidm; another scenario where zero is perfectly 
allowable is when no data is buffered but EOF hasn't been reached. @stczwd, do 
you think you will be able to revise the PR to fix the underlying issue?
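
   The distinction being discussed can be seen with plain `java.io` streams. 
This standalone sketch (not the ArrowMessage code itself; the class name is 
illustrative) shows that `InputStream.read()` returns -1 only at true end of 
stream, while a zero-byte read via `read(byte[], int, int)` legally returns 0 
even when data remains, so "0 bytes read" must not be treated as EOF:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Standalone illustration: -1 from read() means EOF, but a return of 0 from
// read(byte[], int, int) only means no bytes were transferred on this call.
public class EofCheckDemo {
  // Reads one byte; returns -1 only when the stream is truly exhausted.
  static int firstByteWithEofCheck(InputStream is) throws IOException {
    return is.read();
  }

  public static void main(String[] args) throws IOException {
    InputStream is = new ByteArrayInputStream(new byte[] {42});
    System.out.println(is.read(new byte[0], 0, 0)); // 0: zero-length read, not EOF
    System.out.println(firstByteWithEofCheck(is));  // 42: the buffered byte
    System.out.println(firstByteWithEofCheck(is));  // -1: now at EOF
  }
}
```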





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson closed pull request #9741: ARROW-12005: [R] Fix a bash typo in configure

2021-03-17 Thread GitBox


nealrichardson closed pull request #9741:
URL: https://github.com/apache/arrow/pull/9741


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


emkornfield commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-801585216


   If you've already started ARROW-11899 then I'll let you finish it up; 
hopefully it isn't too much work.  We are discussing on the ML the path forward 
for LZ4 in general; once that is cleared up we can figure out how to do the work, 
including whether @HedgehogCode is interested in contributing.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on a change in pull request #9728: ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults

2021-03-17 Thread GitBox


cyb70289 commented on a change in pull request #9728:
URL: https://github.com/apache/arrow/pull/9728#discussion_r596522463



##
File path: cpp/src/arrow/flight/client.h
##
@@ -94,8 +94,6 @@ class ARROW_FLIGHT_EXPORT FlightWriteSizeStatusDetail : public arrow::StatusDetail
 
 class ARROW_FLIGHT_EXPORT FlightClientOptions {
  public:
-  FlightClientOptions();

Review comment:
   I declared the default constructor as private and found it's called 
implicitly at line 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/client.cc#L1243,
 in the second argument `{}`.
   
   The default constructor now creates an invalid object, so it should not be 
accidentally called.
   Maybe add a constructor with explicit parameters?
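
   The pattern being proposed here -- making implicit construction of an 
invalid object impossible so callers must go through an explicit factory such 
as `FlightClientOptions::Defaults()` -- can be sketched in any language. A 
small Java analogue (the class and values below are illustrative, not Arrow 
code):

```java
// Illustrative sketch of the "no accidental default construction" pattern:
// the no-arg constructor is private, so the only way to obtain a valid,
// fully initialized instance is the explicit defaults() factory.
public class ClientOptions {
  private final int timeoutMs;

  // Private: prevents implicit construction of an invalid object.
  private ClientOptions(int timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // The single supported construction path, mirroring
  // FlightClientOptions::Defaults() in spirit.
  public static ClientOptions defaults() {
    return new ClientOptions(5000); // hypothetical default timeout
  }

  public int timeoutMs() {
    return timeoutMs;
  }
}
```

   With this shape, code that previously default-constructed an options object 
by accident fails to compile, which is exactly the safety property discussed 
above.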





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-801583143


   > +1 thank you. @liyafan82 did you have plans to work on the follow-up items 
or ZSTD? Otherwise I can take them up.
   > 
   > @HedgehogCode any thoughts on how to proceed for LZ4? We can maybe discuss 
more on the performance JIRA?
   
   @emkornfield Thanks a lot for your effort.
   
   I started working on ARROW-11899 yesterday. 
   If you are interested in any of the items (including ARROW-11899), please 
feel free to assign them to yourself. I'd like to help with the 
review/discussions :-)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9744: ARROW-12012: [Java][JDBC] Fix BinaryConsumer reallocation

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9744:
URL: https://github.com/apache/arrow/pull/9744#issuecomment-801571749


   https://issues.apache.org/jira/browse/ARROW-12012



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zxf opened a new pull request #9744: ARROW-12012: [Java][JDBC] Fix BinaryConsumer reallocation

2021-03-17 Thread GitBox


zxf opened a new pull request #9744:
URL: https://github.com/apache/arrow/pull/9744


   [ARROW-12012](https://issues.apache.org/jira/browse/ARROW-12012)
   An exception is thrown when BinaryConsumer consumes a single large value or 
a large amount of data.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9743: first informative error msg for lz4 error

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9743:
URL: https://github.com/apache/arrow/pull/9743#issuecomment-801510996


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename the pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9741: ARROW-12005: [R] Fix a bash typo in configure

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9741:
URL: https://github.com/apache/arrow/pull/9741#issuecomment-801499897


   Revision: e2c22c680aee40a3eadeb3ec583c560372734171
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-228](https://github.com/ursacomputing/crossbow/branches/all?query=actions-228)
   
   |Task|Status|
   |----|------|
   |test-r-install-local|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-228-github-test-r-install-local)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-228-github-test-r-install-local)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #9741: ARROW-12005: [R] Fix a bash typo in configure

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9741:
URL: https://github.com/apache/arrow/pull/9741#issuecomment-801490891


   https://issues.apache.org/jira/browse/ARROW-12005



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 commented on pull request #9682: ARROW-7364: [Rust] Add cast options to cast kernel [WIP]

2021-03-17 Thread GitBox


seddonm1 commented on pull request #9682:
URL: https://github.com/apache/arrow/pull/9682#issuecomment-801488963


   FYI https://github.com/ballista-compute/sqlparser-rs/pull/299 has been 
raised to add TRY_CAST to the parser.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

2021-03-17 Thread GitBox


seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-801488843


   Closed after splitting.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] seddonm1 closed pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

2021-03-17 Thread GitBox


seddonm1 closed pull request #9243:
URL: https://github.com/apache/arrow/pull/9243


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #9621: ARROW-11591: [C++][Compute] Grouped aggregation

2021-03-17 Thread GitBox


bkietz commented on pull request #9621:
URL: https://github.com/apache/arrow/pull/9621#issuecomment-801466302


   @pitrou PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #9621: ARROW-11591: [C++][Compute] Grouped aggregation

2021-03-17 Thread GitBox


bkietz commented on a change in pull request #9621:
URL: https://github.com/apache/arrow/pull/9621#discussion_r596407348



##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -229,6 +604,710 @@ std::unique_ptr AllInit(KernelContext*, const KernelInitArgs& args)
   return ::arrow::internal::make_unique();
 }
 
+struct GroupByImpl : public ScalarAggregator {
+  using AddLengthImpl =
+      std::function<void(const std::shared_ptr<ArrayData>&, int32_t*)>;
+
+  struct GetAddLengthImpl {
+    static constexpr int32_t null_extra_byte = 1;
+
+    static void AddFixedLength(int32_t fixed_length, int64_t num_repeats,
+                               int32_t* lengths) {
+      for (int64_t i = 0; i < num_repeats; ++i) {
+        lengths[i] += fixed_length + null_extra_byte;
+      }
+    }
+
+    static void AddVarLength(const std::shared_ptr<ArrayData>& data,
+                             int32_t* lengths) {
+      using offset_type = typename StringType::offset_type;
+      constexpr int32_t length_extra_bytes = sizeof(offset_type);
+      auto offset = data->offset;
+      const auto offsets = data->GetValues<offset_type>(1);
+      if (data->MayHaveNulls()) {
+        const uint8_t* nulls = data->buffers[0]->data();
+
+        for (int64_t i = 0; i < data->length; ++i) {
+          bool is_null = !BitUtil::GetBit(nulls, offset + i);
+          if (is_null) {
+            lengths[i] += null_extra_byte + length_extra_bytes;
+          } else {
+            lengths[i] += null_extra_byte + length_extra_bytes +
+                          offsets[offset + i + 1] - offsets[offset + i];
+          }
+        }
+      } else {
+        for (int64_t i = 0; i < data->length; ++i) {
+          lengths[i] += null_extra_byte + length_extra_bytes +
+                        offsets[offset + i + 1] - offsets[offset + i];
+        }
+      }
+    }
+
+    template <typename T>
+    Status Visit(const T& input_type) {
+      int32_t num_bytes = (bit_width(input_type.id()) + 7) / 8;
+      add_length_impl = [num_bytes](const std::shared_ptr<ArrayData>& data,
+                                    int32_t* lengths) {
+        AddFixedLength(num_bytes, data->length, lengths);
+      };
+      return Status::OK();
+    }
+
+    Status Visit(const StringType&) {
+      add_length_impl = [](const std::shared_ptr<ArrayData>& data,
+                           int32_t* lengths) {
+        AddVarLength(data, lengths);
+      };
+      return Status::OK();
+    }
+
+    Status Visit(const BinaryType&) {
+      add_length_impl = [](const std::shared_ptr<ArrayData>& data,
+                           int32_t* lengths) {
+        AddVarLength(data, lengths);
+      };
+      return Status::OK();
+    }
+
+    Status Visit(const FixedSizeBinaryType& type) {
+      int32_t num_bytes = type.byte_width();
+      add_length_impl = [num_bytes](const std::shared_ptr<ArrayData>& data,
+                                    int32_t* lengths) {
+        AddFixedLength(num_bytes, data->length, lengths);
+      };
+      return Status::OK();
+    }
+
+    AddLengthImpl add_length_impl;
+  };
+
+  using EncodeNextImpl =
+      std::function<void(const std::shared_ptr<ArrayData>&, uint8_t**)>;
+
+  struct GetEncodeNextImpl {
+    template <int NumBits>
+    static void EncodeSmallFixed(const std::shared_ptr<ArrayData>& data,
+                                 uint8_t** encoded_bytes) {
+      auto raw_input = data->buffers[1]->data();
+      auto offset = data->offset;
+      if (data->MayHaveNulls()) {
+        const uint8_t* nulls = data->buffers[0]->data();
+        for (int64_t i = 0; i < data->length; ++i) {
+          auto& encoded_ptr = encoded_bytes[i];
+          bool is_null = !BitUtil::GetBit(nulls, offset + i);
+          encoded_ptr[0] = is_null ? 1 : 0;
+          encoded_ptr += 1;
+          uint64_t null_multiplier = is_null ? 0 : 1;
+          if (NumBits == 1) {
+            encoded_ptr[0] = static_cast<uint8_t>(
+                null_multiplier * (BitUtil::GetBit(raw_input, offset + i) ? 1 : 0));
+            encoded_ptr += 1;
+          }
+          if (NumBits == 8) {
+            encoded_ptr[0] = static_cast<uint8_t>(
+                null_multiplier *
+                reinterpret_cast<const uint8_t*>(raw_input)[offset + i]);
+            encoded_ptr += 1;
+          }
+          if (NumBits == 16) {
+            reinterpret_cast<uint16_t*>(encoded_ptr)[0] = static_cast<uint16_t>(
+                null_multiplier *
+                reinterpret_cast<const uint16_t*>(raw_input)[offset + i]);
+            encoded_ptr += 2;
+          }
+          if (NumBits == 32) {
+            reinterpret_cast<uint32_t*>(encoded_ptr)[0] = static_cast<uint32_t>(
+                null_multiplier *
+                reinterpret_cast<const uint32_t*>(raw_input)[offset + i]);
+            encoded_ptr += 4;
+          }
+          if (NumBits == 64) {
+            reinterpret_cast<uint64_t*>(encoded_ptr)[0] = static_cast<uint64_t>(
+                null_multiplier *
+                reinterpret_cast<const uint64_t*>(raw_input)[offset + i]);
+            encoded_ptr += 8;
+          }
+        }
+      } else {
+        for (int64_t i = 0; i < data->length; ++i)
[GitHub] [arrow] bkietz commented on a change in pull request #9621: ARROW-11591: [C++][Compute] Grouped aggregation

2021-03-17 Thread GitBox


bkietz commented on a change in pull request #9621:
URL: https://github.com/apache/arrow/pull/9621#discussion_r596406429



##
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##
@@ -229,6 +604,710 @@ std::unique_ptr AllInit(KernelContext*, const KernelInitArgs& args)
   return ::arrow::internal::make_unique();
 }
 
+struct GroupByImpl : public ScalarAggregator {
+  using AddLengthImpl =
+      std::function<void(const std::shared_ptr<ArrayData>&, int32_t*)>;
+
+  struct GetAddLengthImpl {
+    static constexpr int32_t null_extra_byte = 1;
+
+    static void AddFixedLength(int32_t fixed_length, int64_t num_repeats,
+                               int32_t* lengths) {
+      for (int64_t i = 0; i < num_repeats; ++i) {
+        lengths[i] += fixed_length + null_extra_byte;
+      }
+    }
+
+    static void AddVarLength(const std::shared_ptr<ArrayData>& data,
+                             int32_t* lengths) {
+      using offset_type = typename StringType::offset_type;
+      constexpr int32_t length_extra_bytes = sizeof(offset_type);
+      auto offset = data->offset;
+      const auto offsets = data->GetValues<offset_type>(1);
+      if (data->MayHaveNulls()) {
+        const uint8_t* nulls = data->buffers[0]->data();
+
+        for (int64_t i = 0; i < data->length; ++i) {
+          bool is_null = !BitUtil::GetBit(nulls, offset + i);
+          if (is_null) {
+            lengths[i] += null_extra_byte + length_extra_bytes;
+          } else {
+            lengths[i] += null_extra_byte + length_extra_bytes +
+                          offsets[offset + i + 1] - offsets[offset + i];
+          }
+        }
+      } else {
+        for (int64_t i = 0; i < data->length; ++i) {
+          lengths[i] += null_extra_byte + length_extra_bytes +
+                        offsets[offset + i + 1] - offsets[offset + i];
+        }
+      }
+    }
+
+    template <typename T>
+    Status Visit(const T& input_type) {
+      int32_t num_bytes = (bit_width(input_type.id()) + 7) / 8;
+      add_length_impl = [num_bytes](const std::shared_ptr<ArrayData>& data,
+                                    int32_t* lengths) {
+        AddFixedLength(num_bytes, data->length, lengths);
+      };
+      return Status::OK();
+    }
+
+    Status Visit(const StringType&) {
+      add_length_impl = [](const std::shared_ptr<ArrayData>& data,
+                           int32_t* lengths) {
+        AddVarLength(data, lengths);
+      };
+      return Status::OK();
+    }
+
+    Status Visit(const BinaryType&) {
+      add_length_impl = [](const std::shared_ptr<ArrayData>& data,
+                           int32_t* lengths) {
+        AddVarLength(data, lengths);
+      };
+      return Status::OK();
+    }
+
+    Status Visit(const FixedSizeBinaryType& type) {
+      int32_t num_bytes = type.byte_width();
+      add_length_impl = [num_bytes](const std::shared_ptr<ArrayData>& data,
+                                    int32_t* lengths) {
+        AddFixedLength(num_bytes, data->length, lengths);
+      };
+      return Status::OK();
+    }
+
+    AddLengthImpl add_length_impl;
+  };
+
+  using EncodeNextImpl =
+      std::function<void(const std::shared_ptr<ArrayData>&, uint8_t**)>;
+
+  struct GetEncodeNextImpl {
+    template <int NumBits>
+    static void EncodeSmallFixed(const std::shared_ptr<ArrayData>& data,
+                                 uint8_t** encoded_bytes) {
+      auto raw_input = data->buffers[1]->data();
+      auto offset = data->offset;
+      if (data->MayHaveNulls()) {
+        const uint8_t* nulls = data->buffers[0]->data();
+        for (int64_t i = 0; i < data->length; ++i) {
+          auto& encoded_ptr = encoded_bytes[i];
+          bool is_null = !BitUtil::GetBit(nulls, offset + i);
+          encoded_ptr[0] = is_null ? 1 : 0;
+          encoded_ptr += 1;
+          uint64_t null_multiplier = is_null ? 0 : 1;
+          if (NumBits == 1) {
+            encoded_ptr[0] = static_cast<uint8_t>(
+                null_multiplier * (BitUtil::GetBit(raw_input, offset + i) ? 1 : 0));
+            encoded_ptr += 1;
+          }
+          if (NumBits == 8) {
+            encoded_ptr[0] = static_cast<uint8_t>(
+                null_multiplier *
+                reinterpret_cast<const uint8_t*>(raw_input)[offset + i]);
+            encoded_ptr += 1;
+          }
+          if (NumBits == 16) {
+            reinterpret_cast<uint16_t*>(encoded_ptr)[0] = static_cast<uint16_t>(
+                null_multiplier *
+                reinterpret_cast<const uint16_t*>(raw_input)[offset + i]);
+            encoded_ptr += 2;
+          }
+          if (NumBits == 32) {
+            reinterpret_cast<uint32_t*>(encoded_ptr)[0] = static_cast<uint32_t>(
+                null_multiplier *
+                reinterpret_cast<const uint32_t*>(raw_input)[offset + i]);
+            encoded_ptr += 4;
+          }
+          if (NumBits == 64) {
+            reinterpret_cast<uint64_t*>(encoded_ptr)[0] = static_cast<uint64_t>(
+                null_multiplier *
+                reinterpret_cast<const uint64_t*>(raw_input)[offset + i]);
+            encoded_ptr += 8;
+          }
+        }
+      } else {
+        for (int64_t i = 0; i < data->length; ++i)

[GitHub] [arrow] returnString commented on a change in pull request #9710: ARROW-11969: [Rust][DataFusion] Improve Examples in documentation

2021-03-17 Thread GitBox


returnString commented on a change in pull request #9710:
URL: https://github.com/apache/arrow/pull/9710#discussion_r596404956



##
File path: rust/datafusion/README.md
##
@@ -58,6 +58,49 @@ Here are some of the projects known to use DataFusion:
 
 (if you know of another project, please submit a PR to add a link!)
 
+## Example Usage
+
+Run a SQL query against data stored in a CSV:
+
+```rust
+  let mut ctx = ExecutionContext::new();
+  ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;
+
+  // Create a plan to run a SQL query
+  let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;
+
+  // execute and print results
+  let results: Vec<RecordBatch> = df.collect().await?;
+  print_batches(&results)?;
+```
+
+Use the DataFrame API to process data stored in a CSV:
+
+```rust

Review comment:
   Nitpicking: this might be a little bit fun with API churn, e.g. I 
believe the input expr ownership work you've recently opened would change these 
from slices to vecs and we don't have a way to catch that automatically like we 
do for the in-crate docs (am I right in thinking that `cargo test` runs all 
doctests?).
   
   Edit: to be clear, I don't think it's a reason to not do it, just curious if 
anyone has ideas for how to prevent doc drift :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] returnString commented on a change in pull request #9704: ARROW-11964: [Rust][DataFusion] Extend constant folding and parquet filtering support

2021-03-17 Thread GitBox


returnString commented on a change in pull request #9704:
URL: https://github.com/apache/arrow/pull/9704#discussion_r596399314



##
File path: rust/datafusion/src/optimizer/constant_folding.rs
##
@@ -188,6 +188,97 @@ impl<'a> ExprRewriter for ConstantRewriter<'a> {
                         right,
                     },
                 },
+                Operator::Plus => match (left.as_ref(), right.as_ref()) {
+                    (
+                        Expr::Literal(ScalarValue::Float64(l)),
+                        Expr::Literal(ScalarValue::Float64(r)),
+                    ) => match (l, r) {
+                        (Some(l), Some(r)) => {
+                            Expr::Literal(ScalarValue::Float64(Some(l + r)))
+                        }
+                        _ => Expr::Literal(ScalarValue::Float64(None)),
+                    },
+                    (
+                        Expr::Literal(ScalarValue::Int64(l)),
+                        Expr::Literal(ScalarValue::Int64(r)),
+                    ) => match (l, r) {
+                        (Some(l), Some(r)) => {
+                            Expr::Literal(ScalarValue::Int64(Some(l + r)))
+                        }
+                        _ => Expr::Literal(ScalarValue::Int64(None)),
+                    },
+                    (
+                        Expr::Literal(ScalarValue::Int32(l)),
+                        Expr::Literal(ScalarValue::Int32(r)),
+                    ) => match (l, r) {
+                        (Some(l), Some(r)) => {
+                            Expr::Literal(ScalarValue::Int32(Some(l + r)))
+                        }
+                        _ => Expr::Literal(ScalarValue::Int64(None)),

Review comment:
   Would it be worth adding some kind of macro to simplify the maintenance 
of this section? I'm not sure if this is intentionally an Int64 or whether it 
is meant to be an Int32, and I wonder if we could have a macro that reduces 
these math ops to just the ScalarValue variant and an operator (slight 
awkwardness: can't pass operators to macros).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pachamaltese opened a new pull request #9743: first informative error msg for lz4 error

2021-03-17 Thread GitBox


pachamaltese opened a new pull request #9743:
URL: https://github.com/apache/arrow/pull/9743


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jonkeane commented on pull request #9741: ARROW-12005: [R] Fix a bash typo in configure

2021-03-17 Thread GitBox


jonkeane commented on pull request #9741:
URL: https://github.com/apache/arrow/pull/9741#issuecomment-801425608


   @github-actions crossbow submit test-r-install-local







[GitHub] [arrow] github-actions[bot] commented on pull request #9740: ARROW-12003: [R] Fix NOTE re undefined global function group_by_drop_default

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9740:
URL: https://github.com/apache/arrow/pull/9740#issuecomment-801401711


   https://issues.apache.org/jira/browse/ARROW-12003







[GitHub] [arrow] nealrichardson closed pull request #9740: ARROW-12003: [R] Fix NOTE re undefined global function group_by_drop_default

2021-03-17 Thread GitBox


nealrichardson closed pull request #9740:
URL: https://github.com/apache/arrow/pull/9740


   







[GitHub] [arrow] pitrou commented on pull request #9742: [WIP] ARROW-11928: [C++] Execution engine API

2021-03-17 Thread GitBox


pitrou commented on pull request #9742:
URL: https://github.com/apache/arrow/pull/9742#issuecomment-801355489


   @bkietz @wesm Here is an initial stab at the exec node API. Only base 
classes are present.







[GitHub] [arrow] pitrou opened a new pull request #9742: [WIP] ARROW-11928: [C++] Execution engine API

2021-03-17 Thread GitBox


pitrou opened a new pull request #9742:
URL: https://github.com/apache/arrow/pull/9742


   







[GitHub] [arrow] jonkeane opened a new pull request #9741: ARROW-12005: [R] Fix a bash typo in configure

2021-03-17 Thread GitBox


jonkeane opened a new pull request #9741:
URL: https://github.com/apache/arrow/pull/9741


   Without the quotes, we get ` parse error: condition expected: !=`







[GitHub] [arrow] emkornfield closed pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


emkornfield closed pull request #8949:
URL: https://github.com/apache/arrow/pull/8949


   







[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


emkornfield commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-801332194


   +1 thank you.  @liyafan82 did you have plans to work on the follow-up items 
or ZSTD? Otherwise I can take them up.
   
   @hedgehogcode any thoughts on how to proceed for LZ4?  We can maybe discuss 
more on the performance JIRA?







[GitHub] [arrow] emkornfield closed pull request #9421: ARROW-11066: [FlightRPC][Java] Make zero-copy writes a configurable option

2021-03-17 Thread GitBox


emkornfield closed pull request #9421:
URL: https://github.com/apache/arrow/pull/9421


   







[GitHub] [arrow] emkornfield commented on pull request #9421: ARROW-11066: [FlightRPC][Java] Make zero-copy writes a configurable option

2021-03-17 Thread GitBox


emkornfield commented on pull request #9421:
URL: https://github.com/apache/arrow/pull/9421#issuecomment-801327752


   +1 merging.







[GitHub] [arrow] emkornfield commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go

2021-03-17 Thread GitBox


emkornfield commented on a change in pull request #9671:
URL: https://github.com/apache/arrow/pull/9671#discussion_r596278550



##
File path: go/parquet/LICENSE.txt
##
@@ -0,0 +1,1987 @@
+

Review comment:
   yeah, looking back it seems like we needed to copy to make this usable.  
I think there is an open item for go to support symlinks.









[GitHub] [arrow] pitrou commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.

2021-03-17 Thread GitBox


pitrou commented on pull request #9629:
URL: https://github.com/apache/arrow/pull/9629#issuecomment-801280256


   You need to update the git submodule as part of this PR.







[GitHub] [arrow] zeroshade commented on pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go

2021-03-17 Thread GitBox


zeroshade commented on pull request #9671:
URL: https://github.com/apache/arrow/pull/9671#issuecomment-801275287


   rebased branch with master, hopefully that fixes the failing check :) 







[GitHub] [arrow] sbinet commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go

2021-03-17 Thread GitBox


sbinet commented on a change in pull request #9671:
URL: https://github.com/apache/arrow/pull/9671#discussion_r596236416



##
File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.s
##
@@ -0,0 +1,174 @@
+   .text

Review comment:
   ok.
   
   fine by me.









[GitHub] [arrow] zeroshade commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go

2021-03-17 Thread GitBox


zeroshade commented on a change in pull request #9671:
URL: https://github.com/apache/arrow/pull/9671#discussion_r596233932



##
File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.s
##
@@ -0,0 +1,174 @@
+   .text

Review comment:
   Well, the assembly in the _lib directory technically doesn't have to 
exist at build time / run time. It's only needed for generating the .s files in 
the bmi directory itself via c2goasm. So you'd still have seamless `go get` 
installation workflow and wouldn't need to regenerate the _lib .s files in 
order to use this package. 
   
   That all being said, I'd rather keep all the files checked in anyways.









[GitHub] [arrow] jmgpeeters commented on pull request #9629: ARROW-11838: [C++] Support IPC reads with shared dictionaries.

2021-03-17 Thread GitBox


jmgpeeters commented on pull request #9629:
URL: https://github.com/apache/arrow/pull/9629#issuecomment-801261388


   Hm, for some reason the github integration test checked out an older version 
of the arrow-testing data:
   
   
   Run ci/scripts/util_checkout.sh
   Submodule 'cpp/submodules/parquet-testing' 
(https://github.com/apache/parquet-testing.git) registered for path 
'cpp/submodules/parquet-testing'
   Submodule 'testing' (https://github.com/apache/arrow-testing) registered for 
path 'testing'
   Cloning into 
'/home/runner/work/arrow/arrow/cpp/submodules/parquet-testing'...
   Cloning into '/home/runner/work/arrow/arrow/testing'...
   Submodule path 'cpp/submodules/parquet-testing': checked out 
'8e7badc6a3817a02e06d17b5d8ab6b6dc356e890'
   From https://github.com/apache/arrow-testing
 * branch e8ce32338f2dfeca3a5126f7677bdee159604000 -> FETCH_HEAD
   Submodule path 'testing': checked out 
'e8ce32338f2dfeca3a5126f7677bdee159604000'
   
   
   which corresponds to one of yours: 
https://github.com/apache/arrow-testing/commit/e8ce32338f2dfeca3a5126f7677bdee159604000
   
   Unless this is immediately obvious to you (the script seems to work fine for 
me locally), I'll take a look tomorrow.







[GitHub] [arrow] sbinet commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go

2021-03-17 Thread GitBox


sbinet commented on a change in pull request #9671:
URL: https://github.com/apache/arrow/pull/9671#discussion_r596219612



##
File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.s
##
@@ -0,0 +1,174 @@
+   .text

Review comment:
   well, removing those would require people to have all the tooling 
installed to regenerate them before being able to use go-parquet, right?
   
   I'd rather keep all the files checked in that are needed to preserve the 
usual seamless "go get" installation workflow.









[GitHub] [arrow] sbinet commented on a change in pull request #9671: ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go

2021-03-17 Thread GitBox


sbinet commented on a change in pull request #9671:
URL: https://github.com/apache/arrow/pull/9671#discussion_r596217949



##
File path: go/parquet/internal/bmi/_lib/bitmap_bmi2.c
##
@@ -0,0 +1,38 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+void extract_bits(uint64_t bitmap, uint64_t select_bitmap, uint64_t* res) {
+   *res = _pext_u64(bitmap, select_bitmap);
+}
+
+void popcount64(uint64_t bitmap, uint64_t* res) {

Review comment:
   giving back to Caesar what's Caesar's...
   (a.k.a @stuartcarnie in that instance)
   
and where credits are due...
   https://www.influxdata.com/blog/influxdata-apache-arrow-go-implementation/









[GitHub] [arrow] mm0708 commented on issue #8099: Specifying schema when using write_feather?

2021-03-17 Thread GitBox


mm0708 commented on issue #8099:
URL: https://github.com/apache/arrow/issues/8099#issuecomment-801249604


   I dropped the column causing issues because it wasn't essential 







[GitHub] [arrow] kdkavanagh commented on issue #8099: Specifying schema when using write_feather?

2021-03-17 Thread GitBox


kdkavanagh commented on issue #8099:
URL: https://github.com/apache/arrow/issues/8099#issuecomment-801240233


   @mm0708 what did you conclude here? I have this same question/problem







[GitHub] [arrow] pitrou closed pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


pitrou closed pull request #9738:
URL: https://github.com/apache/arrow/pull/9738


   







[GitHub] [arrow] pitrou commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


pitrou commented on pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#issuecomment-801225326


   @dianaclarke You may want to enable Travis-CI on your Arrow fork for faster 
CI results.
   (in this case this doesn't matter, since Travis-CI doesn't run Python tests)







[GitHub] [arrow] pitrou commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


pitrou commented on pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#issuecomment-801224615


   Green CI run at https://github.com/dianaclarke/arrow/runs/2131383253







[GitHub] [arrow] ianmcook opened a new pull request #9740: ARROW-12003: [R] Fix NOTE re undefined global function group_by_drop_default

2021-03-17 Thread GitBox


ianmcook opened a new pull request #9740:
URL: https://github.com/apache/arrow/pull/9740


   







[GitHub] [arrow] github-actions[bot] commented on pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9739:
URL: https://github.com/apache/arrow/pull/9739#issuecomment-801195182


   https://issues.apache.org/jira/browse/ARROW-12000







[GitHub] [arrow] nealrichardson closed pull request #9733: ARROW-11996: [R] Make r/configure run successfully on Solaris

2021-03-17 Thread GitBox


nealrichardson closed pull request #9733:
URL: https://github.com/apache/arrow/pull/9733


   







[GitHub] [arrow] lidavidm commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9725:
URL: https://github.com/apache/arrow/pull/9725#discussion_r596093793



##
File path: cpp/src/arrow/dataset/file_csv.h
##
@@ -37,6 +37,11 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
  public:
   /// Options affecting the parsing of CSV files
   csv::ParseOptions parse_options = csv::ParseOptions::Defaults();
+  /// Options affecting how CSV files are read.
+  ///
+  /// Note use_threads is ignored (it is always considered false) and 
block_size
+  /// should be set on CsvFragmentScanOptions.
+  csv::ReadOptions read_options = csv::ReadOptions::Defaults();

Review comment:
   Makes sense. I was going back and forth on it because of the redundancy. 
I'll inline all relevant fields and xref ReadOptions for documentation.









[GitHub] [arrow] bkietz commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python

2021-03-17 Thread GitBox


bkietz commented on a change in pull request #9725:
URL: https://github.com/apache/arrow/pull/9725#discussion_r596084650



##
File path: cpp/src/arrow/dataset/dataset_internal.h
##
@@ -185,5 +185,14 @@ inline bool operator==(const SubtreeImpl::Encoded& l, 
const SubtreeImpl::Encoded
  l.partition_expression == r.partition_expression;
 }
 
+template <typename T>
+std::shared_ptr<T> DowncastFragmentScanOptions(
+const std::shared_ptr& scan_options, const std::string& 
type_name) {
+  if (!scan_options) return nullptr;
+  if (!scan_options->fragment_scan_options) return nullptr;
+  if (scan_options->fragment_scan_options->type_name() != type_name) return 
nullptr;

Review comment:
   I think this should be an error
   ```suggestion
 if (scan_options->fragment_scan_options->type_name() != type_name) {
   return Status::Invalid("FragmentScanOptions of type ", 
scan_options->fragment_scan_options->type_name(),
   " were provided for scanning a fragment of type ", type_name);
 }
   ```









[GitHub] [arrow] bkietz commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python

2021-03-17 Thread GitBox


bkietz commented on a change in pull request #9725:
URL: https://github.com/apache/arrow/pull/9725#discussion_r596092120



##
File path: cpp/src/arrow/dataset/file_csv.h
##
@@ -37,6 +37,11 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
  public:
   /// Options affecting the parsing of CSV files
   csv::ParseOptions parse_options = csv::ParseOptions::Defaults();
+  /// Options affecting how CSV files are read.
+  ///
+  /// Note use_threads is ignored (it is always considered false) and 
block_size
+  /// should be set on CsvFragmentScanOptions.
+  csv::ReadOptions read_options = csv::ReadOptions::Defaults();

Review comment:
   I'm not sure it's reasonable to add `ReadOptions` here. I think only 
`skip_rows` and `autogenerate_column_names` are meaningful here. Column 
renaming can be accomplished in projection if desired (obviating 
`column_names`) and as noted `use_threads` and `block_size` are configured 
elsewhere.









[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] WIP: Add helper to generate random record batches by schema

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9715:
URL: https://github.com/apache/arrow/pull/9715#discussion_r596088841



##
File path: cpp/src/arrow/testing/random.cc
##
@@ -558,5 +584,248 @@ std::shared_ptr 
RandomArrayGenerator::ArrayOf(std::shared_ptr t
   return RandomArrayGeneratorOfImpl{this, type, size, null_probability, 
nullptr}.Finish();
 }
 
+namespace {
+template <typename T>
+typename T::c_type GetMetadata(const KeyValueMetadata* metadata, const 
std::string& key,
+   typename T::c_type default_value) {

Review comment:
   Nice, I admit I'm not as up to speed on my template metaprogramming as 
I'd like to be.









[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9715:
URL: https://github.com/apache/arrow/pull/9715#discussion_r596087724



##
File path: cpp/src/arrow/testing/random.cc
##
@@ -558,5 +584,248 @@ std::shared_ptr 
RandomArrayGenerator::ArrayOf(std::shared_ptr t
   return RandomArrayGeneratorOfImpl{this, type, size, null_probability, 
nullptr}.Finish();
 }
 
+namespace {
+template <typename T>
+typename T::c_type GetMetadata(const KeyValueMetadata* metadata, const 
std::string& key,
+   typename T::c_type default_value) {
+  if (!metadata) return default_value;
+  const auto index = metadata->FindKey(key);
+  if (index < 0) return default_value;
+  const auto& value = metadata->value(index);
+  typename T::c_type output{};
+  auto type = checked_pointer_cast(TypeTraits::type_singleton());
+  if (!internal::ParseValue(*type, value.data(), value.length(), &output)) {
+ABORT_NOT_OK(Status::Invalid("Could not parse ", key, " = ", value));
+  }
+  return output;
+}
+
+Result<std::shared_ptr<Array>> GenerateArray(const Field& field, int64_t length,
+ RandomArrayGenerator* generator) {
+#define GENERATE_INTEGRAL_CASE_VIEW(BASE_TYPE, VIEW_TYPE)  
  \
+  case VIEW_TYPE::type_id: {   
  \
+const BASE_TYPE::c_type min_value = GetMetadata(
  \
+field.metadata().get(), "min", 
std::numeric_limits::min());   \
+const BASE_TYPE::c_type max_value = GetMetadata(
  \
+field.metadata().get(), "max", 
std::numeric_limits::max());   \
+return generator->Numeric(length, min_value, max_value, 
null_probability) \
+->View(field.type());  
  \
+  }
+#define GENERATE_INTEGRAL_CASE(ARROW_TYPE) \
+  GENERATE_INTEGRAL_CASE_VIEW(ARROW_TYPE, ARROW_TYPE)
+#define GENERATE_FLOATING_CASE(ARROW_TYPE, GENERATOR_FUNC) 
 \
+  case ARROW_TYPE::type_id: {  
 \
+const ARROW_TYPE::c_type min_value = GetMetadata(  
 \
+field.metadata().get(), "min", 
std::numeric_limits::min()); \
+const ARROW_TYPE::c_type max_value = GetMetadata(  
 \
+field.metadata().get(), "max", 
std::numeric_limits::max()); \
+const double nan_probability = 
 \
+GetMetadata(field.metadata().get(), "nan_probability", 0); 
 \
+return generator->GENERATOR_FUNC(length, min_value, max_value, 
null_probability,\
+ nan_probability); 
 \
+  }
+
+  const double null_probability =
+  field.nullable()
+  ? GetMetadata(field.metadata().get(), 
"null_probability", 0.01)
+  : 0.0;
+  switch (field.type()->id()) {
+case Type::type::NA:
+  return std::make_shared<NullArray>(length);
+
+case Type::type::BOOL: {
+  const double true_probability =
+  GetMetadata(field.metadata().get(), "true_probability", 
0.5);
+  return generator->Boolean(length, true_probability, null_probability);
+}
+
+  GENERATE_INTEGRAL_CASE(UInt8Type);
+  GENERATE_INTEGRAL_CASE(Int8Type);
+  GENERATE_INTEGRAL_CASE(UInt16Type);
+  GENERATE_INTEGRAL_CASE(Int16Type);
+  GENERATE_INTEGRAL_CASE(UInt32Type);
+  GENERATE_INTEGRAL_CASE(Int32Type);
+  GENERATE_INTEGRAL_CASE(UInt64Type);
+  GENERATE_INTEGRAL_CASE(Int64Type);
+  GENERATE_INTEGRAL_CASE_VIEW(Int16Type, HalfFloatType);
+  GENERATE_FLOATING_CASE(FloatType, Float32);
+  GENERATE_FLOATING_CASE(DoubleType, Float64);
+
+case Type::type::STRING:
+case Type::type::BINARY: {
+  const int32_t min_length = 
GetMetadata(field.metadata().get(), "min", 0);
+  const int32_t max_length =
+  GetMetadata(field.metadata().get(), "max", 1024);
+  const int32_t unique_values =
+  GetMetadata(field.metadata().get(), "unique", -1);
+  if (unique_values > 0) {
+return generator
+->StringWithRepeats(length, unique_values, min_length, max_length,
+null_probability)
+->View(field.type());
+  }
+  return generator->String(length, min_length, max_length, 
null_probability)
+  ->View(field.type());
+}
+
+case Type::type::DECIMAL128:
+case Type::type::DECIMAL256:
+case Type::type::FIXED_SIZE_BINARY: {
+  auto byte_width =
+  
internal::checked_pointer_cast(field.type())->byte_width();
+  return generator->FixedSizeBinary(length, byte_width, null_probability)
+  ->View(field.type());
+}
+
+  GENERATE_INTEGRAL_CASE_VIEW(Int32Type, Date32Type);
+  GENERATE_INTEGRAL_CASE_VIEW(Int64Type, Date64Type);
+  GENERATE_INTEGRAL_CASE_VIEW(Int64Type, TimestampType);
+  GENERATE_INTE

[GitHub] [arrow] pitrou closed pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2021-03-17 Thread GitBox


pitrou closed pull request #8023:
URL: https://github.com/apache/arrow/pull/8023


   







[GitHub] [arrow] pitrou commented on pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2021-03-17 Thread GitBox


pitrou commented on pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#issuecomment-801144014


   I'll merge then. Thank you @thamht4190 and @ggershinsky for contributing 
this!







[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9715:
URL: https://github.com/apache/arrow/pull/9715#discussion_r596086831



##
File path: cpp/src/arrow/testing/random.cc
##
@@ -558,5 +584,248 @@ std::shared_ptr 
RandomArrayGenerator::ArrayOf(std::shared_ptr t
   return RandomArrayGeneratorOfImpl{this, type, size, null_probability, 
nullptr}.Finish();
 }
 
+namespace {
+template <typename T>
+typename T::c_type GetMetadata(const KeyValueMetadata* metadata, const 
std::string& key,
+   typename T::c_type default_value) {
+  if (!metadata) return default_value;
+  const auto index = metadata->FindKey(key);
+  if (index < 0) return default_value;
+  const auto& value = metadata->value(index);
+  typename T::c_type output{};
+  auto type = checked_pointer_cast(TypeTraits::type_singleton());
+  if (!internal::ParseValue(*type, value.data(), value.length(), &output)) {
+ABORT_NOT_OK(Status::Invalid("Could not parse ", key, " = ", value));
+  }
+  return output;
+}
+
+Result<std::shared_ptr<Array>> GenerateArray(const Field& field, int64_t length,
+ RandomArrayGenerator* generator) {
+#define GENERATE_INTEGRAL_CASE_VIEW(BASE_TYPE, VIEW_TYPE)  
  \
+  case VIEW_TYPE::type_id: {   
  \
+const BASE_TYPE::c_type min_value = GetMetadata(
  \
+field.metadata().get(), "min", 
std::numeric_limits::min());   \
+const BASE_TYPE::c_type max_value = GetMetadata(
  \
+field.metadata().get(), "max", 
std::numeric_limits::max());   \
+return generator->Numeric(length, min_value, max_value, 
null_probability) \
+->View(field.type());  
  \
+  }
+#define GENERATE_INTEGRAL_CASE(ARROW_TYPE) \
+  GENERATE_INTEGRAL_CASE_VIEW(ARROW_TYPE, ARROW_TYPE)
+#define GENERATE_FLOATING_CASE(ARROW_TYPE, GENERATOR_FUNC) 
 \
+  case ARROW_TYPE::type_id: {  
 \
+const ARROW_TYPE::c_type min_value = GetMetadata(  
 \
+field.metadata().get(), "min", 
std::numeric_limits::min()); \
+const ARROW_TYPE::c_type max_value = GetMetadata(  
 \
+field.metadata().get(), "max", 
std::numeric_limits::max()); \
+const double nan_probability = 
 \
+GetMetadata(field.metadata().get(), "nan_probability", 0); 
 \
+return generator->GENERATOR_FUNC(length, min_value, max_value, 
null_probability,\
+ nan_probability); 
 \
+  }
+
+  const double null_probability =
+  field.nullable()
+  ? GetMetadata(field.metadata().get(), 
"null_probability", 0.01)
+  : 0.0;
+  switch (field.type()->id()) {
+case Type::type::NA:

Review comment:
   Yeah, I will add more validation for these things - I also noticed that 
a null_probability outside [0,1] on MSVC leads to runtime errors (which do not 
appear to apply to GCC).









[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9715:
URL: https://github.com/apache/arrow/pull/9715#discussion_r596085481



##
File path: cpp/src/arrow/testing/random.h
##
@@ -358,6 +362,10 @@ class ARROW_TESTING_EXPORT RandomArrayGenerator {
   std::default_random_engine seed_rng_;
 };
 
+ARROW_TESTING_EXPORT
+std::shared_ptr<RecordBatch> Generate(const FieldVector& fields, int64_t size,
+                                      SeedType seed);

Review comment:
   I am not particularly for magic like that. But maybe we could make 
Generate{Batch,Table,Array,...} methods of the generator so that they can 
share a seed?
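
   The idea of hanging the Generate{Batch,Table,Array} helpers off one
generator object so they share a root seed could be sketched like this
(Python stand-in; class and method names are illustrative, not Arrow API):

   ```python
   import random

   class RandomBatchGenerator:
       """One root seed; each call derives a child seed, so whole suites are
       reproducible without threading an explicit seed through every call."""

       def __init__(self, seed):
           self._rng = random.Random(seed)

       def generate_array(self, length):
           # Derive a per-call child seed from the shared root RNG.
           child = random.Random(self._rng.getrandbits(32))
           return [child.random() for _ in range(length)]

       def generate_batch(self, n_columns, length):
           return [self.generate_array(length) for _ in range(n_columns)]

   # Two generators built from the same seed produce identical data.
   a = RandomBatchGenerator(0xDEADBEEF).generate_batch(2, 4)
   b = RandomBatchGenerator(0xDEADBEEF).generate_batch(2, 4)
   ```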









[GitHub] [arrow] pitrou commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r596085147



##
File path: docs/source/python/pandas.rst
##
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be

Review comment:
   Ah, you're right, thanks.









[GitHub] [arrow] lidavidm commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9715:
URL: https://github.com/apache/arrow/pull/9715#discussion_r596084830



##
File path: cpp/src/arrow/testing/random.h
##
@@ -358,6 +362,10 @@ class ARROW_TESTING_EXPORT RandomArrayGenerator {
   std::default_random_engine seed_rng_;
 };
 
+ARROW_TESTING_EXPORT
+std::shared_ptr<RecordBatch> Generate(const FieldVector& fields, int64_t size,

Review comment:
   Might it be better to just allow passing an explicit schema instead?









[GitHub] [arrow] lidavidm commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r596083089



##
File path: docs/source/python/pandas.rst
##
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be

Review comment:
   Yes, multiple arrays may share the same buffer. In IPC for instance, we 
read a record batch's worth of data from the file at a time, and hence all 
arrays in that batch share the same buffer. In Flight, similarly, we receive a 
record batch's worth of data from gRPC and (for implementation reasons) 
concatenate it into a single buffer, so we end up in the same situation. (I 
don't think this applies to, say, Parquet or CSV, since there's actual decoding 
for those formats, but haven't tested it.)
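
   A rough stdlib-only illustration of the point (not pyarrow; a bytearray
subclass stands in for one record batch's single allocation, and memoryviews
stand in for the per-column arrays; relies on CPython's refcounting):

   ```python
   import weakref

   class Allocation(bytearray):
       """Stands in for the single buffer backing all arrays of one batch."""

   def freed_after_dropping_one_column():
       batch = Allocation(b"col0col1")
       col0 = memoryview(batch)[:4]   # "array 0" view into the shared buffer
       col1 = memoryview(batch)[4:]   # "array 1" view into the same buffer
       alive = weakref.ref(batch)
       del batch, col0                # convert/drop one column...
       still_pinned = alive() is not None  # ...frees nothing: col1 pins it
       del col1                       # only once every column view is gone...
       fully_freed = alive() is None  # ...is the whole allocation released
       return still_pinned, fully_freed
   ```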









[GitHub] [arrow] lidavidm commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)

2021-03-17 Thread GitBox


lidavidm commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r596081637



##
File path: docs/source/python/pandas.rst
##
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be
+freed until all of those columns are converted. In particular, data that comes
+from IPC or Flight is prone to this, as memory will be laid out as follows::
+
+  Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
+  Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
+  ...
+
+In this case, no memory can be freed until the entire table is converted, even
+with ``self_destruct=True``.
+
+Additionally, even if memory is freed by Arrow, depending on the allocator in
+use, the memory may not be returned to the operating system immediately.

Review comment:
   I'll remove this, but I do feel this is a commonly misunderstood point.









[GitHub] [arrow] bkietz commented on a change in pull request #9715: ARROW-11745: [C++] Add helper to generate random record batches by schema

2021-03-17 Thread GitBox


bkietz commented on a change in pull request #9715:
URL: https://github.com/apache/arrow/pull/9715#discussion_r596029757



##
File path: cpp/src/arrow/testing/random.h
##
@@ -358,6 +362,10 @@ class ARROW_TESTING_EXPORT RandomArrayGenerator {
   std::default_random_engine seed_rng_;
 };
 
+ARROW_TESTING_EXPORT
+std::shared_ptr<RecordBatch> Generate(const FieldVector& fields, int64_t size,
+                                      SeedType seed);

Review comment:
   What do you think about replacing explicit seed arguments (which are 
usually repeated within a suite) with
   ```c++
   /// The seed used for random generation. Non const to support customizing 
the seed for a suite
   static SeedType kSeed = 0xDEADBEEF;
   ```
   
   or
   
   ```c++
   /// Sets the seed used for random generation while it is in scope.
   /// May be constructed at static scope to set the seed for an entire
   /// test/benchmark suite. If no SeedGuard is in scope, the seed will
   /// be kDefaultSeed
   struct SeedGuard {
 explicit SeedGuard(SeedType seed) { PushSeed(seed); }
 ~SeedGuard() { PopSeed(); }
   
 static constexpr SeedType kDefaultSeed = 0xDEADBEEF;
   
 static void PopSeed();
 static void PushSeed(SeedType);
   };
   ```
   ?
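
   A Python sketch of the second (scoped) variant, to show the intended
push/pop behaviour; `SeedGuard`, the default seed, and the stack are taken
from the C++ suggestion above, the rest is illustrative:

   ```python
   _DEFAULT_SEED = 0xDEADBEEF
   _seed_stack = []

   def current_seed():
       """Seed used by generators: innermost guard wins, else the default."""
       return _seed_stack[-1] if _seed_stack else _DEFAULT_SEED

   class SeedGuard:
       """Sets the seed while in scope; guards nest like the C++ RAII form."""
       def __init__(self, seed):
           self._seed = seed
       def __enter__(self):
           _seed_stack.append(self._seed)
           return self
       def __exit__(self, *exc):
           _seed_stack.pop()

   with SeedGuard(42):
       inner = current_seed()   # 42 while the guard is in scope
   outer = current_seed()       # back to the default afterwards
   ```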

##
File path: cpp/src/arrow/testing/random.cc
##
@@ -369,12 +371,16 @@ std::shared_ptr<Array> RandomArrayGenerator::FixedSizeBinary(int64_t size,
                                      std::move(null_bitmap), null_count);
 }
 
-std::shared_ptr<Array> RandomArrayGenerator::Offsets(int64_t size, int32_t first_offset,
-                                                     int32_t last_offset,
-                                                     double null_probability,
-                                                     bool force_empty_nulls) {
-  using GenOpt = GenerateOptions<int32_t, std::uniform_int_distribution<int32_t>>;
-  GenOpt options(seed(), first_offset, last_offset, null_probability);
+namespace {
+template <typename ArrayType>

Review comment:
   When I read this first, I was looking for `typename 
ArrayType::TypeClass::offset_type` because I assumed this parameter referred to 
`StringArray` or so
   ```suggestion
template <typename OffsetArrayType>
   ```

##
File path: cpp/src/arrow/testing/random_test.cc
##
@@ -0,0 +1,195 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+
+#include "arrow/array.h"
+#include "arrow/record_batch.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/type.h"
+#include "arrow/util/key_value_metadata.h"
+
+namespace arrow {
+namespace random {
+
+class RandomArrayTest : public ::testing::TestWithParam<std::shared_ptr<Field>> {
+ protected:
+  std::shared_ptr<Field> GetField() { return GetParam(); }
+};
+
+template <typename T>
+class RandomNumericArrayTest : public ::testing::Test {
+ protected:
+  std::shared_ptr<Field> GetField() { return field("field0", std::make_shared<T>()); }
+
+  std::shared_ptr<NumericArray<T>> Downcast(std::shared_ptr<Array> array) {
+    return internal::checked_pointer_cast<NumericArray<T>>(array);
+  }
+};
+
+TEST_P(RandomArrayTest, GenerateArray) {
+  auto field = GetField();
+  auto batch = Generate({field}, 128, 0xDEADBEEF);
+  AssertSchemaEqual(schema({field}), batch->schema());
+  auto array = batch->column(0);
+  ASSERT_EQ(128, array->length());
+  ASSERT_OK(array->ValidateFull());
+}
+
+TEST_P(RandomArrayTest, GenerateNonNullArray) {
+  auto field =
+      GetField()->WithMetadata(key_value_metadata({{"null_probability", "0.0"}}));
+  if (field->type()->id() == Type::type::NA) {
+GTEST_SKIP() << "Cannot generate non-null null arrays";
+  }
+  auto batch = Generate({field}, 128, 0xDEADBEEF);
+  AssertSchemaEqual(schema({field}), batch->schema());
+  auto array = batch->column(0);
+  ASSERT_OK(array->ValidateFull());
+  ASSERT_EQ(0, array->null_count());
+}
+
+TEST_P(RandomArrayTest, GenerateNonNullableArray) {
+  auto field = GetField()->WithNullable(false);
+  if (field->type()->id() == Type::type::NA) {
+GTEST_SKIP() << "Cannot generate non-null null arrays";
+  }
+  auto batch = Generate({field}, 128, 0xDEADBEEF);
+  AssertSchemaEqual(schema({field}), batch->schema());
+  auto array = batch->column(0);
+  ASSERT_OK(array->ValidateFull());
+  ASSERT_EQ(0, array->null_count());
+}
+
+struct FieldParamName

[GitHub] [arrow] pitrou commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r596070474



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// --
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
   Well, you don't need to refactor other kernels for now, but I suppose 
this one could easily be adapted, no? :-)









[GitHub] [arrow] pitrou closed pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask

2021-03-17 Thread GitBox


pitrou closed pull request #9720:
URL: https://github.com/apache/arrow/pull/9720


   







[GitHub] [arrow] pitrou commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r596063732



##
File path: docs/source/python/pandas.rst
##
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be

Review comment:
   What does "share" mean here? Do you mean different arrays share the same 
underlying buffer? How does that happen?









[GitHub] [arrow] pitrou commented on a change in pull request #9730: ARROW-9878: [Python] Document caveats of to_pandas(self_destruct=True)

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #9730:
URL: https://github.com/apache/arrow/pull/9730#discussion_r596063175



##
File path: docs/source/python/pandas.rst
##
@@ -293,3 +293,19 @@ Used together, the call
 
 will yield significantly lower memory usage in some scenarios. Without these
 options, ``to_pandas`` will always double memory.
+
+Note that ``self_destruct=True`` is not guaranteed to save memory. Since the
+conversion happens column by column, memory is also freed column by column. But
+if multiple columns share an underlying allocation, then no memory will be
+freed until all of those columns are converted. In particular, data that comes
+from IPC or Flight is prone to this, as memory will be laid out as follows::
+
+  Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
+  Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
+  ...
+
+In this case, no memory can be freed until the entire table is converted, even
+with ``self_destruct=True``.
+
+Additionally, even if memory is freed by Arrow, depending on the allocator in
+use, the memory may not be returned to the operating system immediately.

Review comment:
   This is true of all deallocations, so I find it rather un-useful to 
mention specifically here.









[GitHub] [arrow] pitrou commented on pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes

2021-03-17 Thread GitBox


pitrou commented on pull request #9739:
URL: https://github.com/apache/arrow/pull/9739#issuecomment-801118593


   This doesn't seem accurate. We have `struct` types in public headers (such as 
`filesystem.h`).
   I think @wesm 's rationale seems more faithful to reality:
   > In the public headers we should be trying to use struct only for objects 
that are principally simple data containers where we are OK with exposing all 
the internal members, and any methods are primarily conveniences.
   







[GitHub] [arrow] dianaclarke commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


dianaclarke commented on a change in pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#discussion_r596059886



##
File path: python/pyarrow/tests/test_table.py
##
@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
 eq([a, c], [d])
 ne([c, a], [a, c])
 
-assert not pa.chunked_array([], type=pa.int32()).equals(None)
+# ARROW-4822
+with pytest.raises(AttributeError):
+pa.chunked_array([], type=pa.int32()).equals(None)

Review comment:
   Oh, I just saw your edit: `table.equals(None) can still raise TypeError 
though.`
   
   Are you okay with this? This is how it was before.
   
   ```
   >>> import pyarrow
   >>> table = pyarrow.chunked_array([], type=pyarrow.int32())
   >>> table.equals(None)
   False
   ```









[GitHub] [arrow] dianaclarke commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


dianaclarke commented on a change in pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#discussion_r596057813



##
File path: python/pyarrow/tests/test_table.py
##
@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
 eq([a, c], [d])
 ne([c, a], [a, c])
 
-assert not pa.chunked_array([], type=pa.int32()).equals(None)
+# ARROW-4822
+with pytest.raises(AttributeError):
+pa.chunked_array([], type=pa.int32()).equals(None)

Review comment:
   Behaviour is back to normal now.









[GitHub] [arrow] pitrou commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#discussion_r596043491



##
File path: python/pyarrow/tests/test_table.py
##
@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
 eq([a, c], [d])
 ne([c, a], [a, c])
 
-assert not pa.chunked_array([], type=pa.int32()).equals(None)
+# ARROW-4822
+with pytest.raises(AttributeError):
+pa.chunked_array([], type=pa.int32()).equals(None)

Review comment:
   Right, `table == None` should return False. `table.equals(None)` can 
still raise `TypeError` though.













[GitHub] [arrow] dianaclarke commented on a change in pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


dianaclarke commented on a change in pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#discussion_r596042150



##
File path: python/pyarrow/tests/test_table.py
##
@@ -271,7 +271,9 @@ def ne(xarrs, yarrs):
 eq([a, c], [d])
 ne([c, a], [a, c])
 
-assert not pa.chunked_array([], type=pa.int32()).equals(None)
+# ARROW-4822
+with pytest.raises(AttributeError):
+pa.chunked_array([], type=pa.int32()).equals(None)

Review comment:
   This is a behaviour change though.
   
   Before:
   ```
   >>> import pyarrow
   >>> table = pyarrow.chunked_array([], type=pyarrow.int32())
   >>> table is None
   False
   >>> table == None
   False
   ```
   
   After:
   
   ```
   >>> import pyarrow
   >>> table = pyarrow.chunked_array([], type=pyarrow.int32())
   >>> table is None
   False
   >>> table == None
   Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
 File "pyarrow/table.pxi", line 187, in pyarrow.lib.ChunkedArray.__eq__
   return self.equals(other)
 File "pyarrow/table.pxi", line 212, in pyarrow.lib.ChunkedArray.equals
   CChunkedArray* other_arr = other.chunked_array
   AttributeError: 'NoneType' object has no attribute 'chunked_array'
   ```









[GitHub] [arrow] bkietz closed pull request #9685: ARROW-10372: [Dataset][C++][Python][R] Support reading compressed CSV

2021-03-17 Thread GitBox


bkietz closed pull request #9685:
URL: https://github.com/apache/arrow/pull/9685


   







[GitHub] [arrow] dianaclarke commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


dianaclarke commented on pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#issuecomment-801073738


   Oh, thanks a ton @pitrou!
   
   I'm working on the additional failing tests now. 







[GitHub] [arrow] pitrou commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


pitrou commented on pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#issuecomment-801062229


   This is a recurring problem, so instead of testing for None manually we 
could let Cython do it for us:
   ```diff
   diff --git a/python/pyarrow/lib.pyx b/python/pyarrow/lib.pyx
   index eba0f5a47..aa6918b54 100644
   --- a/python/pyarrow/lib.pyx
   +++ b/python/pyarrow/lib.pyx
   @@ -15,9 +15,10 @@
# specific language governing permissions and limitations
# under the License.

   -# cython: profile=False
   -# distutils: language = c++
   +# cython: profile = False
# cython: embedsignature = True
   +# cython: nonecheck = True
   +# distutils: language = c++

import datetime
import decimal as _pydecimal
   diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
   index 2d474ba76..3f1fc28ee 100644
   --- a/python/pyarrow/table.pxi
   +++ b/python/pyarrow/table.pxi
@@ -2225,8 +2225,6 @@ def concat_tables(tables, c_bool promote=False, MemoryPool memory_pool=None):
CConcatenateTablesOptions.Defaults())

for table in tables:
   -if table is None:
   -raise TypeError("Unable to concatenate table. Table is None.")
c_tables.push_back(table.sp_table)

with nogil:
   ```
   
   Note that a few tests will need fixing.
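   
   What `nonecheck=True` buys can be mimicked in plain Python, where attribute 
access on None already raises instead of dereferencing a null pointer (the 
`FakeTable` class and `rows` attribute here are made up for illustration):
   
   ```python
   class FakeTable:
       """Minimal stand-in for pyarrow.Table in this sketch."""
       def __init__(self, rows):
           self.rows = rows

   def concat_tables_sketch(tables):
       out = []
       for table in tables:
           # With Cython's nonecheck, attribute access on a None element
           # (e.g. `table.sp_table`) raises AttributeError, just like this
           # pure-Python attribute access, rather than crashing the process.
           out.extend(table.rows)
       return out

   combined = concat_tables_sketch([FakeTable([1]), FakeTable([2])])  # [1, 2]
   ```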







[GitHub] [arrow] dianaclarke commented on a change in pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes

2021-03-17 Thread GitBox


dianaclarke commented on a change in pull request #9739:
URL: https://github.com/apache/arrow/pull/9739#discussion_r595979695



##
File path: docs/source/developers/cpp/development.rst
##
@@ -77,6 +77,10 @@ This project follows `Google's C++ Style Guide
 * We relax the line length restriction to 90 characters.
 * We use the ``NULLPTR`` macro in header files (instead of ``nullptr``) defined
   in ``src/arrow/util/macros.h`` to support building C++/CLI (ARROW-1134)
+* For internal types (not in public headers) it is ok to use structs for
+  convenience (e.g. list initialization) if access control is not needed
+  (e.g. no public access to a nested type) even if the struct has
+  state/invaiants

Review comment:
   `s/invaiants/invariants/` ?









[GitHub] [arrow] westonpace opened a new pull request #9739: ARROW-12000: [Documentation] Add note about deviation from style guide on struct/classes

2021-03-17 Thread GitBox


westonpace opened a new pull request #9739:
URL: https://github.com/apache/arrow/pull/9739


   
![docs](https://user-images.githubusercontent.com/1696093/111465419-2b1e8880-86c6-11eb-98d5-5e5b873c224c.png)
   







[GitHub] [arrow] lidavidm commented on pull request #9421: ARROW-11066: [FlightRPC][Java] Make zero-copy writes a configurable option

2021-03-17 Thread GitBox


lidavidm commented on pull request #9421:
URL: https://github.com/apache/arrow/pull/9421#issuecomment-801015523


   I think we should be all good (obviously let's keep an eye out for if it 
regresses again though), thanks for taking a look.







[GitHub] [arrow] lidavidm commented on pull request #9147: ARROW-11177: [Java] ArrowMessage failed to parse compressed grpc stream

2021-03-17 Thread GitBox


lidavidm commented on pull request #9147:
URL: https://github.com/apache/arrow/pull/9147#issuecomment-801014998


   I still believe we should not be polling available().







[GitHub] [arrow] lidavidm commented on pull request #9728: ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults

2021-03-17 Thread GitBox


lidavidm commented on pull request #9728:
URL: https://github.com/apache/arrow/pull/9728#issuecomment-801014690


   Done, sorry (I thought I ran archery locally…)







[GitHub] [arrow] github-actions[bot] commented on pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


github-actions[bot] commented on pull request #9738:
URL: https://github.com/apache/arrow/pull/9738#issuecomment-801014571


   https://issues.apache.org/jira/browse/ARROW-11997







[GitHub] [arrow] kszucs closed pull request #9713: ARROW-11971: [Packaging] Vcpkg patch doesn't apply on windows due to line endings

2021-03-17 Thread GitBox


kszucs closed pull request #9713:
URL: https://github.com/apache/arrow/pull/9713


   







[GitHub] [arrow] kszucs commented on pull request #9713: ARROW-11971: [Packaging] Vcpkg patch doesn't apply on windows due to line endings

2021-03-17 Thread GitBox


kszucs commented on pull request #9713:
URL: https://github.com/apache/arrow/pull/9713#issuecomment-800999549


   CI error is unrelated, merging.







[GitHub] [arrow] dianaclarke opened a new pull request #9738: ARROW-11997: [Python] concat_tables crashes python interpreter

2021-03-17 Thread GitBox


dianaclarke opened a new pull request #9738:
URL: https://github.com/apache/arrow/pull/9738


   Before:
   
   >>> import pyarrow
   >>> pyarrow.concat_tables([None])
   Bus error: 10
   Python quit unexpectedly
   
   After:
   
   >>> import pyarrow
   >>> pyarrow.concat_tables([None])
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/table.pxi", line 2229, in pyarrow.lib.concat_tables
       raise TypeError("Unable to concatenate table. Table is None.")
   TypeError: Unable to concatenate table. Table is None.
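The substance of the fix is a simple input-validation guard. A minimal sketch in plain Python (the function name mirrors `pyarrow.concat_tables`, but this is an illustration, not the actual Cython implementation in `table.pxi`):

```python
def concat_tables(tables):
    """Illustrative guard: reject None entries before native concatenation."""
    tables = list(tables)
    for table in tables:
        if table is None:
            raise TypeError("Unable to concatenate table. Table is None.")
    # In the real library, the validated tables would be handed to the native
    # concatenation kernel here; this sketch simply returns them.
    return tables
```

With a guard like this, `concat_tables([None])` raises a clean `TypeError` instead of crashing the interpreter with a bus error.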







[GitHub] [arrow] westonpace commented on pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask

2021-03-17 Thread GitBox


westonpace commented on pull request #9720:
URL: https://github.com/apache/arrow/pull/9720#issuecomment-800985902


   Thanks for cleaning this up.  LGTM.







[GitHub] [arrow] pitrou closed pull request #9678: ARROW-11907: [C++] Use our own executor in S3FileSystem

2021-03-17 Thread GitBox


pitrou closed pull request #9678:
URL: https://github.com/apache/arrow/pull/9678


   







[GitHub] [arrow] pitrou commented on pull request #9728: ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults

2021-03-17 Thread GitBox


pitrou commented on pull request #9728:
URL: https://github.com/apache/arrow/pull/9728#issuecomment-800968946


   Can you fix the Python lint error?







[GitHub] [arrow] pitrou commented on pull request #9734: ARROW-11998: [C++] Make it easier to create vectors of move-only data types for tests

2021-03-17 Thread GitBox


pitrou commented on pull request #9734:
URL: https://github.com/apache/arrow/pull/9734#issuecomment-800967868


   Note: originally from 
https://github.com/westonpace/arrow/pull/6/commits/781dccacaea129668abf1aec87becfba910a266b







[GitHub] [arrow] alamb closed pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union

2021-03-17 Thread GitBox


alamb closed pull request #9695:
URL: https://github.com/apache/arrow/pull/9695


   







[GitHub] [arrow] alamb commented on a change in pull request #9695: ARROW-11955: [Rust][DataFusion] Support Union

2021-03-17 Thread GitBox


alamb commented on a change in pull request #9695:
URL: https://github.com/apache/arrow/pull/9695#discussion_r595886251



##
File path: rust/datafusion/src/logical_plan/builder.rs
##
@@ -413,6 +436,32 @@ mod tests {
 Ok(())
 }
 
+#[test]
+fn plan_builder_union_combined_single_union() -> Result<()> {
+let plan = LogicalPlanBuilder::scan_empty(
+"employee.csv",
+&employee_schema(),
+Some(vec![3, 4]),
+)?;
+
+let plan = plan
+.union(plan.build()?)?
+.union(plan.build()?)?
+.union(plan.build()?)?
+.build()?;
+
+// output has only one union
+let expected = "Union\

Review comment:
   👍 









[GitHub] [arrow] hunght3101 opened a new issue #9737: [Golang] How i create input format for IPC Reader?

2021-03-17 Thread GitBox


hunght3101 opened a new issue #9737:
URL: https://github.com/apache/arrow/issues/9737


   user@
   In library
   ```
   // NewReader returns a reader that reads records from an input stream.
   func NewReader(r io.Reader, opts ...Option) (*Reader, error) {
return NewReaderFromMessageReader(NewMessageReader(r), opts...)
   }
   ```
   How do I create the parameter `r` correctly when I call the NewReader function?
   Example : 
   ```
   var r io.Reader
   r = strings.NewReader(input)
   rr, err := ipc.NewReader(r, ipc.WithSchema(schema), ipc.WithAllocator(mem))
if err != nil {
fmt.Println(err)
}
   ```
   How do I create the input?







[GitHub] [arrow] pitrou commented on a change in pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #9720:
URL: https://github.com/apache/arrow/pull/9720#discussion_r595837653



##
File path: cpp/src/arrow/testing/gtest_util.cc
##
@@ -719,43 +719,44 @@ ExtensionTypeGuard::~ExtensionTypeGuard() {
   }
 }
 
-class GatingTask::Impl {
+class GatingTask::Impl : public std::enable_shared_from_this<Impl> {
  public:
   explicit Impl(double timeout_seconds)
   : timeout_seconds_(timeout_seconds), status_(), unlocked_(false) {}
 
   ~Impl() {
-std::unique_lock<std::mutex> lk(mx_);
-if (!finished_cv_.wait_for(
-lk, std::chrono::nanoseconds(static_cast<int64_t>(timeout_seconds_ * 1e9)),
-[this] { return num_finished_ == num_launched_; })) {
+if (num_running_ != num_launched_) {
   ADD_FAILURE()
-  << "A GatingTask instance was destroyed but the underlying tasks did not "
+  << "A GatingTask instance was destroyed but some underlying tasks did not "
+ "start running"
+  << std::endl;
+} else if (num_finished_ != num_launched_) {
+  ADD_FAILURE()
+  << "A GatingTask instance was destroyed but some underlying tasks did not "
   "finish running"
    << std::endl;
 }
   }
 
  std::function<void()> Task() {
 num_launched_++;

Review comment:
   I think Task() is only meant to be called from the main thread.
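The semantics under discussion can be sketched outside C++. Below is a hedged Python analogue (illustrative names, not Arrow's actual `GatingTask` API): tasks block on a shared gate until the test code unlocks it, and counters are mutated only from the main thread or under a lock, which is what avoids the data race.

```python
import threading


class GatingTask:
    """Illustrative analogue of the C++ test utility, not Arrow's API."""

    def __init__(self):
        self._gate = threading.Event()
        self._launched = 0
        self._finished = 0
        self._lock = threading.Lock()

    def task(self):
        # Mirrors the review comment: intended to be called from the
        # main thread only, so the unsynchronized increment is safe.
        self._launched += 1

        def run():
            self._gate.wait()  # block until unlock()
            with self._lock:
                self._finished += 1

        return run

    def unlock(self):
        self._gate.set()

    def all_finished(self):
        return self._finished == self._launched


gating = GatingTask()
threads = [threading.Thread(target=gating.task()) for _ in range(3)]
for t in threads:
    t.start()
gating.unlock()
for t in threads:
    t.join()
print(gating.all_finished())  # -> True
```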









[GitHub] [arrow] pitrou commented on a change in pull request #9720: ARROW-11976: [C++] Fix sporadic TSAN error with GatingTask

2021-03-17 Thread GitBox


pitrou commented on a change in pull request #9720:
URL: https://github.com/apache/arrow/pull/9720#discussion_r595837013



##
File path: cpp/src/arrow/testing/gtest_util.cc
##
@@ -769,10 +770,8 @@ class GatingTask::Impl {
   }
 
   Status Unlock() {
-{
-  std::lock_guard<std::mutex> lk(mx_);
-  unlocked_ = true;
-}
+std::lock_guard<std::mutex> lk(mx_);

Review comment:
   Indeed, but this is a test util so I'd favor simplicity.










[GitHub] [arrow] hunght3101 commented on issue #9735: [Golang] Create ipc format for ipc Reader

2021-03-17 Thread GitBox


hunght3101 commented on issue #9735:
URL: https://github.com/apache/arrow/issues/9735#issuecomment-800908056


   @emkornfield 
   Excuse me, I understand that `r = strings.NewReader("Object Apache Arrow")` is the problem, but I don't know how to construct the right input for it.
   Could you please re-open this issue?







[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-800898224


   > please update the docs to match, something like.
   
   > "Slice the buffer to contain the uncompressed bytes"
   
   Updated. Thank you.







[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-800897864


   > With the new enum, maybe we can make this an accessor that returns an enum instead? And then the byte can be extracted from there where necessary?
   
   Sounds good. I have revised the code accordingly. 







[GitHub] [arrow] liyafan82 commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-03-17 Thread GitBox


liyafan82 commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r595805905



##
File path: 
java/compression/src/main/java/org/apache/arrow/compression/Lz4CompressionCodec.java
##
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.compression;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.compression.CompressionCodec;
+import org.apache.arrow.vector.compression.CompressionUtil;
+import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorInputStream;
+import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorOutputStream;
+import org.apache.commons.compress.utils.IOUtils;
+
+import io.netty.util.internal.PlatformDependent;

Review comment:
   I guess no for now. Maybe we can have one in the future, so we can remove the dependency on Netty (and other dependencies on Netty as well).








