[GitHub] [arrow] mrkn commented on pull request #6302: ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors

2021-01-05 Thread GitBox


mrkn commented on pull request #6302:
URL: https://github.com/apache/arrow/pull/6302#issuecomment-754473308


   @pitrou I'm sorry for the late response.
   I've committed fixes for the two comments.
   And, if I should move `MakeRandomTensor`, I'll add a commit to do it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #9025: ARROW-10259: [Rust] Add custom metadata to Field

2021-01-05 Thread GitBox


nevi-me commented on pull request #9025:
URL: https://github.com/apache/arrow/pull/9025#issuecomment-754477186


   Please enable `generate_custom_metadata_case()` in 
`dev/archery/archery/integration/datagen.py`; you'll see that the Rust case is disabled there.
   That integration test fails with 
   
   ```
   Converting /tmp/arrow-integration-kj0llr2i/generated_custom_metadata.json to 
/tmp/tmpecqxfv9u/70af42de_generated_custom_metadata.json_as_file
   Error: ParseError("Field \'metadata\' must have exact one json map for a 
key-value pair")
   ```







[GitHub] [arrow] KirillLykov commented on a change in pull request #9079: ARROW-10578: [C++] Comparison kernels crashing for string array with null string scalar

2021-01-05 Thread GitBox


KirillLykov commented on a change in pull request #9079:
URL: https://github.com/apache/arrow/pull/9079#discussion_r551791854



##
File path: python/pyarrow/tests/test_compute.py
##
@@ -803,12 +836,17 @@ def con(values): return pa.array(values)
 def con(values): return pa.chunked_array([values])
 
 arr = con([1, 2, 3, None])
-# TODO this is a hacky way to construct a scalar ..
-scalar = pa.array([2]).sum()
+scalar = pa.scalar(2)
 
 result = pc.equal(arr, scalar)
 assert result.equals(con([False, True, False, None]))
 
+if typ == "array":
+nascalar = pa.scalar(None, type="int64")
+result = pc.equal(arr, nascalar)
+isnull = pc.is_null(result)

Review comment:
   All done









[GitHub] [arrow] jorisvandenbossche closed pull request #9079: ARROW-10578: [C++] Comparison kernels crashing for string array with null string scalar

2021-01-05 Thread GitBox


jorisvandenbossche closed pull request #9079:
URL: https://github.com/apache/arrow/pull/9079


   







[GitHub] [arrow] westonpace opened a new pull request #9100: Added a test to try and reproduce arrow-11067. Since it can only fai…

2021-01-05 Thread GitBox


westonpace opened a new pull request #9100:
URL: https://github.com/apache/arrow/pull/9100


   …l on mac, I'm going to commit just the test first to see whether it fails 
properly, so I can avoid setting up a mac devenv.







[GitHub] [arrow] github-actions[bot] commented on pull request #9100: Added a test to try and reproduce arrow-11067. Since it can only fai…

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754533608


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] pitrou commented on pull request #9100: Added a test to try and reproduce arrow-11067. Since it can only fai…

2021-01-05 Thread GitBox


pitrou commented on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754557189


   I restarted the MacOS jobs because they did fail, but not for the right 
reason :-(







[GitHub] [arrow] pitrou commented on a change in pull request #9100: Added a test to try and reproduce arrow-11067. Since it can only fai…

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#discussion_r551852896



##
File path: cpp/src/arrow/util/trie_test.cc
##
@@ -175,6 +175,15 @@ TEST(Trie, EmptyString) {
   ASSERT_EQ(-1, trie.Find("x"));
 }
 
+TEST(Trie, LongString) {
+  TrieBuilder builder;
+  ASSERT_OK(builder.Append(""));
+  const Trie trie = builder.Finish();
+  auto maxlen = static_cast<size_t>(std::numeric_limits<int16_t>::max()) + 1;

Review comment:
   We may want to test with a bunch of limit-case sizes (e.g. `2^n, 2^n-1, 
2^n+1` for `n` in 15, 16, 17).
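   The limit-case sizes suggested above can be enumerated programmatically. A minimal sketch (the helper names `boundary_lengths` and `long_key` are hypothetical, not part of the Arrow test suite):

```python
def boundary_lengths(exponents=(15, 16, 17)):
    # For each bit width n, exercise lengths 2^n - 1, 2^n, and 2^n + 1,
    # which straddle common index-type limits (e.g. int16 max at 2^15 - 1).
    sizes = []
    for n in exponents:
        sizes.extend([2**n - 1, 2**n, 2**n + 1])
    return sizes

def long_key(length):
    # A lookup string of exactly the boundary length.
    return "x" * length
```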









[GitHub] [arrow] kszucs commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-05 Thread GitBox


kszucs commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754560147


   I'm afraid that it is enforced at organization level, so we need to update 
that action.







[GitHub] [arrow] mqy opened a new pull request #9101: ARROW-11131: [Rust] Improve performance of bool_equal

2021-01-05 Thread GitBox


mqy opened a new pull request #9101:
URL: https://github.com/apache/arrow/pull/9101


   This PR follows https://github.com/apache/arrow/pull/8541.
   
   1. Implement the fast path for when both `lhs` and `rhs` have zero null count.
   2. Possibly fix a condition-testing bug in `(0..len).all(|i| {...})`.
      NOTE: there are other similar cases where `test_struct()` failed when 
changed in this way.
   3. Add benchmarks `equal_bool_512` and `equal_bool_nulls_512`.
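   The zero-null fast path amounts to comparing the packed boolean bitmaps directly. A minimal Python model of the idea (not the Rust implementation; `packed_bools_equal` is a hypothetical name): compare whole bytes at once, then mask the trailing partial byte.

```python
def packed_bools_equal(lhs: bytes, rhs: bytes, num_bits: int) -> bool:
    # Compare all whole bytes in one shot, then mask off bits past
    # `num_bits` so trailing padding cannot affect the result.
    full, rem = divmod(num_bits, 8)
    if lhs[:full] != rhs[:full]:
        return False
    if rem == 0:
        return True
    mask = (1 << rem) - 1
    return (lhs[full] & mask) == (rhs[full] & mask)
```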
   
   Benchmark comparing to arrow master:
   
   ```
   equal_bool_512          time:   [52.946 ns 53.075 ns 53.188 ns]
                           change: [-96.295% -96.285% -96.277%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) low severe
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   equal_bool_nulls_512    time:   [2.3502 us 2.3696 us 2.3893 us]
                           change: [-62.285% -62.019% -61.714%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```







[GitHub] [arrow] github-actions[bot] commented on pull request #9101: ARROW-11131: [Rust] Improve performance of bool_equal

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9101:
URL: https://github.com/apache/arrow/pull/9101#issuecomment-754570371


   https://issues.apache.org/jira/browse/ARROW-11131







[GitHub] [arrow] liyafan82 commented on a change in pull request #8963: ARROW-10962: [FlightRPC][Java] fill in empty body buffer if needed

2021-01-05 Thread GitBox


liyafan82 commented on a change in pull request #8963:
URL: https://github.com/apache/arrow/pull/8963#discussion_r551866156



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestBasicOperation.java
##
@@ -317,6 +325,71 @@ private void test(BiConsumer consumer) throws Exc
 }
   }
 
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufRecordBatchCompatibility() throws Exception {
+final Schema schema = new 
Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, 
true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE);
+ final VectorSchemaRoot root = VectorSchemaRoot.create(schema, 
allocator)) {
+  final VectorUnloader unloader = new VectorUnloader(root);
+  root.setRowCount(0);
+  final MethodDescriptor.Marshaller marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  try (final ArrowMessage message = new 
ArrowMessage(unloader.getRecordBatch(), null, new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.RECORD_BATCH, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage);
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  final ArrowRecordBatch rb =
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asRecordBatch();
+  Assert.assertEquals(rb.computeBodyLength(), 0);
+}
+  }
+
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufSchemaCompatibility() throws Exception {
+final Schema schema = new 
Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, 
true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE)) {
+  final MethodDescriptor.Marshaller marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  Flight.FlightDescriptor descriptor = FlightDescriptor.command(new 
byte[0]).toProtocol();
+  try (final ArrowMessage message = new ArrowMessage(descriptor, schema, 
new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.SCHEMA, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage)
+  .toBuilder()
+  .setDataBody(ByteString.EMPTY)
+  .build();
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asSchema();

Review comment:
   Maybe we need to assign the schema to an object and verify that its body 
is null?









[GitHub] [arrow] liyafan82 commented on a change in pull request #8963: ARROW-10962: [FlightRPC][Java] fill in empty body buffer if needed

2021-01-05 Thread GitBox


liyafan82 commented on a change in pull request #8963:
URL: https://github.com/apache/arrow/pull/8963#discussion_r551866948



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestBasicOperation.java
##
@@ -317,6 +325,71 @@ private void test(BiConsumer consumer) throws Exc
 }
   }
 
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufRecordBatchCompatibility() throws Exception {
+final Schema schema = new 
Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, 
true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE);
+ final VectorSchemaRoot root = VectorSchemaRoot.create(schema, 
allocator)) {
+  final VectorUnloader unloader = new VectorUnloader(root);
+  root.setRowCount(0);
+  final MethodDescriptor.Marshaller marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  try (final ArrowMessage message = new 
ArrowMessage(unloader.getRecordBatch(), null, new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.RECORD_BATCH, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage);
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  final ArrowRecordBatch rb =
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asRecordBatch();
+  Assert.assertEquals(rb.computeBodyLength(), 0);
+}
+  }
+
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufSchemaCompatibility() throws Exception {
+final Schema schema = new 
Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, 
true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE)) {
+  final MethodDescriptor.Marshaller marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  Flight.FlightDescriptor descriptor = FlightDescriptor.command(new 
byte[0]).toProtocol();
+  try (final ArrowMessage message = new ArrowMessage(descriptor, schema, 
new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.SCHEMA, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {

Review comment:
   Could we extract a common method for the logic of writing the message to a 
`ByteArrayOutputStream`?









[GitHub] [arrow] liyafan82 commented on a change in pull request #8963: ARROW-10962: [FlightRPC][Java] fill in empty body buffer if needed

2021-01-05 Thread GitBox


liyafan82 commented on a change in pull request #8963:
URL: https://github.com/apache/arrow/pull/8963#discussion_r551867296



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestBasicOperation.java
##
@@ -317,6 +325,71 @@ private void test(BiConsumer consumer) throws Exc
 }
   }
 
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufRecordBatchCompatibility() throws Exception {
+final Schema schema = new 
Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, 
true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE);
+ final VectorSchemaRoot root = VectorSchemaRoot.create(schema, 
allocator)) {
+  final VectorUnloader unloader = new VectorUnloader(root);
+  root.setRowCount(0);
+  final MethodDescriptor.Marshaller marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  try (final ArrowMessage message = new 
ArrowMessage(unloader.getRecordBatch(), null, new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.RECORD_BATCH, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage);
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  final ArrowRecordBatch rb =
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asRecordBatch();
+  Assert.assertEquals(rb.computeBodyLength(), 0);
+}
+  }
+
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufSchemaCompatibility() throws Exception {
+final Schema schema = new 
Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, 
true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE)) {
+  final MethodDescriptor.Marshaller marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  Flight.FlightDescriptor descriptor = FlightDescriptor.command(new 
byte[0]).toProtocol();
+  try (final ArrowMessage message = new ArrowMessage(descriptor, schema, 
new IpcOption())) {

Review comment:
   Do we need to check the message body here?









[GitHub] [arrow] Dandandan commented on pull request #9070: ARROW-11030: [Rust][DataFusion] Concatenate left side batches to single batch in HashJoinExec

2021-01-05 Thread GitBox


Dandandan commented on pull request #9070:
URL: https://github.com/apache/arrow/pull/9070#issuecomment-754573137


   @jorgecarleitao @andygrove 
   Do you think this design is good enough for now?
   For now it means anyone storing > 4GB variable size data in a build-side 
column should opt to use 64-bit offsets, and in the future one option would be 
to automatically detect this and apply a `cast` when needed. I think that 
should have the least performance implications.
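   To illustrate the offset-width constraint: Arrow's variable-size types store cumulative byte offsets, and signed 32-bit offsets top out at 2^31 − 1 total bytes. A sketch, assuming a signed-offset interpretation (`needs_large_offsets` is a hypothetical helper, not DataFusion code):

```python
I32_MAX = 2**31 - 1  # largest end offset a signed 32-bit offset array can hold

def needs_large_offsets(total_value_bytes: int) -> bool:
    # Once the total variable-size payload exceeds i32::MAX,
    # 64-bit ("large") offsets are required.
    return total_value_bytes > I32_MAX
```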







[GitHub] [arrow] liyafan82 commented on a change in pull request #9088: ARROW-11114: [Java] Fix Schema and Field metadata JSON serialization

2021-01-05 Thread GitBox


liyafan82 commented on a change in pull request #9088:
URL: https://github.com/apache/arrow/pull/9088#discussion_r551868939



##
File path: 
java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestField.java
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.types.pojo;
+
+import static org.apache.arrow.vector.types.pojo.Schema.METADATA_KEY;
+import static org.apache.arrow.vector.types.pojo.Schema.METADATA_VALUE;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.arrow.vector.types.pojo.ArrowType.Int;
+import org.junit.Test;
+
+public class TestField {
+
+  private static Field field(String name, boolean nullable, ArrowType type, 
Map<String, String> metadata) {
+return new Field(name, new FieldType(nullable, type, null, metadata), 
Collections.emptyList());
+  }
+
+  @Test
+  public void testMetadata() throws IOException {
+Map<String, String> metadata = new HashMap<>(1);
+metadata.put("testKey", "testValue");
+
+Schema schema = new Schema(Collections.singletonList(
+field("a", false, new Int(8, true), metadata)
+));
+contains(schema, "\"" + METADATA_KEY + "\" : \"testKey\"", "\"" + 
METADATA_VALUE + "\" : \"testValue\"");
+
+String json = schema.toJson();
+Schema actual = Schema.fromJSON(json);
+
+Map<String, String> actualMetadata = 
actual.getFields().get(0).getMetadata();
+assertEquals(1, actualMetadata.size());
+assertEquals("testValue", actualMetadata.get("testKey"));
+  }
+
+  private void contains(Schema schema, String... s) {

Review comment:
   To avoid a duplicate call to `toJson`, maybe the first parameter could be 
a `String`?
   In addition, the method could be named `jsonContains`?









[GitHub] [arrow] pitrou opened a new pull request #9102: ARROW-10955: [C++] Fix JSON reading of list(null) values

2021-01-05 Thread GitBox


pitrou opened a new pull request #9102:
URL: https://github.com/apache/arrow/pull/9102


   







[GitHub] [arrow] pitrou commented on a change in pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9097:
URL: https://github.com/apache/arrow/pull/9097#discussion_r551874341



##
File path: cpp/src/parquet/encoding.cc
##
@@ -110,11 +110,16 @@ class PlainEncoder : public EncoderImpl, virtual public 
TypedEncoder {
 
   void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
  int64_t valid_bits_offset) override {
-PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * 
sizeof(T),
- 
this->memory_pool()));
-T* data = reinterpret_cast<T*>(buffer->mutable_data());
-int num_valid_values = ::arrow::util::internal::SpacedCompress(
-src, num_values, valid_bits, valid_bits_offset, data);
+const T* data = src;
+int num_valid_values = num_values;
+if (valid_bits != NULLPTR) {
+  PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values 
* sizeof(T),
+   
this->memory_pool()));

Review comment:
   It seems `buffer` will be deallocated at the end of this `if` block, but 
you're using its contents afterwards (when calling `Put`). 

##
File path: cpp/src/parquet/encoding.cc
##
@@ -110,11 +110,16 @@ class PlainEncoder : public EncoderImpl, virtual public 
TypedEncoder {
 
   void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
  int64_t valid_bits_offset) override {
-PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * 
sizeof(T),
- 
this->memory_pool()));
-T* data = reinterpret_cast<T*>(buffer->mutable_data());
-int num_valid_values = ::arrow::util::internal::SpacedCompress(
-src, num_values, valid_bits, valid_bits_offset, data);
+const T* data = src;
+int num_valid_values = num_values;
+if (valid_bits != NULLPTR) {
+  PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values 
* sizeof(T),
+   
this->memory_pool()));
+  T* data_ = reinterpret_cast<T*>(buffer->mutable_data());

Review comment:
   Underscore-terminated variable names are only for attributes. Can you 
use another name? (e.g. `non_spaced_data`)

##
File path: cpp/src/parquet/encoding.cc
##
@@ -309,11 +314,16 @@ class PlainEncoder : public EncoderImpl, 
virtual public BooleanEnco
 
   void PutSpaced(const bool* src, int num_values, const uint8_t* valid_bits,
  int64_t valid_bits_offset) override {
-PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * 
sizeof(T),
- 
this->memory_pool()));
-T* data = reinterpret_cast<T*>(buffer->mutable_data());
-int num_valid_values = ::arrow::util::internal::SpacedCompress(
-src, num_values, valid_bits, valid_bits_offset, data);
+const T* data = src;
+int num_valid_values = num_values;
+if (valid_bits != NULLPTR) {
+  PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values 
* sizeof(T),

Review comment:
   Same concerns below.

##
File path: cpp/src/parquet/encoding.h
##
@@ -290,14 +290,18 @@ class TypedDecoder : virtual public Decoder {
   /// \return The number of values decoded, including nulls.
   virtual int DecodeSpaced(T* buffer, int num_values, int null_count,
const uint8_t* valid_bits, int64_t 
valid_bits_offset) {
-int values_to_read = num_values - null_count;
-int values_read = Decode(buffer, values_to_read);
-if (values_read != values_to_read) {
-  throw ParquetException("Number of values / definition_levels read did 
not match");
+if (null_count > 0) {
+  int values_to_read = num_values - null_count;
+  int values_read = Decode(buffer, values_to_read);
+  if (values_read != values_to_read) {
+throw ParquetException("Number of values / definition_levels read did 
not match");
+  }
+
+  return ::arrow::util::internal::SpacedExpand(buffer, num_values, 
null_count,
+  valid_bits, 
valid_bits_offset);
+} else {
+  return num_values;

Review comment:
   Why don't you call `Decode` in this branch?
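   The `SpacedExpand` call in the quoted diff redistributes densely decoded values into their valid slots. A simplified Python model of that behavior (assuming a least-significant-bit-first validity bitmap; `spaced_expand` here is illustrative, not the Arrow API):

```python
def spaced_expand(values, num_values, valid_bits):
    # Place densely packed `values` at the positions whose validity bit is
    # set, leaving None at null slots.
    out = [None] * num_values
    it = iter(values)
    for i in range(num_values):
        if valid_bits[i // 8] & (1 << (i % 8)):
            out[i] = next(it)
    return out
```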









[GitHub] [arrow] codecov-io commented on pull request #9101: ARROW-11131: [Rust] Improve performance of bool_equal

2021-01-05 Thread GitBox


codecov-io commented on pull request #9101:
URL: https://github.com/apache/arrow/pull/9101#issuecomment-754580244


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=h1) Report
   > Merging 
[#9101](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=desc) (4dd6bfb) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/86cf246d161512923253baa8fe62e00de88db73a?el=desc)
 (86cf246) will **increase** coverage by `0.05%`.
   > The diff coverage is `67.24%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9101/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9101      +/-   ##
   ==========================================
   + Coverage   82.60%   82.66%   +0.05%     
   ==========================================
     Files         204      204              
     Lines       50175    50327     +152     
   ==========================================
   + Hits        41447    41602     +155     
   + Misses       8728     8725       -3     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==)
 | `69.84% <67.24%> (-30.16%)` | :arrow_down: |
   | 
[rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz)
 | `88.88% <0.00%> (-1.06%)` | :arrow_down: |
   | 
[rust/parquet/src/schema/types.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9zY2hlbWEvdHlwZXMucnM=)
 | `89.52% <0.00%> (-0.42%)` | :arrow_down: |
   | 
[rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=)
 | `75.44% <0.00%> (+0.09%)` | :arrow_up: |
   | 
[rust/parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJvd19yZWFkZXIucnM=)
 | `91.57% <0.00%> (+0.18%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz)
 | `90.98% <0.00%> (+0.43%)` | :arrow_up: |
   | 
[rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=)
 | `91.52% <0.00%> (+0.58%)` | :arrow_up: |
   | 
[rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz)
 | `77.66% <0.00%> (+2.63%)` | :arrow_up: |
   | 
[rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz)
 | `75.67% <0.00%> (+3.45%)` | :arrow_up: |
   | 
[rust/arrow/src/record\_batch.rs](https://codecov.io/gh/apache/arrow/pull/9101/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvcmVjb3JkX2JhdGNoLnJz)
 | `83.96% <0.00%> (+10.13%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=footer). Last 
update 
[f49f293...4dd6bfb](https://codecov.io/gh/apache/arrow/pull/9101?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] github-actions[bot] commented on pull request #9102: ARROW-10955: [C++] Fix JSON reading of list(null) values

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9102:
URL: https://github.com/apache/arrow/pull/9102#issuecomment-754584408


   https://issues.apache.org/jira/browse/ARROW-10955







[GitHub] [arrow] alamb commented on a change in pull request #9086: [Rust] [DataFusion] [Experiment] Blocking threads filter

2021-01-05 Thread GitBox


alamb commented on a change in pull request #9086:
URL: https://github.com/apache/arrow/pull/9086#discussion_r551883985



##
File path: rust/datafusion/src/physical_plan/filter.rs
##
@@ -103,25 +103,23 @@ impl ExecutionPlan for FilterExec {
 }
 
 async fn execute(&self, partition: usize) -> 
Result<SendableRecordBatchStream> {
-Ok(Box::pin(FilterExecStream {
-schema: self.input.schema().clone(),
-predicate: self.predicate.clone(),
-input: self.input.execute(partition).await?,
-}))
+let predicate = self.predicate.clone();
+
+let stream = self.input.execute(partition).await?;
+let stream = stream.then(move |batch| {
+let predicate = predicate.clone();
+async move {
+// Filtering batches is CPU-bound and therefore justifies a dedicated thread pool
+task::spawn_blocking(move || batch_filter(&batch?, &predicate))
+.await
+.unwrap()
+}
+});
+
+Ok(Box::pin(SizedRecordBatchStream::new(stream, self.schema(

Review comment:
   I think buffering with a channel is the Rust-y way to get more parallel execution without unlimited buffering.
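
   The bounded-buffering idea above can be sketched with only the Rust standard library: a `sync_channel` caps how far the filtering thread runs ahead of the consumer. The batch type and `batch_filter` below are stand-ins for illustration, not DataFusion's actual types.

   ```rust
   use std::sync::mpsc::sync_channel;
   use std::thread;

   // Stand-in for a record batch: a plain vector of values.
   fn batch_filter(batch: Vec<i32>) -> Vec<i32> {
       batch.into_iter().filter(|v| v % 2 == 0).collect()
   }

   fn filtered_batches(batches: Vec<Vec<i32>>, capacity: usize) -> Vec<Vec<i32>> {
       // Bounded channel: the producer blocks once `capacity` filtered
       // batches are waiting, so buffering is limited rather than unbounded.
       let (tx, rx) = sync_channel(capacity);
       let worker = thread::spawn(move || {
           for batch in batches {
               if tx.send(batch_filter(batch)).is_err() {
                   break; // consumer hung up
               }
           }
       });
       let out: Vec<Vec<i32>> = rx.into_iter().collect();
       worker.join().unwrap();
       out
   }

   fn main() {
       let out = filtered_batches(vec![vec![1, 2, 3, 4], vec![5, 6]], 2);
       assert_eq!(out, vec![vec![2, 4], vec![6]]);
   }
   ```

   With capacity 2, the worker blocks after two unconsumed filtered batches, so memory stays bounded however fast the producer is.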









[GitHub] [arrow] kszucs opened a new pull request #9103: [CI] Use pip to install crossbow's dependencies for the comment bot

2021-01-05 Thread GitBox


kszucs opened a new pull request #9103:
URL: https://github.com/apache/arrow/pull/9103


   







[GitHub] [arrow] alamb commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-05 Thread GitBox


alamb commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551885031



##
File path: rust/datafusion/tests/sql.rs
##
@@ -132,6 +132,7 @@ async fn parquet_single_nan_schema() {
 }
 
 #[tokio::test]
+#[ignore = "Test ignored, will be enabled as part of the nested Parquet 
reader"]

Review comment:
   What happened to this test?  It looks like it used to pass and now it 
doesn't?

##
File path: rust/arrow/src/array/equal/boolean.rs
##
@@ -15,35 +15,60 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::array::ArrayData;
+use crate::array::{data::count_nulls, ArrayData};
+use crate::buffer::Buffer;
+use crate::util::bit_util::get_bit;
 
 use super::utils::equal_bits;
 
 pub(super) fn boolean_equal(
 lhs: &ArrayData,
 rhs: &ArrayData,
+lhs_nulls: Option<&Buffer>,
+rhs_nulls: Option<&Buffer>,
 lhs_start: usize,
 rhs_start: usize,
 len: usize,
 ) -> bool {
 let lhs_values = lhs.buffers()[0].as_slice();
 let rhs_values = rhs.buffers()[0].as_slice();
 
-// TODO: we can do this more efficiently if all values are not-null
-(0..len).all(|i| {
-let lhs_pos = lhs_start + i;
-let rhs_pos = rhs_start + i;
-let lhs_is_null = lhs.is_null(lhs_pos);
-let rhs_is_null = rhs.is_null(rhs_pos);
-
-lhs_is_null
-|| (lhs_is_null == rhs_is_null)
-&& equal_bits(
-lhs_values,
-rhs_values,
-lhs_pos + lhs.offset(),
-rhs_pos + rhs.offset(),
-1,
-)
-})
+let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
+let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+
+if lhs_null_count == 0 && rhs_null_count == 0 {
+(0..len).all(|i| {
+let lhs_pos = lhs_start + i;
+let rhs_pos = rhs_start + i;
+
+equal_bits(
+lhs_values,
+rhs_values,
+lhs_pos + lhs.offset(),
+rhs_pos + rhs.offset(),
+1,
+)
+})
+} else {
+// get a ref of the null buffer bytes, to use in testing for nullness
+let lhs_null_bytes = lhs_nulls.as_ref().unwrap().as_slice();

Review comment:
   Is it possible for `lhs_nulls == Some(..)` but `rhs_nulls == None` (and 
visa versa?) Given they are optional arguments I wasn't sure if they would 
always both be either `None` or `Some`
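
   One way to sketch the four `Some`/`None` combinations the question raises (hypothetical helper, not the arrow crate's API) is to treat a missing bitmap as "no nulls on that side" and only consult the buffers that are present:

   ```rust
   // Decide whether position `i` is valid on both sides when each side's
   // null bitmap is independently optional: a missing bitmap means every
   // slot on that side is valid.
   fn both_valid_at(lhs: Option<&[u8]>, rhs: Option<&[u8]>, i: usize) -> bool {
       let bit = |b: &[u8]| (b[i / 8] >> (i % 8)) & 1 == 1;
       match (lhs, rhs) {
           (None, None) => true,
           (Some(l), None) => bit(l),
           (None, Some(r)) => bit(r),
           (Some(l), Some(r)) => bit(l) && bit(r),
       }
   }

   fn main() {
       let bitmap = [0b0000_0101u8]; // slots 0 and 2 valid, slot 1 null
       assert!(both_valid_at(None, None, 1));
       assert!(!both_valid_at(Some(&bitmap[..]), None, 1));
       assert!(both_valid_at(Some(&bitmap[..]), Some(&bitmap[..]), 2));
   }
   ```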

##
File path: rust/arrow/src/array/equal/structure.rs
##
@@ -37,39 +37,20 @@ fn equal_values(
 rhs_start: usize,
 len: usize,
 ) -> bool {
-let mut temp_lhs: Option<Buffer> = None;
-let mut temp_rhs: Option<Buffer> = None;
-
 lhs.child_data()
 .iter()
 .zip(rhs.child_data())
 .all(|(lhs_values, rhs_values)| {
 // merge the null data
-let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) 
{

Review comment:
   I think the use of `temp_lhs` and `temp_rhs` here is to avoid the 
`lhs_nulls.cloned()` and `rhs_nulls.cloned()` calls below. 

##
File path: rust/datafusion/tests/sql.rs
##
@@ -132,6 +132,7 @@ async fn parquet_single_nan_schema() {
 }
 
 #[tokio::test]
+#[ignore = "Test ignored, will be enabled as part of the nested Parquet 
reader"]

Review comment:
   When I ran this test locally, it fails with a seemingly non-sensical 
error
   
   ```
   failures:
   
    parquet_list_columns stdout 
   thread 'parquet_list_columns' panicked at 'assertion failed: `(left == 
right)`
 left: `PrimitiveArray
   [
 null,
 1,
   ]`,
right: `PrimitiveArray
   [
 null,
 1,
   ]`', datafusion/tests/sql.rs:204:5
   ```

##
File path: rust/arrow/src/array/equal/utils.rs
##
@@ -76,3 +80,185 @@ pub(super) fn equal_len(
 ) -> bool {
 lhs_values[lhs_start..(lhs_start + len)] == 
rhs_values[rhs_start..(rhs_start + len)]
 }
+
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+pub(super) fn child_logical_null_buffer(
+parent_data: &ArrayData,
+logical_null_buffer: Option<Buffer>,

Review comment:
   I think if you changed 
   
   ```
   let parent_bitmap = 
logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {
   ```
   
   to 
   
   ```
   let parent_bitmap = 
logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {
   
   ```
   
   Then the signature could take an `Option<&Buffer>` and the code is cleaner 
(fewer calls to `.cloned()` outside this function). 
   
   But I don't t

[GitHub] [arrow] alamb commented on a change in pull request #9064: ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

2021-01-05 Thread GitBox


alamb commented on a change in pull request #9064:
URL: https://github.com/apache/arrow/pull/9064#discussion_r551900306



##
File path: rust/datafusion/src/datasource/parquet.rs
##
@@ -62,17 +64,37 @@ impl TableProvider for ParquetTable {
 self.schema.clone()
 }
 
+fn supports_filter_pushdown(
+&self,
+_filter: &Expr,
+) -> Result<TableProviderFilterPushDown> {
+Ok(TableProviderFilterPushDown::Inexact)
+}
+
 /// Scan the file(s), using the provided projection, and return one 
BatchIterator per
 /// partition.
 fn scan(
 &self,
 projection: &Option>,
 batch_size: usize,
-_filters: &[Expr],
+filters: &[Expr],

Review comment:
   I agree that starting with this PR and then extending the approach to be 
something more generic is a good approach.









[GitHub] [arrow] github-actions[bot] commented on pull request #9103: [CI] Use pip to install crossbow's dependencies for the comment bot

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9103:
URL: https://github.com/apache/arrow/pull/9103#issuecomment-754610712


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-05 Thread GitBox


nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551913784



##
File path: rust/arrow/src/array/equal/list.rs
##
@@ -71,45 +79,94 @@ fn offset_value_equal(
 pub(super) fn list_equal<T: OffsetSizeTrait>(
 lhs: &ArrayData,
 rhs: &ArrayData,
+lhs_nulls: Option<&Buffer>,
+rhs_nulls: Option<&Buffer>,
 lhs_start: usize,
 rhs_start: usize,
 len: usize,
 ) -> bool {
 let lhs_offsets = lhs.buffer::<T>(0);
 let rhs_offsets = rhs.buffer::<T>(0);
 
+// There is an edge-case where an n-length list that has 0 children results in panics.
+// For example, an array with offsets [0, 0, 0, 0, 0] has 4 slots, but will have

Review comment:
   Ah yes, that's the offsets for the 4 slots. A list's offsets are always 
list_length + 1, as they point to the range of values. [0, 2, 3] has 2 slots, 
with slot 1 being `[1, 2]`, and slot 2 being `[3]`.









[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-05 Thread GitBox


nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551915799



##
File path: rust/datafusion/tests/sql.rs
##
@@ -132,6 +132,7 @@ async fn parquet_single_nan_schema() {
 }
 
 #[tokio::test]
+#[ignore = "Test ignored, will be enabled as part of the nested Parquet 
reader"]

Review comment:
   The same test failed at some point on the parquet list-writer branch. 
I'm not confident that we were reconstructing list arrays correctly from 
parquet data.
   I'm opting to disable it temporarily, then re-enable it in #8927, as I'd 
otherwise be duplicating my effort here.









[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-05 Thread GitBox


nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551919876



##
File path: rust/arrow/src/array/equal/utils.rs
##
@@ -76,3 +80,185 @@ pub(super) fn equal_len(
 ) -> bool {
 lhs_values[lhs_start..(lhs_start + len)] == 
rhs_values[rhs_start..(rhs_start + len)]
 }
+
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+pub(super) fn child_logical_null_buffer(
+parent_data: &ArrayData,
+logical_null_buffer: Option<Buffer>,

Review comment:
   I like your approach, better to clone in one place, than various.









[GitHub] [arrow] nevi-me commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-05 Thread GitBox


nevi-me commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551920615



##
File path: rust/arrow/src/array/equal/utils.rs
##
@@ -76,3 +80,185 @@ pub(super) fn equal_len(
 ) -> bool {
 lhs_values[lhs_start..(lhs_start + len)] == 
rhs_values[rhs_start..(rhs_start + len)]
 }
+
+/// Computes the logical validity bitmap of the array data using the
+/// parent's array data. The parent should be a list or struct, else
+/// the logical bitmap of the array is returned unaltered.
+///
+/// Parent data is passed along with the parent's logical bitmap, as
+/// nested arrays could have a logical bitmap different to the physical
+/// one on the `ArrayData`.
+pub(super) fn child_logical_null_buffer(
+parent_data: &ArrayData,
+logical_null_buffer: Option<Buffer>,
+child_data: &ArrayData,
+) -> Option<Buffer> {
+let parent_len = parent_data.len();
+let parent_bitmap = 
logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {
+let ceil = bit_util::ceil(parent_len, 8);
+Bitmap::from(Buffer::from(vec![0b11111111; ceil]))
+});
+let self_null_bitmap = child_data.null_bitmap().clone().unwrap_or_else(|| {
+let ceil = bit_util::ceil(child_data.len(), 8);
+Bitmap::from(Buffer::from(vec![0b11111111; ceil]))
+});
+match parent_data.data_type() {
+DataType::List(_) => Some(logical_list_bitmap::<i32>(
+parent_data,
+parent_bitmap,
+self_null_bitmap,
+)),
+DataType::LargeList(_) => Some(logical_list_bitmap::<i64>(
+parent_data,
+parent_bitmap,
+self_null_bitmap,
+)),
+DataType::FixedSizeList(_, len) => {
+let len = *len as usize;
+let array_offset = parent_data.offset();
+let bitmap_len = bit_util::ceil(parent_len * len, 8);
+let mut buffer =
+MutableBuffer::new(bitmap_len).with_bitset(bitmap_len, false);
+let mut null_slice = buffer.as_slice_mut();
+(array_offset..parent_len + array_offset).for_each(|index| {
+let start = index * len;
+let end = start + len;
+let mask = parent_bitmap.is_set(index);
+(start..end).for_each(|child_index| {
+if mask && self_null_bitmap.is_set(child_index) {
+bit_util::set_bit(&mut null_slice, child_index);
+}
+});
+});
+Some(buffer.into())
+}
+DataType::Struct(_) => {
+// Arrow implementations are free to pad data, which can result in 
null buffers not
+// having the same length.
+// Rust bitwise comparisons will return an error if left AND right 
is performed on
+// buffers of different length.
+// This might be a valid case during integration testing, where we 
read Arrow arrays
+// from IPC data, which has padding.
+//
+// We first perform a bitwise comparison, and if there is an 
error, we revert to a
+// slower method that indexes into the buffers one-by-one.
+let result = &parent_bitmap & &self_null_bitmap;
+if let Ok(bitmap) = result {
+return Some(bitmap.bits);
+}
+// slow path
+let array_offset = parent_data.offset();
+let mut buffer = MutableBuffer::new_null(parent_len);
+let mut null_slice = buffer.as_slice_mut();
+(0..parent_len).for_each(|index| {
+if parent_bitmap.is_set(index + array_offset)
+&& self_null_bitmap.is_set(index + array_offset)
+{
+bit_util::set_bit(&mut null_slice, index);
+}
+});
+Some(buffer.into())
+}
+DataType::Union(_) => {
+unimplemented!("Logical equality not yet implemented for union 
arrays")
+}
+DataType::Dictionary(_, _) => {
+unimplemented!("Logical equality not yet implemented for nested 
dictionaries")
+}
+data_type => {
+panic!("Data type {:?} is not a supported nested type", data_type)
+}
+}
+}
+
+// Calculate a list child's logical bitmap/buffer

Review comment:
   I'm comfortable that I've captured the correct semantics of logical equality for lists; that said, lists have been a thorn in my side for some time now :(









[GitHub] [arrow] kszucs commented on a change in pull request #9103: ARROW-11132: [CI] Use pip to install crossbow's dependencies for the comment bot

2021-01-05 Thread GitBox


kszucs commented on a change in pull request #9103:
URL: https://github.com/apache/arrow/pull/9103#discussion_r551921283



##
File path: .github/workflows/comment_bot.yml
##
@@ -34,17 +34,12 @@ jobs:
 uses: actions/checkout@v2
 with:
   path: arrow
-  # because libgit2 is a dependency of crossbow so prefer conda

Review comment:
   pygit2 ships wheels now with libgit2 bundled









[GitHub] [arrow] github-actions[bot] commented on pull request #9103: ARROW-11132: [CI] Use pip to install crossbow's dependencies for the comment bot

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9103:
URL: https://github.com/apache/arrow/pull/9103#issuecomment-754630798


   https://issues.apache.org/jira/browse/ARROW-11132







[GitHub] [arrow] codecov-io edited a comment on pull request #9089: ARROW-11122: [Rust] Added FFI support for date and time.

2021-01-05 Thread GitBox


codecov-io edited a comment on pull request #9089:
URL: https://github.com/apache/arrow/pull/9089#issuecomment-753752955


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=h1) Report
   > Merging 
[#9089](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=desc) (887df6e) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/86cf246d161512923253baa8fe62e00de88db73a?el=desc)
 (86cf246) will **increase** coverage by `0.07%`.
   > The diff coverage is `88.14%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9089/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9089      +/-   ##
   ==========================================
   + Coverage   82.60%   82.68%    +0.07%    
   ==========================================
     Files         204      204              
     Lines       50175    50307     +132     
   ==========================================
   + Hits        41447    41595     +148     
   + Misses       8728     8712      -16     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/parquet/src/schema/types.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9zY2hlbWEvdHlwZXMucnM=) | `89.52% <50.00%> (-0.42%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `75.44% <69.56%> (+0.09%)` | :arrow_up: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <83.87%> (+3.73%)` | :arrow_up: |
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `88.88% <87.50%> (-1.06%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `90.98% <100.00%> (+0.43%)` | :arrow_up: |
   | [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `77.66% <100.00%> (+2.63%)` | :arrow_up: |
   | [rust/arrow/src/record\_batch.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvcmVjb3JkX2JhdGNoLnJz) | `83.96% <100.00%> (+10.13%)` | :arrow_up: |
   | [rust/parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJvd19yZWFkZXIucnM=) | `91.57% <100.00%> (+0.18%)` | :arrow_up: |
   | [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.52% <100.00%> (+0.58%)` | :arrow_up: |
   | ... and [3 more](https://codecov.io/gh/apache/arrow/pull/9089/diff?src=pr&el=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=footer). Last 
update 
[a745299...887df6e](https://codecov.io/gh/apache/arrow/pull/9089?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] liyafan82 commented on a change in pull request #9053: ARROW-11081: [Java] Make IPC option immutable

2021-01-05 Thread GitBox


liyafan82 commented on a change in pull request #9053:
URL: https://github.com/apache/arrow/pull/9053#discussion_r551933603



##
File path: 
java/flight/flight-core/src/main/java/org/apache/arrow/flight/ArrowMessage.java
##
@@ -194,10 +194,9 @@ public ArrowMessage(FlightDescriptor descriptor) {
   private ArrowMessage(FlightDescriptor descriptor, MessageMetadataResult 
message, ArrowBuf appMetadata,
ArrowBuf buf) {
 // No need to take IpcOption as this is used for deserialized ArrowMessage 
coming from the wire.
-this.writeOption = new IpcOption();
-if (message != null) {
-  this.writeOption.metadataVersion = 
MetadataVersion.fromFlatbufID(message.getMessage().version());
-}
+this.writeOption = message != null ?
+new IpcOption(false, 
MetadataVersion.fromFlatbufID(message.getMessage().version())) :

Review comment:
   done. thanks for the good suggestion. 









[GitHub] [arrow] liyafan82 commented on a change in pull request #9053: ARROW-11081: [Java] Make IPC option immutable

2021-01-05 Thread GitBox


liyafan82 commented on a change in pull request #9053:
URL: https://github.com/apache/arrow/pull/9053#discussion_r551933913



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestMetadataVersion.java
##
@@ -56,9 +56,8 @@ public static void setUpClass() {
 schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true))));
 unionSchema = new Schema(
 Collections.singletonList(Field.nullable("union", new ArrowType.Union(UnionMode.Dense, new int[]{0}))));
-optionV4 = new IpcOption();
-optionV4.metadataVersion = MetadataVersion.V4;
-optionV5 = new IpcOption();
+optionV4 = new IpcOption(false, MetadataVersion.V4);

Review comment:
   commented. thank you.









[GitHub] [arrow] liyucheng09 commented on pull request #8386: ARROW-10224: [Python] Add support for Python 3.9 except macOS wheel and Windows wheel

2021-01-05 Thread GitBox


liyucheng09 commented on pull request #8386:
URL: https://github.com/apache/arrow/pull/8386#issuecomment-754640646


   @terencehonles Thanks a lot. I installed pyarrow successfully.
   
   I was just curious that when I typed `pip install pyarrow`, the error log showed that it stopped while installing numpy, which is already installed in my environment (maybe the version is too new?).
   
   I thought that figuring it out might help me understand the Python package installation process more deeply.







[GitHub] [arrow] alamb commented on pull request #8882: ARROW-10864: [Rust] Use standard ordering for floats

2021-01-05 Thread GitBox


alamb commented on pull request #8882:
URL: https://github.com/apache/arrow/pull/8882#issuecomment-754640846


   I will try and look at this PR later today







[GitHub] [arrow] pitrou commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551938144



##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -345,6 +345,18 @@ Result<Datum> IsValid(const Datum& values, ExecContext* ctx = NULLPTR);
 ARROW_EXPORT
 Result<Datum> IsNull(const Datum& values, ExecContext* ctx = NULLPTR);
 
+/// \brief IsNan returns true for each element of `values` that is NaN,
+/// false otherwise
+///
+/// \param[in] values input to look for NaN
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since X.X.X

Review comment:
   This would be 3.0.0 if this PR is merged in a week or two.

##
File path: cpp/src/arrow/compute/kernels/scalar_validity_test.cc
##
@@ -31,61 +31,98 @@
 namespace arrow {
 namespace compute {
 
+template <typename ArrowType>
 class TestValidityKernels : public ::testing::Test {
  protected:
-  // XXX Since IsValid and IsNull don't touch any buffers but the null bitmap
-  // testing multiple types seems redundant.
-  using ArrowType = BooleanType;
-
   static std::shared_ptr<DataType> type_singleton() {
 return TypeTraits<ArrowType>::type_singleton();
   }
 };
 
-TEST_F(TestValidityKernels, ArrayIsValid) {
+typedef TestValidityKernels<BooleanType> TestBooleanValidityKernels;
+typedef TestValidityKernels<FloatType> TestFloatValidityKernels;
+typedef TestValidityKernels<DoubleType> TestDoubleValidityKernels;
+
+TEST_F(TestBooleanValidityKernels, ArrayIsValid) {
   CheckScalarUnary("is_valid", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("is_valid", type_singleton(), "[null]", type_singleton(), 
"[false]");
   CheckScalarUnary("is_valid", type_singleton(), "[1]", type_singleton(), 
"[true]");
   CheckScalarUnary("is_valid", type_singleton(), "[null, 1, 0, null]", 
type_singleton(),
"[false, true, true, false]");
 }
 
-TEST_F(TestValidityKernels, IsValidIsNullNullType) {
+TEST_F(TestBooleanValidityKernels, IsValidIsNullNullType) {
   CheckScalarUnary("is_null", std::make_shared<NullArray>(5),
ArrayFromJSON(boolean(), "[true, true, true, true, true]"));
   CheckScalarUnary("is_valid", std::make_shared<NullArray>(5),
ArrayFromJSON(boolean(), "[false, false, false, false, 
false]"));
 }
 
-TEST_F(TestValidityKernels, ArrayIsValidBufferPassthruOptimization) {
+TEST_F(TestBooleanValidityKernels, ArrayIsValidBufferPassthruOptimization) {
   Datum arg = ArrayFromJSON(boolean(), "[null, 1, 0, null]");
   ASSERT_OK_AND_ASSIGN(auto validity, arrow::compute::IsValid(arg));
   ASSERT_EQ(validity.array()->buffers[1], arg.array()->buffers[0]);
 }
 
-TEST_F(TestValidityKernels, ScalarIsValid) {
+TEST_F(TestBooleanValidityKernels, ScalarIsValid) {
   CheckScalarUnary("is_valid", MakeScalar(19.7), MakeScalar(true));
   CheckScalarUnary("is_valid", MakeNullScalar(float64()), MakeScalar(false));
 }
 
-TEST_F(TestValidityKernels, ArrayIsNull) {
+TEST_F(TestBooleanValidityKernels, ArrayIsNull) {
   CheckScalarUnary("is_null", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("is_null", type_singleton(), "[null]", type_singleton(), 
"[true]");
   CheckScalarUnary("is_null", type_singleton(), "[1]", type_singleton(), 
"[false]");
   CheckScalarUnary("is_null", type_singleton(), "[null, 1, 0, null]", 
type_singleton(),
"[true, false, false, true]");
 }
 
-TEST_F(TestValidityKernels, IsNullSetsZeroNullCount) {
+TEST_F(TestBooleanValidityKernels, IsNullSetsZeroNullCount) {
   auto arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
   std::shared_ptr<ArrayData> result = (*IsNull(arr)).array();
   ASSERT_EQ(result->null_count, 0);
 }
 
-TEST_F(TestValidityKernels, ScalarIsNull) {
+TEST_F(TestBooleanValidityKernels, ScalarIsNull) {
   CheckScalarUnary("is_null", MakeScalar(19.7), MakeScalar(false));
   CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true));
 }
 
+TEST_F(TestFloatValidityKernels, FloatArrayIsNan) {

Review comment:
   Note that you can write type-parameterized tests, though it's perhaps 
not very useful here:
   
https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#typed-tests

##
File path: cpp/src/arrow/compute/kernels/scalar_validity_test.cc
##
@@ -31,61 +31,98 @@
 namespace arrow {
 namespace compute {
 
+template <typename ArrowType>
 class TestValidityKernels : public ::testing::Test {
  protected:
-  // XXX Since IsValid and IsNull don't touch any buffers but the null bitmap
-  // testing multiple types seems redundant.
-  using ArrowType = BooleanType;
-
   static std::shared_ptr<DataType> type_singleton() {
 return TypeTraits<ArrowType>::type_singleton();
   }
 };
 
-TEST_F(TestValidityKernels, ArrayIsValid) {
+typedef TestValidityKernels<BooleanType> TestBooleanValidityKernels;
+typedef TestValidityKernels<FloatType> TestFloatValidityKernels;
+typedef TestValidityKernels<DoubleType> TestDoubleValidityKernels;
+
+TEST_F(TestBooleanValidityKernels, ArrayIsValid) {
   CheckScalarUnary("is_valid", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("i

[GitHub] [arrow] liyafan82 commented on pull request #9053: ARROW-11081: [Java] Make IPC option immutable

2021-01-05 Thread GitBox


liyafan82 commented on pull request #9053:
URL: https://github.com/apache/arrow/pull/9053#issuecomment-754651491


   > @liyafan82 does this actually make a difference in benchmarks? I agree it is easier to reason about, but is there any way to avoid backward incompatibility?
   
   @emkornfield Sorry, I do not have a benchmark result. I think it should be a positive change, as it avoids object allocations.
   To avoid backward incompatibility, I have restored the default constructor.







[GitHub] [arrow] lidavidm commented on a change in pull request #8963: ARROW-10962: [FlightRPC][Java] fill in empty body buffer if needed

2021-01-05 Thread GitBox


lidavidm commented on a change in pull request #8963:
URL: https://github.com/apache/arrow/pull/8963#discussion_r551954971



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestBasicOperation.java
##
@@ -317,6 +325,71 @@ private void test(BiConsumer consumer) throws Exc
 }
   }
 
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufRecordBatchCompatibility() throws Exception {
+final Schema schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE);
+ final VectorSchemaRoot root = VectorSchemaRoot.create(schema, 
allocator)) {
+  final VectorUnloader unloader = new VectorUnloader(root);
+  root.setRowCount(0);
+  final MethodDescriptor.Marshaller<ArrowMessage> marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  try (final ArrowMessage message = new 
ArrowMessage(unloader.getRecordBatch(), null, new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.RECORD_BATCH, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage);
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  final ArrowRecordBatch rb =
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asRecordBatch();
+  Assert.assertEquals(rb.computeBodyLength(), 0);
+}
+  }
+
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufSchemaCompatibility() throws Exception {
+final Schema schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE)) {
+  final MethodDescriptor.Marshaller<ArrowMessage> marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  Flight.FlightDescriptor descriptor = FlightDescriptor.command(new 
byte[0]).toProtocol();
+  try (final ArrowMessage message = new ArrowMessage(descriptor, schema, 
new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.SCHEMA, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage)
+  .toBuilder()
+  .setDataBody(ByteString.EMPTY)
+  .build();
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asSchema();

Review comment:
   The schema itself has no body. I've added a test to check the parsed 
message.
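   For context, the root cause here is a proto3 wire-format rule: a scalar field at its default value (an empty `bytes` body) is omitted entirely when serializing, and a missing field parses back to its default, i.e. an empty rather than null body. A toy sketch of that behaviour (single-byte tags and lengths assumed; this is illustrative Python, not the real protobuf or Flight code):

```python
def encode_bytes_field(field_number, value):
    # proto3 omits a field whose value equals the default (empty bytes here).
    if not value:
        return b""
    key = (field_number << 3) | 2  # wire type 2 = length-delimited
    return bytes([key, len(value)]) + value  # toy: single-byte tag/length

def decode_bytes_field(data, field_number):
    # A field absent from the wire decodes to its default (empty bytes),
    # which is why parsers must tolerate a missing data body.
    key = (field_number << 3) | 2
    i = 0
    while i < len(data):
        if data[i] == key:
            length = data[i + 1]
            return bytes(data[i + 2:i + 2 + length])
        i += 2 + data[i + 1]  # skip this length-delimited field
    return b""

print(encode_bytes_field(2, b""))   # b'' -- nothing on the wire
print(decode_bytes_field(b"", 2))   # b'' -- default filled back in
print(decode_bytes_field(encode_bytes_field(2, b"abc"), 2))  # b'abc'
```

   This mirrors why `protobufData.getDataBody().size()` is 0 in the test and why `ArrowMessage` must fill in an empty body buffer when parsing.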









[GitHub] [arrow] lidavidm commented on a change in pull request #8963: ARROW-10962: [FlightRPC][Java] fill in empty body buffer if needed

2021-01-05 Thread GitBox


lidavidm commented on a change in pull request #8963:
URL: https://github.com/apache/arrow/pull/8963#discussion_r551959223



##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestBasicOperation.java
##
@@ -317,6 +325,71 @@ private void test(BiConsumer consumer) throws Exc
 }
   }
 
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufRecordBatchCompatibility() throws Exception {
+final Schema schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE);
+ final VectorSchemaRoot root = VectorSchemaRoot.create(schema, 
allocator)) {
+  final VectorUnloader unloader = new VectorUnloader(root);
+  root.setRowCount(0);
+  final MethodDescriptor.Marshaller<ArrowMessage> marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  try (final ArrowMessage message = new 
ArrowMessage(unloader.getRecordBatch(), null, new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.RECORD_BATCH, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage);
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  final ArrowRecordBatch rb =
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asRecordBatch();
+  Assert.assertEquals(rb.computeBodyLength(), 0);
+}
+  }
+
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufSchemaCompatibility() throws Exception {
+final Schema schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE)) {
+  final MethodDescriptor.Marshaller<ArrowMessage> marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  Flight.FlightDescriptor descriptor = FlightDescriptor.command(new 
byte[0]).toProtocol();
+  try (final ArrowMessage message = new ArrowMessage(descriptor, schema, 
new IpcOption())) {

Review comment:
   I've added assertions here.

##
File path: 
java/flight/flight-core/src/test/java/org/apache/arrow/flight/TestBasicOperation.java
##
@@ -317,6 +325,71 @@ private void test(BiConsumer consumer) throws Exc
 }
   }
 
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufRecordBatchCompatibility() throws Exception {
+final Schema schema = new Schema(Collections.singletonList(Field.nullable("foo", new ArrowType.Int(32, true))));
+try (final BufferAllocator allocator = new 
RootAllocator(Integer.MAX_VALUE);
+ final VectorSchemaRoot root = VectorSchemaRoot.create(schema, 
allocator)) {
+  final VectorUnloader unloader = new VectorUnloader(root);
+  root.setRowCount(0);
+  final MethodDescriptor.Marshaller<ArrowMessage> marshaller = 
ArrowMessage.createMarshaller(allocator);
+  final ByteArrayOutputStream baos = new ByteArrayOutputStream();
+  try (final ArrowMessage message = new 
ArrowMessage(unloader.getRecordBatch(), null, new IpcOption())) {
+Assert.assertEquals(ArrowMessage.HeaderType.RECORD_BATCH, 
message.getMessageType());
+try (final InputStream serialized = marshaller.stream(message)) {
+  final byte[] buf = new byte[1024];
+  while (true) {
+int read = serialized.read(buf);
+if (read < 0) {
+  break;
+}
+baos.write(buf, 0, read);
+  }
+}
+  }
+  final byte[] serializedMessage = baos.toByteArray();
+  final Flight.FlightData protobufData = 
Flight.FlightData.parseFrom(serializedMessage);
+  Assert.assertEquals(0, protobufData.getDataBody().size());
+  // Should not throw
+  final ArrowRecordBatch rb =
+  marshaller.parse(new 
ByteArrayInputStream(protobufData.toByteArray())).asRecordBatch();
+  Assert.assertEquals(rb.computeBodyLength(), 0);
+}
+  }
+
+  /** ARROW-10962: accept FlightData messages generated by Protobuf (which can 
omit empty fields). */
+  @Test
+  public void testProtobufSchemaCompatibility() throws Exception {
+final Schema schem

[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551960501



##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -345,6 +345,18 @@ Result<Datum> IsValid(const Datum& values, ExecContext* ctx = NULLPTR);
 ARROW_EXPORT
 Result<Datum> IsNull(const Datum& values, ExecContext* ctx = NULLPTR);
 
+/// \brief IsNan returns true for each element of `values` that is NaN,
+/// false otherwise
+///
+/// \param[in] values input to look for NaN
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since X.X.X

Review comment:
   Thanks @pitrou for your review; I will update the PR today, answering all 
points. Should I rebase first?









[GitHub] [arrow] jorisvandenbossche commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-754672429


   Trying this out locally, I see the following strange behaviour:
   
   ```
   In [90]: size = 300
   
   In [91]: table = pa.table({'str': [str(x) for x in range(size)]})
   
   In [92]: pq.write_table(table, "test.parquet", row_group_size=100)
   
   In [93]: f = pq.ParquetFile("test.parquet")
   
   In [94]: [b.num_rows for b in f.iter_batches(batch_size=80)]
   Out[94]: [80, 80, 80, 60]
   
   In [95]: table = pa.table({'str': pd.Categorical([str(x) for x in 
range(size)])})
   
   In [96]: pq.write_table(table, "test.parquet", row_group_size=100)
   
   In [97]: f = pq.ParquetFile("test.parquet")
   
   In [98]: [b.num_rows for b in f.iter_batches(batch_size=80)]
   Out[98]: [80, 20, 60, 40, 40, 60]
   ```
   
   So by having a dictionary type, the size of the batches is rather strange. 
Now, it was already discussed above that with categorical data, it was behaving 
differently (and that's also the reason the tests don't include categorical for 
now). So since you are mostly exposing the existing feature in python, and not 
actually implementing it, I suppose this is an existing bug in the batch_size 
handling of the parquet C++ record batch reader, and it can be seen as a bug to 
report/fix separately?
   
   > I can confirm that once I merge in the latest changes from apache master, 
I am getting the batches to be the expected size (and spanning across the 
parquet chunks). I have updated the test_iter_batches_columns_reader unit test 
accordingly.
   
   That's again something tied to the C++ implementation, but I would actually 
not expect the baches to cross different row groups (so for example for a row 
group size of 1000, total size of 2000 and batch size of 300, I would expect 
batches of [300, 300, 300, 100, 300, 300, 300, 100] instead of [300, 300, 300, 
300, 300, 300, 200]. 
   But so maybe something to open a JIRA for to follow-up on.
   
   > Looking at the code, no longer think this `batch_size` parameter actually 
would affect those other read methods.
   > 
   > There are a few different "batch_size" parameters floating around the 
`reader.cc`, but there's only one reference to the one in `reader_properties_` 
(`ArrowReaderProperties`):
   > ...
   > As far as I can tell, that's exclusively on the code path for the 
RecordBatchReader, and not the other readers. So I don't think we need to add 
the parameter to those other methods.
   
   That indeed seems to be the case







[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551969566



##
File path: cpp/src/arrow/compute/kernels/common.h
##
@@ -19,6 +19,7 @@
 
 // IWYU pragma: begin_exports
 
+#include <cmath>

Review comment:
   Initially I included `<cmath>` to use `std::isnan()`. I thought this header 
would be quite common to most kernels... my assumption might be wrong, though?









[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551969646



##
File path: cpp/src/arrow/compute/kernels/scalar_validity.cc
##
@@ -132,6 +156,11 @@ const FunctionDoc is_null_doc("Return true if null",
   ("For each input value, emit true iff the value 
is null."),
   {"values"});
 
+const FunctionDoc is_nan_doc("Return true if NaN",
+ ("For each input value, emit true iff the value 
is NaN "
+  "(according to std::isnan(value))."),

Review comment:
   Ok









[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551970210



##
File path: cpp/src/arrow/compute/kernels/scalar_validity_test.cc
##
@@ -31,61 +31,98 @@
 namespace arrow {
 namespace compute {
 
+template <typename ArrowType>
 class TestValidityKernels : public ::testing::Test {
  protected:
-  // XXX Since IsValid and IsNull don't touch any buffers but the null bitmap
-  // testing multiple types seems redundant.
-  using ArrowType = BooleanType;
-
   static std::shared_ptr<DataType> type_singleton() {
 return TypeTraits<ArrowType>::type_singleton();
   }
 };
 
-TEST_F(TestValidityKernels, ArrayIsValid) {
+typedef TestValidityKernels<BooleanType> TestBooleanValidityKernels;
+typedef TestValidityKernels<FloatType> TestFloatValidityKernels;
+typedef TestValidityKernels<DoubleType> TestDoubleValidityKernels;
+
+TEST_F(TestBooleanValidityKernels, ArrayIsValid) {
   CheckScalarUnary("is_valid", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("is_valid", type_singleton(), "[null]", type_singleton(), 
"[false]");
   CheckScalarUnary("is_valid", type_singleton(), "[1]", type_singleton(), 
"[true]");
   CheckScalarUnary("is_valid", type_singleton(), "[null, 1, 0, null]", 
type_singleton(),
"[false, true, true, false]");
 }
 
-TEST_F(TestValidityKernels, IsValidIsNullNullType) {
+TEST_F(TestBooleanValidityKernels, IsValidIsNullNullType) {
   CheckScalarUnary("is_null", std::make_shared<NullArray>(5),
ArrayFromJSON(boolean(), "[true, true, true, true, true]"));
   CheckScalarUnary("is_valid", std::make_shared<NullArray>(5),
ArrayFromJSON(boolean(), "[false, false, false, false, 
false]"));
 }
 
-TEST_F(TestValidityKernels, ArrayIsValidBufferPassthruOptimization) {
+TEST_F(TestBooleanValidityKernels, ArrayIsValidBufferPassthruOptimization) {
   Datum arg = ArrayFromJSON(boolean(), "[null, 1, 0, null]");
   ASSERT_OK_AND_ASSIGN(auto validity, arrow::compute::IsValid(arg));
   ASSERT_EQ(validity.array()->buffers[1], arg.array()->buffers[0]);
 }
 
-TEST_F(TestValidityKernels, ScalarIsValid) {
+TEST_F(TestBooleanValidityKernels, ScalarIsValid) {
   CheckScalarUnary("is_valid", MakeScalar(19.7), MakeScalar(true));
   CheckScalarUnary("is_valid", MakeNullScalar(float64()), MakeScalar(false));
 }
 
-TEST_F(TestValidityKernels, ArrayIsNull) {
+TEST_F(TestBooleanValidityKernels, ArrayIsNull) {
   CheckScalarUnary("is_null", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("is_null", type_singleton(), "[null]", type_singleton(), 
"[true]");
   CheckScalarUnary("is_null", type_singleton(), "[1]", type_singleton(), 
"[false]");
   CheckScalarUnary("is_null", type_singleton(), "[null, 1, 0, null]", 
type_singleton(),
"[true, false, false, true]");
 }
 
-TEST_F(TestValidityKernels, IsNullSetsZeroNullCount) {
+TEST_F(TestBooleanValidityKernels, IsNullSetsZeroNullCount) {
   auto arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
   std::shared_ptr<ArrayData> result = (*IsNull(arr)).array();
   ASSERT_EQ(result->null_count, 0);
 }
 
-TEST_F(TestValidityKernels, ScalarIsNull) {
+TEST_F(TestBooleanValidityKernels, ScalarIsNull) {
   CheckScalarUnary("is_null", MakeScalar(19.7), MakeScalar(false));
   CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true));
 }
 
+TEST_F(TestFloatValidityKernels, FloatArrayIsNan) {
+  // All NaN
+  CheckScalarUnary("is_nan", ArrayFromJSON(float32(), "[NaN, NaN, NaN, NaN, 
NaN]"),
+   ArrayFromJSON(boolean(), "[true, true, true, true, true]"));
+  // No NaN
+  CheckScalarUnary("is_nan", ArrayFromJSON(float32(), "[0.0, 1.0, 2.0, 3.0, 
4.0, null]"),
+   ArrayFromJSON(boolean(), "[false, false, false, false, 
false, null]"));
+  // Some NaNs
+  CheckScalarUnary("is_nan", ArrayFromJSON(float32(), "[0.0, NaN, 2.0, NaN, 
4.0, null]"),
+   ArrayFromJSON(boolean(), "[false, true, false, true, false, 
null]"));
+}
+
+TEST_F(TestDoubleValidityKernels, DoubleArrayIsNan) {
+  // All NaN
+  CheckScalarUnary("is_nan", ArrayFromJSON(float64(), "[NaN, NaN, NaN, NaN, 
NaN]"),
+   ArrayFromJSON(boolean(), "[true, true, true, true, true]"));
+  // No NaN
+  CheckScalarUnary("is_nan", ArrayFromJSON(float64(), "[0.0, 1.0, 2.0, 3.0, 
4.0, null]"),
+   ArrayFromJSON(boolean(), "[false, false, false, false, 
false, null]"));
+  // Some NaNs
+  CheckScalarUnary("is_nan", ArrayFromJSON(float64(), "[0.0, NaN, 2.0, NaN, 
4.0, null]"),
+   ArrayFromJSON(boolean(), "[false, true, false, true, false, 
null]"));
+}
+
+TEST_F(TestFloatValidityKernels, FloatScalarIsNan) {
+  CheckScalarUnary("is_nan", MakeNullScalar(float32()), 
MakeNullScalar(boolean()));
+  CheckScalarUnary("is_nan", MakeScalar(42.0), MakeScalar(false));

Review comment:
   Yes, thanks.
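   As a side note, the null-propagation semantics exercised by the tests above can be mimicked in a few lines of Python (`is_nan_kernel` is an illustrative stand-in for the C++ kernel, with `None` standing in for null):

```python
import math

def is_nan_kernel(values):
    # Nulls propagate; otherwise emit True iff the element is NaN,
    # matching std::isnan() semantics for floats and doubles.
    return [None if v is None else math.isnan(v) for v in values]

print(is_nan_kernel([float("nan")] * 3))
# [True, True, True]
print(is_nan_kernel([0.0, float("nan"), 2.0, float("nan"), 4.0, None]))
# [False, True, False, True, False, None]
```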






[GitHub] [arrow] pitrou commented on pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-05 Thread GitBox


pitrou commented on pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#issuecomment-754673873


   I wouldn't expect a "replace" operation to do this. Instead, this looks more 
like a "select" operation. @wesm What do you think?







[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551971142



##
File path: docs/source/cpp/compute.rst
##
@@ -453,22 +453,26 @@ Structural transforms
 
+==+++=+=+
 | fill_null| Binary | Boolean, Null, Numeric, Temporal, 
String-like  | Input type  | \(1)|
 
+--+++-+-+
-| is_null  | Unary  | Any  
  | Boolean | \(2)|
+| is_nan   | Unary  | Float, Double
  | Boolean | \(2)|
 
+--+++-+-+
-| is_valid | Unary  | Any  
  | Boolean | \(2)|
+| is_null  | Unary  | Any  
  | Boolean | \(3)|
 
+--+++-+-+
-| list_value_length| Unary  | List-like
  | Int32 or Int64  | \(4)|
+| is_valid | Unary  | Any  
  | Boolean | \(4)|
++--+++-+-+
+| list_value_length| Unary  | List-like
  | Int32 or Int64  | \(5)|
 
+--+++-+-+
 
 * \(1) First input must be an array, second input a scalar of the same type.
   Output is an array of the same type as the inputs, and with the same values
   as the first input, except for nulls replaced with the second input value.
 
-* \(2) Output is true iff the corresponding input element is non-null.
+* \(2) Output is true iff the corresponding input element is NaN according to 
std::isnan(value).

Review comment:
   Ok









[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551972306



##
File path: cpp/src/arrow/compute/kernels/scalar_validity_test.cc
##
@@ -31,61 +31,98 @@
 namespace arrow {
 namespace compute {
 
+template <typename ArrowType>
 class TestValidityKernels : public ::testing::Test {
  protected:
-  // XXX Since IsValid and IsNull don't touch any buffers but the null bitmap
-  // testing multiple types seems redundant.
-  using ArrowType = BooleanType;
-
   static std::shared_ptr<DataType> type_singleton() {
 return TypeTraits<ArrowType>::type_singleton();
   }
 };
 
-TEST_F(TestValidityKernels, ArrayIsValid) {
+typedef TestValidityKernels<BooleanType> TestBooleanValidityKernels;
+typedef TestValidityKernels<FloatType> TestFloatValidityKernels;
+typedef TestValidityKernels<DoubleType> TestDoubleValidityKernels;

Review comment:
   Ok









[GitHub] [arrow] bkietz closed pull request #9102: ARROW-10955: [C++] Fix JSON reading of list(null) values

2021-01-05 Thread GitBox


bkietz closed pull request #9102:
URL: https://github.com/apache/arrow/pull/9102


   







[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551974405



##
File path: cpp/src/arrow/compute/kernels/scalar_validity_test.cc
##
@@ -31,61 +31,98 @@
 namespace arrow {
 namespace compute {
 
+template <typename ArrowType>
 class TestValidityKernels : public ::testing::Test {
  protected:
-  // XXX Since IsValid and IsNull don't touch any buffers but the null bitmap
-  // testing multiple types seems redundant.
-  using ArrowType = BooleanType;
-
   static std::shared_ptr<DataType> type_singleton() {
 return TypeTraits<ArrowType>::type_singleton();
   }
 };
 
-TEST_F(TestValidityKernels, ArrayIsValid) {
+typedef TestValidityKernels<BooleanType> TestBooleanValidityKernels;
+typedef TestValidityKernels<FloatType> TestFloatValidityKernels;
+typedef TestValidityKernels<DoubleType> TestDoubleValidityKernels;
+
+TEST_F(TestBooleanValidityKernels, ArrayIsValid) {
   CheckScalarUnary("is_valid", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("is_valid", type_singleton(), "[null]", type_singleton(), 
"[false]");
   CheckScalarUnary("is_valid", type_singleton(), "[1]", type_singleton(), 
"[true]");
   CheckScalarUnary("is_valid", type_singleton(), "[null, 1, 0, null]", 
type_singleton(),
"[false, true, true, false]");
 }
 
-TEST_F(TestValidityKernels, IsValidIsNullNullType) {
+TEST_F(TestBooleanValidityKernels, IsValidIsNullNullType) {
   CheckScalarUnary("is_null", std::make_shared<NullArray>(5),
ArrayFromJSON(boolean(), "[true, true, true, true, true]"));
   CheckScalarUnary("is_valid", std::make_shared<NullArray>(5),
ArrayFromJSON(boolean(), "[false, false, false, false, 
false]"));
 }
 
-TEST_F(TestValidityKernels, ArrayIsValidBufferPassthruOptimization) {
+TEST_F(TestBooleanValidityKernels, ArrayIsValidBufferPassthruOptimization) {
   Datum arg = ArrayFromJSON(boolean(), "[null, 1, 0, null]");
   ASSERT_OK_AND_ASSIGN(auto validity, arrow::compute::IsValid(arg));
   ASSERT_EQ(validity.array()->buffers[1], arg.array()->buffers[0]);
 }
 
-TEST_F(TestValidityKernels, ScalarIsValid) {
+TEST_F(TestBooleanValidityKernels, ScalarIsValid) {
   CheckScalarUnary("is_valid", MakeScalar(19.7), MakeScalar(true));
   CheckScalarUnary("is_valid", MakeNullScalar(float64()), MakeScalar(false));
 }
 
-TEST_F(TestValidityKernels, ArrayIsNull) {
+TEST_F(TestBooleanValidityKernels, ArrayIsNull) {
   CheckScalarUnary("is_null", type_singleton(), "[]", type_singleton(), "[]");
   CheckScalarUnary("is_null", type_singleton(), "[null]", type_singleton(), 
"[true]");
   CheckScalarUnary("is_null", type_singleton(), "[1]", type_singleton(), 
"[false]");
   CheckScalarUnary("is_null", type_singleton(), "[null, 1, 0, null]", 
type_singleton(),
"[true, false, false, true]");
 }
 
-TEST_F(TestValidityKernels, IsNullSetsZeroNullCount) {
+TEST_F(TestBooleanValidityKernels, IsNullSetsZeroNullCount) {
   auto arr = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
   std::shared_ptr<ArrayData> result = (*IsNull(arr)).array();
   ASSERT_EQ(result->null_count, 0);
 }
 
-TEST_F(TestValidityKernels, ScalarIsNull) {
+TEST_F(TestBooleanValidityKernels, ScalarIsNull) {
   CheckScalarUnary("is_null", MakeScalar(19.7), MakeScalar(false));
   CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true));
 }
 
+TEST_F(TestFloatValidityKernels, FloatArrayIsNan) {

Review comment:
   Ok, thanks for the link.









[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2021-01-05 Thread GitBox


jorisvandenbossche commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r551972959



##
File path: python/pyarrow/parquet.py
##
@@ -319,6 +319,44 @@ def read_row_groups(self, row_groups, columns=None, 
use_threads=True,
column_indices=column_indices,
use_threads=use_threads)
 
+def iter_batches(self, batch_size=65536, row_groups=None, columns=None,
+ use_threads=True, use_pandas_metadata=False):
+"""
+Read streaming batches from a Parquet file
+
+Parameters
+--
+batch_size: int, default 64K
+Maximum number of records to yield per batch. Batches may be
+smaller if there aren't enough rows in a rowgroup.

Review comment:
   ```suggestion
   smaller if there aren't enough rows in the file.
   ```
   
   ? (given that it currently returns rows across row groups)

##
File path: python/pyarrow/parquet.py
##
@@ -319,6 +319,44 @@ def read_row_groups(self, row_groups, columns=None, 
use_threads=True,
column_indices=column_indices,
use_threads=use_threads)
 
+def iter_batches(self, batch_size=65536, row_groups=None, columns=None,
+ use_threads=True, use_pandas_metadata=False):
+"""
+Read streaming batches from a Parquet file
+
+Parameters
+--
+batch_size: int, default 64K
+Maximum number of records to yield per batch. Batches may be
+smaller if there aren't enough rows in a rowgroup.
+row_groups: list
+Only these row groups will be read from the file.
+columns: list
+If not None, only these columns will be read from the file. A
+column name may be a prefix of a nested field, e.g. 'a' will select
+'a.b', 'a.c', and 'a.d.e'
+use_threads : boolean, default True
+Perform multi-threaded column reads
+use_pandas_metadata : boolean, default False
+If True and file has custom pandas schema metadata, ensure that
+index columns are also loaded

Review comment:
   ```suggestion
   columns: list
   If not None, only these columns will be read from the file. A
   column name may be a prefix of a nested field, e.g. 'a' will 
select
   'a.b', 'a.c', and 'a.d.e'.
   use_threads : boolean, default True
   Perform multi-threaded column reads.
   use_pandas_metadata : boolean, default False
   If True and file has custom pandas schema metadata, ensure that
   index columns are also loaded.
   ```
   
   (minor nitpick on punctuation)









[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-05 Thread GitBox


bkietz commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r551978453



##
File path: cpp/src/arrow/dataset/partition_test.cc
##
@@ -21,52 +21,51 @@
 #include 
 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 
-#include "arrow/dataset/file_base.h"
 #include "arrow/dataset/scanner_internal.h"
 #include "arrow/dataset/test_util.h"
-#include "arrow/filesystem/localfs.h"
 #include "arrow/filesystem/path_util.h"
 #include "arrow/status.h"
 #include "arrow/testing/gtest_util.h"
-#include "arrow/util/io_util.h"
 
 namespace arrow {
 using internal::checked_pointer_cast;
 
 namespace dataset {
 
-using E = TestExpression;
-
 class TestPartitioning : public ::testing::Test {
  public:
   void AssertParseError(const std::string& path) {
 ASSERT_RAISES(Invalid, partitioning_->Parse(path));
   }
 
-  void AssertParse(const std::string& path, E expected) {
+  void AssertParse(const std::string& path, Expression expected) {
 ASSERT_OK_AND_ASSIGN(auto parsed, partitioning_->Parse(path));
-ASSERT_EQ(E{parsed}, expected);
+ASSERT_EQ(parsed, expected);
   }
 
   template 
-  void AssertFormatError(E expr) {
-ASSERT_EQ(partitioning_->Format(*expr.expression).status().code(), code);
+  void AssertFormatError(Expression expr) {
+ASSERT_EQ(partitioning_->Format(expr).status().code(), code);
   }
 
-  void AssertFormat(E expr, const std::string& expected) {
-ASSERT_OK_AND_ASSIGN(auto formatted, 
partitioning_->Format(*expr.expression));
+  void AssertFormat(Expression expr, const std::string& expected) {
+// formatted partition expressions are bound to the schema of the dataset 
being
+// written
+ASSERT_OK_AND_ASSIGN(auto formatted, partitioning_->Format(expr));
 ASSERT_EQ(formatted, expected);
 
 // ensure the formatted path round trips the relevant components of the 
partition
 // expression: roundtripped should be a subset of expr
-ASSERT_OK_AND_ASSIGN(auto roundtripped, partitioning_->Parse(formatted));
-ASSERT_EQ(E{roundtripped->Assume(*expr.expression)}, E{scalar(true)});
+ASSERT_OK_AND_ASSIGN(Expression roundtripped, 
partitioning_->Parse(formatted));
+
+ASSERT_OK_AND_ASSIGN(roundtripped, roundtripped.Bind(*written_schema_));

Review comment:
   Hmm, actually not: hive partitioning can parse arbitrarily ordered field 
values, but will always format them in schema order.
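
   The parse/format asymmetry described above can be illustrated with a toy model (plain Python; the function names here are hypothetical and not part of the actual Partitioning API):

```python
def parse_hive(path):
    # Hive-style segments may appear in any order in the path.
    return dict(seg.split("=", 1) for seg in path.strip("/").split("/"))

def format_hive(schema_order, values):
    # Formatting always emits fields in schema order.
    return "/".join(f"{name}={values[name]}"
                    for name in schema_order if name in values)

# Parsing accepts "/b=1/a=2" even though the schema order is [a, b] ...
assert parse_hive("/b=1/a=2") == {"a": "2", "b": "1"}
# ... but formatting those same values yields the schema-ordered path.
assert format_hive(["a", "b"], {"b": "1", "a": "2"}) == "a=2/b=1"
```

   Because of this, a formatted path can round-trip the partition values without round-tripping the original segment order.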





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs closed pull request #9103: ARROW-11132: [CI] Use pip to install crossbow's dependencies for the comment bot

2021-01-05 Thread GitBox


kszucs closed pull request #9103:
URL: https://github.com/apache/arrow/pull/9103


   







[GitHub] [arrow] jorisvandenbossche commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-754686227


   BTW, the Dataset API also gives a way to get an iterator over record batches (with `Dataset.to_batches()`). The strange thing is that it seems to use different logic for how many rows end up in each batch when crossing row groups, even though in the end it also uses `GetRecordBatchReader` with `batch_size` set in the reader properties:
   
   ```
   In [116]: table = pa.table({'str': [str(x) for x in range(size)], 'str_cat': 
pd.Categorical([str(x) for x in range(size)])})
   
   In [117]: pq.write_table(table, "test.parquet", row_group_size=100)
   
   In [118]: f = pq.ParquetFile("test.parquet")
   
   In [119]: [b.num_rows for b in f.iter_batches(batch_size=80)]
   Out[119]: [80, 20, 60, 40, 40, 60]
   
   In [120]: [b.num_rows for b in ds.dataset("test.parquet").to_batches()]
   Out[120]: [100, 100, 100]
   
   In [121]: [b.num_rows for b in 
ds.dataset("test.parquet").to_batches(batch_size=80)]
   Out[121]: [80, 20, 80, 20, 80, 20]
   ```







[GitHub] [arrow] jorisvandenbossche commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-754691714


   > The strange thing is that it seems to use different logic for how many rows end up in each batch when crossing row groups, even though in the end it also uses `GetRecordBatchReader` with `batch_size` set in the reader properties:
   
   Ah, that's because the RecordBatchReader gets constructed *for each* row group, so it never crosses row group boundaries.
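
   A toy model (a sketch, not the actual reader implementation) reproduces both batch-count sequences from the transcript above — a single reader whose 80-row batches are split at row-group boundaries, versus a fresh reader per row group whose offsets reset each time:

```python
def single_reader_sizes(row_groups, batch_size):
    # One reader over the whole file: batch offsets run through the file,
    # but a batch is still split at every row-group boundary.
    total = sum(row_groups)
    group_edges, acc = set(), 0
    for n in row_groups:
        acc += n
        group_edges.add(acc)
    batch_edges = set(range(batch_size, total, batch_size))
    cuts = sorted(group_edges | batch_edges | {total})
    prev, sizes = 0, []
    for c in cuts:
        sizes.append(c - prev)
        prev = c
    return sizes

def per_group_reader_sizes(row_groups, batch_size):
    # A fresh reader per row group: the batch offset resets at each group.
    sizes = []
    for n in row_groups:
        while n > 0:
            take = min(batch_size, n)
            sizes.append(take)
            n -= take
    return sizes

assert single_reader_sizes([100, 100, 100], 80) == [80, 20, 60, 40, 40, 60]
assert per_group_reader_sizes([100, 100, 100], 80) == [80, 20, 80, 20, 80, 20]
```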







[GitHub] [arrow] pitrou commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551989364



##
File path: cpp/src/arrow/compute/kernels/common.h
##
@@ -19,6 +19,7 @@
 
 // IWYU pragma: begin_exports
 
+#include 

Review comment:
   We only add an include to a header file when it is required by the header itself; otherwise it would inflate compilation times for nothing. Admittedly, `cmath` is probably not a large file.









[GitHub] [arrow] pitrou commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r551989576



##
File path: cpp/src/arrow/compute/api_scalar.h
##
@@ -345,6 +345,18 @@ Result IsValid(const Datum& values, ExecContext* 
ctx = NULLPTR);
 ARROW_EXPORT
 Result IsNull(const Datum& values, ExecContext* ctx = NULLPTR);
 
+/// \brief IsNan returns true for each element of `values` that is NaN,
+/// false otherwise
+///
+/// \param[in] values input to look for NaN
+/// \param[in] ctx the function execution context, optional
+/// \return the resulting datum
+///
+/// \since X.X.X

Review comment:
   Yes, you can rebase (before or after the changes, as you prefer).









[GitHub] [arrow] pokemaster7 opened a new issue #9104: Reading Feather File from Custom Offset

2021-01-05 Thread GitBox


pokemaster7 opened a new issue #9104:
URL: https://github.com/apache/arrow/issues/9104


   Is it possible to embed a feather file in another file (with known 
offset/length) and read the feather portion in a correct and performant way?
   
   Here is a naive idea of what I'm trying to do, though it throws an error for 
some reason:
   
   ``` python
   import pandas as pd
   import numpy as np
   import os
   
   df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), 
columns=list('ABCD'))
   pth = "TMP"
   with open(pth, "wb") as fh:
 fh.write(b"\x01") # custom header, one byte
 df.to_feather(fh)
   with open(pth, "rb") as gh:
 gh.read(1) # read header
  print(pd.read_feather(gh)) # throws 'pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file'
   ```

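   A possible workaround (a sketch, not an official pyarrow recipe) is to extract the embedded byte range yourself and hand the reader an in-memory, seekable file. Only the stdlib part below is exercised here; passing the result to a feather reader is an assumption:

```python
import io

def read_embedded(path, offset, length):
    # Pull out exactly the embedded byte range as a seekable in-memory
    # file object; a reader that accepts file-like input (for example,
    # hypothetically, pyarrow.feather.read_table) could consume it without
    # ever seeing the custom header.
    with open(path, "rb") as fh:
        fh.seek(offset)
        return io.BytesIO(fh.read(length))
```

   With a one-byte header as in the snippet above, `read_embedded(pth, 1, feather_length)` would return just the feather portion.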






[GitHub] [arrow] andygrove commented on pull request #9070: ARROW-11030: [Rust][DataFusion] Concatenate left side batches to single batch in HashJoinExec

2021-01-05 Thread GitBox


andygrove commented on pull request #9070:
URL: https://github.com/apache/arrow/pull/9070#issuecomment-754694469


   @Dandandan Yes, I personally think this is fine for now (hence the approval) 
and since no-one has objected I think we can go ahead and merge this.







[GitHub] [arrow] kszucs commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-05 Thread GitBox


kszucs commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754698886











[GitHub] [arrow] nbruno commented on a change in pull request #9088: ARROW-11114: [Java] Fix Schema and Field metadata JSON serialization

2021-01-05 Thread GitBox


nbruno commented on a change in pull request #9088:
URL: https://github.com/apache/arrow/pull/9088#discussion_r551997285



##
File path: 
java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestField.java
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.types.pojo;
+
+import static org.apache.arrow.vector.types.pojo.Schema.METADATA_KEY;
+import static org.apache.arrow.vector.types.pojo.Schema.METADATA_VALUE;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.arrow.vector.types.pojo.ArrowType.Int;
+import org.junit.Test;
+
+public class TestField {
+
+  private static Field field(String name, boolean nullable, ArrowType type, 
Map metadata) {
+return new Field(name, new FieldType(nullable, type, null, metadata), 
Collections.emptyList());
+  }
+
+  @Test
+  public void testMetadata() throws IOException {
+Map metadata = new HashMap<>(1);
+metadata.put("testKey", "testValue");
+
+Schema schema = new Schema(Collections.singletonList(
+field("a", false, new Int(8, true), metadata)
+));
+contains(schema, "\"" + METADATA_KEY + "\" : \"testKey\"", "\"" + 
METADATA_VALUE + "\" : \"testValue\"");
+
+String json = schema.toJson();
+Schema actual = Schema.fromJSON(json);
+
+Map actualMetadata = 
actual.getFields().get(0).getMetadata();
+assertEquals(1, actualMetadata.size());
+assertEquals("testValue", actualMetadata.get("testKey"));
+  }
+
+  private void contains(Schema schema, String... s) {

Review comment:
   Sounds good, done.









[GitHub] [arrow] kiszk commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-05 Thread GitBox


kiszk commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r551997558



##
File path: 
java/vector/src/main/java/org/apache/arrow/vector/compression/Lz4CompressionCodec.java
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import static org.apache.arrow.memory.util.MemoryUtil.LITTLE_ENDIAN;
+
+import java.nio.ByteBuffer;
+
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.MemoryUtil;
+import org.apache.arrow.util.Preconditions;
+
+import net.jpountz.lz4.LZ4Compressor;

Review comment:
   My guess is that this import refers to 
[this](https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4).









[GitHub] [arrow] jorisvandenbossche commented on pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#issuecomment-754703794


   This might actually rather be the "setitem" kernel as proposed in 
https://issues.apache.org/jira/browse/ARROW-9430 ? (which is something we want 
as well)







[GitHub] [arrow] jorgecarleitao commented on pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet

2021-01-05 Thread GitBox


jorgecarleitao commented on pull request #9099:
URL: https://github.com/apache/arrow/pull/9099#issuecomment-754707400


   Note that this is hanging in the CI, which I think is unrelated to the CI itself.







[GitHub] [arrow] andygrove commented on a change in pull request #9086: [Rust] [DataFusion] [Experiment] Blocking threads filter

2021-01-05 Thread GitBox


andygrove commented on a change in pull request #9086:
URL: https://github.com/apache/arrow/pull/9086#discussion_r552008306



##
File path: rust/datafusion/src/physical_plan/filter.rs
##
@@ -103,25 +103,23 @@ impl ExecutionPlan for FilterExec {
 }
 
 async fn execute(&self, partition: usize) -> 
Result {
-Ok(Box::pin(FilterExecStream {
-schema: self.input.schema().clone(),
-predicate: self.predicate.clone(),
-input: self.input.execute(partition).await?,
-}))
+let predicate = self.predicate.clone();
+
+let stream = self.input.execute(partition).await?;
+let stream = stream.then(move |batch| {
+let predicate = predicate.clone();
+async move {
+// Filtering batches is CPU-bounded and therefore justifies a 
dedicated thread pool
+task::spawn_blocking(move || batch_filter(&batch?, &predicate))
+.await
+.unwrap()
+}
+});
+
+Ok(Box::pin(SizedRecordBatchStream::new(stream, self.schema(

Review comment:
   Thanks @jorgecarleitao this seems to make a lot of sense.
   
   @alamb Was your comment a response to @Dandandan or an objection to using 
`spawn_blocking`? I wasn't sure.
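
   The pattern in the diff — handing CPU-bound batch filtering off to a dedicated blocking pool instead of running it on the async executor — has a close Python analogue in `asyncio.to_thread` (a sketch of the idea only; `batch_filter` here is a stand-in, not DataFusion's function):

```python
import asyncio

def batch_filter(batch, predicate):
    # CPU-bound work that should not run on the event-loop thread.
    return [row for row in batch if predicate(row)]

async def filtered_batches(batches, predicate):
    out = []
    for batch in batches:
        # Analogous to tokio's task::spawn_blocking: run the CPU-bound
        # call on a worker thread and await its result.
        out.append(await asyncio.to_thread(batch_filter, batch, predicate))
    return out

result = asyncio.run(
    filtered_batches([[1, 2, 3], [4, 5]], lambda x: x % 2 == 1))
assert result == [[1, 3], [5]]
```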









[GitHub] [arrow] andygrove commented on a change in pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet

2021-01-05 Thread GitBox


andygrove commented on a change in pull request #9099:
URL: https://github.com/apache/arrow/pull/9099#discussion_r552010480



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -256,16 +257,21 @@ impl ExecutionPlan for ParquetExec {
 let projection = self.projection.clone();
 let batch_size = self.batch_size;
 
-thread::spawn(move || {
-if let Err(e) = read_files(
+task::spawn_blocking(move || {
+read_files(
 &filenames,
 projection.clone(),
 batch_size,
 response_tx.clone(),
-) {
-println!("Parquet reader thread terminated due to error: 
{:?}", e);
-}
-});
+)
+})
+.await

Review comment:
   I don't think we want to `await` here for the task to complete?









[GitHub] [arrow] andygrove commented on a change in pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet

2021-01-05 Thread GitBox


andygrove commented on a change in pull request #9099:
URL: https://github.com/apache/arrow/pull/9099#discussion_r552012528



##
File path: rust/datafusion/src/physical_plan/parquet.rs
##
@@ -256,16 +257,21 @@ impl ExecutionPlan for ParquetExec {
 let projection = self.projection.clone();
 let batch_size = self.batch_size;
 
-thread::spawn(move || {
-if let Err(e) = read_files(
+task::spawn_blocking(move || {
+read_files(
 &filenames,
 projection.clone(),
 batch_size,
 response_tx.clone(),
-) {
-println!("Parquet reader thread terminated due to error: 
{:?}", e);
-}
-});
+)
+})
+.await

Review comment:
   I think this will lead to deadlock because the channel's buffer will 
fill up and the reader won't be able to push new batches. The caller can't read 
any batches from the channel until this task completes?
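
   The deadlock concern can be demonstrated with a bounded queue and a thread standing in for the reader task (a sketch of the failure mode, not DataFusion code — here draining happens in the same thread, so the stall resolves; in the PR the caller cannot drain until the awaited task returns):

```python
import queue
import threading

def producer(ch, n):
    for i in range(n):
        ch.put(i)  # blocks once the bounded channel is full

ch = queue.Queue(maxsize=2)   # stands in for the bounded response channel
t = threading.Thread(target=producer, args=(ch, 5), daemon=True)
t.start()

# Awaiting the producer before reading anything (the `.await` in the diff)
# stalls: the channel fills up and the producer cannot make progress.
t.join(timeout=0.2)
assert t.is_alive()

# Draining the channel (what the caller would normally do concurrently)
# releases the producer.
received = [ch.get() for _ in range(5)]
t.join(timeout=5)
assert not t.is_alive()
assert received == [0, 1, 2, 3, 4]
```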









[GitHub] [arrow] HedgehogCode commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-05 Thread GitBox


HedgehogCode commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-754727763


   When I use the changes and try to compress and decompress an empty buffer 
(by using a variable sized vector with only missing values) I get a SIGSEGV:
   ```
   #
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x7f1209448fd0, pid=10504, tid=0x7f1208bc7700
   #
   # JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 
1.8.0_275-b01)
   # Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode linux-amd64 
compressed oops)
   # Problematic frame:
   # V  [libjvm.so+0x6fdfd0]  jni_ThrowNew+0xc0
   #
   # Core dump written. Default location: /workspaces/arrow/java/vector/core or 
core.10504
   ```
   
   This can be reproduced by adding the following test to 
`TestCompressionCodec.java`:
   ```java
 @Test
 public void testEmptyBuffer() throws Exception {
   final int vecLength = 10;
   final VarBinaryVector origVec = new VarBinaryVector("vec", allocator);
   
   origVec.allocateNew(vecLength);
   
   // Do not set any values (all missing)
   origVec.setValueCount(vecLength);
   
   final List origBuffers = origVec.getFieldBuffers();
   final List compressedBuffers = compressBuffers(origBuffers);
   final List decompressedBuffers = 
deCompressBuffers(compressedBuffers);
   
   // TODO assert that the decompressed buffers are correct
   AutoCloseables.close(decompressedBuffers);
 }
   ```
   This looks like an error in the lz4-java library but I am not sure. I 
thought I should mention it here first.
   (Note that I am using OpenJDK 8 and I haven't tried OpenJDK 11 yet)







[GitHub] [arrow] HedgehogCode edited a comment on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-01-05 Thread GitBox


HedgehogCode edited a comment on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-754727763


   When I use the changes and try to compress and decompress an empty buffer 
(by using a variable sized vector with only missing values) I get a SIGSEGV 
([hs_err_pid10504.log](https://github.com/apache/arrow/files/5771248/hs_err_pid10504.log)):
   ```
   #
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x7f1209448fd0, pid=10504, tid=0x7f1208bc7700
   #
   # JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 
1.8.0_275-b01)
   # Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode linux-amd64 
compressed oops)
   # Problematic frame:
   # V  [libjvm.so+0x6fdfd0]  jni_ThrowNew+0xc0
   #
   # Core dump written. Default location: /workspaces/arrow/java/vector/core or 
core.10504
   ```
   
   This can be reproduced by adding the following test to 
`TestCompressionCodec.java`:
   ```java
 @Test
 public void testEmptyBuffer() throws Exception {
   final int vecLength = 10;
   final VarBinaryVector origVec = new VarBinaryVector("vec", allocator);
   
   origVec.allocateNew(vecLength);
   
   // Do not set any values (all missing)
   origVec.setValueCount(vecLength);
   
   final List origBuffers = origVec.getFieldBuffers();
   final List compressedBuffers = compressBuffers(origBuffers);
   final List decompressedBuffers = 
deCompressBuffers(compressedBuffers);
   
   // TODO assert that the decompressed buffers are correct
   AutoCloseables.close(decompressedBuffers);
 }
   ```
   This looks like an error in the lz4-java library but I am not sure. I 
thought I should mention it here first.
   (Note that I am using OpenJDK 8 and I haven't tried OpenJDK 11 yet)







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-05 Thread GitBox


jorisvandenbossche commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r552030630



##
File path: cpp/src/arrow/dataset/scanner.h
##
@@ -62,10 +63,7 @@ class ARROW_DS_EXPORT ScanOptions {
   std::shared_ptr ReplaceSchema(std::shared_ptr schema) 
const;
 
   // Filter
-  std::shared_ptr filter = scalar(true);
-
-  // Evaluator for Filter
-  std::shared_ptr evaluator;
+  Expression filter2 = literal(true);

Review comment:
   This still needs to be renamed back?









[GitHub] [arrow] bkietz commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-05 Thread GitBox


bkietz commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r552038470



##
File path: cpp/src/arrow/dataset/expression.cc
##
@@ -0,0 +1,1177 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/expression.h"
+
+#include 
+#include 
+
+#include "arrow/chunked_array.h"
+#include "arrow/compute/exec_internal.h"
+#include "arrow/dataset/expression_internal.h"
+#include "arrow/io/memory.h"
+#include "arrow/ipc/reader.h"
+#include "arrow/ipc/writer.h"
+#include "arrow/util/key_value_metadata.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/optional.h"
+#include "arrow/util/string.h"
+#include "arrow/util/value_parsing.h"
+
+namespace arrow {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+namespace dataset {
+
+Expression::Expression(Call call) : 
impl_(std::make_shared(std::move(call))) {}
+
+Expression::Expression(Datum literal)
+: impl_(std::make_shared(std::move(literal))) {}
+
+Expression::Expression(Parameter parameter)
+: impl_(std::make_shared(std::move(parameter))) {}
+
+Expression literal(Datum lit) { return Expression(std::move(lit)); }
+
+Expression field_ref(FieldRef ref) {
+  return Expression(Expression::Parameter{std::move(ref), {}});
+}
+
+Expression call(std::string function, std::vector arguments,
+std::shared_ptr options) {
+  Expression::Call call;
+  call.function_name = std::move(function);
+  call.arguments = std::move(arguments);
+  call.options = std::move(options);
+  return Expression(std::move(call));
+}
+
+const Datum* Expression::literal() const { return 
util::get_if(impl_.get()); }
+
+const FieldRef* Expression::field_ref() const {
+  if (auto parameter = util::get_if(impl_.get())) {
+return ¶meter->ref;
+  }
+  return nullptr;
+}
+
+const Expression::Call* Expression::call() const {
+  return util::get_if(impl_.get());
+}
+
+ValueDescr Expression::descr() const {
+  if (impl_ == nullptr) return {};
+
+  if (auto lit = literal()) {
+return lit->descr();
+  }
+
+  if (auto parameter = util::get_if(impl_.get())) {
+return parameter->descr;
+  }
+
+  return CallNotNull(*this)->descr;
+}
+
+std::string Expression::ToString() const {
+  if (auto lit = literal()) {
+if (lit->is_scalar()) {
+  switch (lit->type()->id()) {
+case Type::STRING:
+case Type::LARGE_STRING:
+  return '"' +
+ 
Escape(util::string_view(*lit->scalar_as().value)) +
+ '"';
+
+case Type::BINARY:
+case Type::FIXED_SIZE_BINARY:
+case Type::LARGE_BINARY:
+  return '"' + lit->scalar_as().value->ToHexString() 
+ '"';
+
+default:
+  break;
+  }
+  return lit->scalar()->ToString();
+}
+return lit->ToString();
+  }
+
+  if (auto ref = field_ref()) {
+if (auto name = ref->name()) {
+  return *name;
+}
+if (auto path = ref->field_path()) {
+  return path->ToString();
+}
+return ref->ToString();
+  }
+
+  auto call = CallNotNull(*this);
+  auto binary = [&](std::string op) {
+return "(" + call->arguments[0].ToString() + " " + op + " " +
+   call->arguments[1].ToString() + ")";
+  };
+
+  if (auto cmp = Comparison::Get(call->function_name)) {
+return binary(Comparison::GetOp(*cmp));
+  }
+
+  constexpr util::string_view kleene = "_kleene";
+  if (util::string_view{call->function_name}.ends_with(kleene)) {
+auto op = call->function_name.substr(0, call->function_name.size() - 
kleene.size());
+return binary(std::move(op));
+  }
+
+  if (auto options = GetStructOptions(*call)) {
+std::string out = "{";
+auto argument = call->arguments.begin();
+for (const auto& field_name : options->field_names) {
+  out += field_name + "=" + argument++->ToString() + ", ";
+}
+out.resize(out.size() - 1);
+out.back() = '}';
+return out;
+  }
+
+  std::string out = call->function_name + "(";
+  for (const auto& arg : call->arguments) {
+out += arg.ToString() + ", ";
+  }
+
+  if (call->options == nullptr) {
+out.resize(out.size() - 1);
+out.back() = ')';
+return out;
+  }
+
+  if (auto options = GetSetLookup

[GitHub] [arrow] westonpace commented on a change in pull request #9100: ARROW-11067: [C++] read_csv_arrow silently fails to read some strings and returns nulls

2021-01-05 Thread GitBox


westonpace commented on a change in pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#discussion_r552044312



##
File path: cpp/src/arrow/util/trie.h
##
@@ -125,6 +126,9 @@ class ARROW_EXPORT Trie {
   int32_t Find(util::string_view s) const {
 const Node* node = &nodes_[0];
 fast_index_type pos = 0;
+if (s.length() > static_cast(kMaxIndex)) {

Review comment:
   @pitrou One small difference from your original patch.  This changed 
from >= to >.  Could you review that?  My test of the boundary cases considered 
a string of length 2^n-1 as valid and so I either needed to remove that case or 
change the comparison (and I chose the latter).









[GitHub] [arrow] westonpace commented on a change in pull request #9100: ARROW-11067: [C++] read_csv_arrow silently fails to read some strings and returns nulls

2021-01-05 Thread GitBox


westonpace commented on a change in pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#discussion_r552047298



##
File path: cpp/src/arrow/util/trie_test.cc
##
@@ -175,6 +175,15 @@ TEST(Trie, EmptyString) {
   ASSERT_EQ(-1, trie.Find("x"));
 }
 
+TEST(Trie, LongString) {
+  TrieBuilder builder;
+  ASSERT_OK(builder.Append(""));
+  const Trie trie = builder.Finish();
+  auto maxlen = static_cast(std::numeric_limits::max()) + 1;

Review comment:
   I did this and split the cases into good cases and "good & bad" cases.  
For the good cases I confirmed we can insert and match.  For the "good & bad" 
cases I confirmed we don't match the empty string.  I tested on Mac and the 
test failed (it matched the empty string).  I then added your patch and it 
passed (well, it failed because of the >/>= issue but passed after that).
   
   I'm testing 2^15-2, 2^15-1, 2^14 (good) and 2^15, 2^16-2 (bad).  Do you 
think that covers it?
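
   The `>` vs `>=` distinction can be checked against those boundary cases with a small model of the guard (the concrete `K_MAX_INDEX` value is an assumption here, chosen as if `fast_index_type` were a 16-bit signed index):

```python
K_MAX_INDEX = (1 << 15) - 1  # assumed 16-bit signed index limit

def find_accepts(length, strict_greater=True):
    # Models the length guard added to Trie::Find.
    if strict_greater:
        return not (length > K_MAX_INDEX)   # 2**15 - 1 is still valid
    return not (length >= K_MAX_INDEX)      # would wrongly reject 2**15 - 1

assert find_accepts(2**15 - 2)   # good case
assert find_accepts(2**15 - 1)   # good case — only with `>`
assert find_accepts(2**14)       # good case
assert not find_accepts(2**15)       # bad case
assert not find_accepts(2**16 - 2)   # bad case
```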









[GitHub] [arrow] github-actions[bot] commented on pull request #8915: ARROW-10904: [Python] Add support for Python 3.9 macOS wheels

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #8915:
URL: https://github.com/apache/arrow/pull/8915#issuecomment-754749536


   Revision: 6f13c4c268be10a8c0898d9e297b0303c35dc11f
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-823](https://github.com/ursa-labs/crossbow/branches/all?query=actions-823)
   
   |Task|Status|
   ||--|
   |wheel-osx-high-sierra-cp39|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-823-github-wheel-osx-high-sierra-cp39)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-823-github-wheel-osx-high-sierra-cp39)|
   |wheel-osx-mavericks-cp39|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-823-github-wheel-osx-mavericks-cp39)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-823-github-wheel-osx-mavericks-cp39)|







[GitHub] [arrow] bu2 commented on a change in pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on a change in pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#discussion_r552049108



##
File path: cpp/src/arrow/compute/kernels/common.h
##
@@ -19,6 +19,7 @@
 
 // IWYU pragma: begin_exports
 
+#include 

Review comment:
   Good point. Moving the include in scalar_validity.cc.









[GitHub] [arrow] pitrou commented on a change in pull request #9105: ARROW-11009: [C++] Allow changing default memory pool with an environment variable

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9105:
URL: https://github.com/apache/arrow/pull/9105#discussion_r552053668



##
File path: cpp/src/arrow/dataset/filter.cc
##
@@ -704,28 +704,38 @@ using arrow::internal::JoinStrings;
 
 std::string AndExpression::ToString() const {
   return JoinStrings(
-  {"(", left_operand_->ToString(), " and ", right_operand_->ToString(), 
")"}, "");
+  std::vector{"(", left_operand_->ToString(), " and ",
+ right_operand_->ToString(), ")"},
+  "");

Review comment:
   @bkietz Do tell if this is too verbose for your taste (to be honest, I'm 
not sure why `JoinStrings` is used for simple concatenation here).









[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-05 Thread GitBox


jorisvandenbossche commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r552053599



##
File path: python/pyarrow/tests/parquet/test_dataset.py
##
@@ -509,7 +509,7 @@ def test_filters_invalid_column(tempdir, 
use_legacy_dataset):
 
 _generate_partition_directories(fs, base_path, partition_spec, df)
 
-msg = "Field named 'non_existent_column' not found"
+msg = r"No match for FieldRef.Name\(non_existent_column\)"

Review comment:
   Doesn't necessarily need to happen in this PR, but: I personally don't 
find this a very user-friendly error message with "FieldRef.Name" in it (I 
think the previous was more readable)









[GitHub] [arrow] nealrichardson commented on a change in pull request #8894: ARROW-10322: [C++][Dataset] Minimize Expression

2021-01-05 Thread GitBox


nealrichardson commented on a change in pull request #8894:
URL: https://github.com/apache/arrow/pull/8894#discussion_r552055128



##
File path: python/pyarrow/tests/parquet/test_dataset.py
##
@@ -509,7 +509,7 @@ def test_filters_invalid_column(tempdir, 
use_legacy_dataset):
 
 _generate_partition_directories(fs, base_path, partition_spec, df)
 
-msg = "Field named 'non_existent_column' not found"
+msg = r"No match for FieldRef.Name\(non_existent_column\)"

Review comment:
   +1









[GitHub] [arrow] bu2 commented on pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


bu2 commented on pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#issuecomment-754755701


   PR rebased!







[GitHub] [arrow] pitrou opened a new pull request #9105: ARROW-11009: [C++] Allow changing default memory pool with an environment variable

2021-01-05 Thread GitBox


pitrou opened a new pull request #9105:
URL: https://github.com/apache/arrow/pull/9105


   ARROW_DEFAULT_MEMORY_POOL can take the name of the desired memory pool 
backend
   ('jemalloc', 'mimalloc', 'system').
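
   A minimal sketch of how such environment-driven selection might behave (the backend names and variable name come from the PR description; the resolver itself and its 'system' fallback are hypothetical illustrations, not Arrow's C++ implementation):

   ```python
   import os

   # Backend names from the PR description; the resolution logic below is
   # an assumption for illustration, not Arrow's actual behavior.
   SUPPORTED_BACKENDS = ("jemalloc", "mimalloc", "system")

   def resolve_default_pool(env=None):
       """Return the backend named by ARROW_DEFAULT_MEMORY_POOL.

       Falls back to 'system' when unset (illustrative default only).
       """
       env = os.environ if env is None else env
       name = env.get("ARROW_DEFAULT_MEMORY_POOL", "system")
       if name not in SUPPORTED_BACKENDS:
           raise ValueError("unsupported memory pool backend: %r" % name)
       return name
   ```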







[GitHub] [arrow] terencehonles commented on pull request #8386: ARROW-10224: [Python] Add support for Python 3.9 except macOS wheel and Windows wheel

2021-01-05 Thread GitBox


terencehonles commented on pull request #8386:
URL: https://github.com/apache/arrow/pull/8386#issuecomment-754763790


   > @terencehonles Thanks a lot. I installed pyarrow successfully.
   > 
   > I was just curious that when I type `pip install pyarrow` the error log 
shows that it was stopped during install numpy, which is already installed in 
my environment (maybe the version is too advanced?).
   > 
   > I thought that figure it out may help me to understand deeper about python 
package installing process.
   
   I'm assuming you mean trying to install the most recently released pyarrow 
before you installed via crossbow? That's because pyarrow has a source 
distribution and you're trying to build from source. [Pip will not have 
installed numpy at install 
time](https://pip.pypa.io/en/stable/reference/pip/#pep-517-and-518-support) and 
is working to make isolated installs the default. If you have numpy already 
installed, the version you have _may_ be an issue, but I think it's more likely 
that it's too old rather than too new. You can try using the flag `--no-use-pep517` when 
running `pip install pyarrow` to see if your system has all the dependencies 
needed to build pyarrow (likely not, and that's why the CI is building it). 
Optionally [you can make pip more 
verbose](https://pip.pypa.io/en/stable/reference/pip/#cmdoption-v) with `-v`.
   
   pyarrow is probably not the easiest package to learn Python's build process 
on, but you should read [their 
documentation](https://arrow.apache.org/docs/developers/python.html#python-development)
 if you would like to actually be able to build it from source.







[GitHub] [arrow] pitrou commented on a change in pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#discussion_r552071145



##
File path: cpp/src/arrow/util/trie.h
##
@@ -125,6 +126,9 @@ class ARROW_EXPORT Trie {
   int32_t Find(util::string_view s) const {
 const Node* node = &nodes_[0];
 fast_index_type pos = 0;
+if (s.length() > static_cast(kMaxIndex)) {

Review comment:
   Nice, thank you!









[GitHub] [arrow] pitrou commented on a change in pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


pitrou commented on a change in pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#discussion_r552071844



##
File path: cpp/src/arrow/util/trie_test.cc
##
@@ -175,6 +175,15 @@ TEST(Trie, EmptyString) {
   ASSERT_EQ(-1, trie.Find("x"));
 }
 
+TEST(Trie, LongString) {
+  TrieBuilder builder;
+  ASSERT_OK(builder.Append(""));
+  const Trie trie = builder.Finish();
+  auto maxlen = static_cast(std::numeric_limits::max()) + 1;

Review comment:
   Yes, thank you.









[GitHub] [arrow] pitrou commented on pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


pitrou commented on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754772279


   @westonpace is Github Actions enabled on your fork?







[GitHub] [arrow] bu2 commented on pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-05 Thread GitBox


bu2 commented on pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#issuecomment-754778829


   @jorisvandenbossche: Thank you for bringing up 
[ARROW-9430](https://issues.apache.org/jira/browse/ARROW-9430).
   
   Then @all please tell me what would be a good name.
   
   I may have to rework the code a bit. As mentioned in the top description, I 
started from the fill_null kernel code and the "null handling" logic came as an 
afterthought, so the code might look a bit convoluted right now...







[GitHub] [arrow] pitrou commented on pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-05 Thread GitBox


pitrou commented on pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#issuecomment-754786007


   "setitem" is confusing to me (I would expect something where you pass 
indices and values to set at those indices). "select" is used in 
[Numpy](https://numpy.org/doc/stable/reference/generated/numpy.select.html). 
Another possibility is "cond".







[GitHub] [arrow] github-actions[bot] commented on pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754789122


   https://issues.apache.org/jira/browse/ARROW-11067







[GitHub] [arrow] Dandandan commented on pull request #9099: ARROW-11129: [Rust][DataFusion] Use tokio for loading parquet

2021-01-05 Thread GitBox


Dandandan commented on pull request #9099:
URL: https://github.com/apache/arrow/pull/9099#issuecomment-754789551


   Locally it seems to work now, thanks @andygrove. The CI jobs seem to have been 
queued for quite some time.







[GitHub] [arrow] westonpace commented on pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


westonpace commented on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754793203


   @pitrou I'm pretty sure it is.  Is there something I need to check?







[GitHub] [arrow] pitrou commented on pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


pitrou commented on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754794793


   @westonpace MinGW Windows builds I think.







[GitHub] [arrow] pitrou edited a comment on pull request #9100: ARROW-11067: [C++] Fix CSV null detection on large values

2021-01-05 Thread GitBox


pitrou edited a comment on pull request #9100:
URL: https://github.com/apache/arrow/pull/9100#issuecomment-754794793


   @westonpace C++ MinGW Windows builds I think.







[GitHub] [arrow] pitrou commented on pull request #8805: ARROW-10725: [Python][Compute] Expose sort options in Python bindings

2021-01-05 Thread GitBox


pitrou commented on pull request #8805:
URL: https://github.com/apache/arrow/pull/8805#issuecomment-754816122


   @jorisvandenbossche Is this ready for review again?







[GitHub] [arrow] github-actions[bot] commented on pull request #9105: ARROW-11009: [C++] Allow changing default memory pool with an environment variable

2021-01-05 Thread GitBox


github-actions[bot] commented on pull request #9105:
URL: https://github.com/apache/arrow/pull/9105#issuecomment-754816274


   https://issues.apache.org/jira/browse/ARROW-11009







[GitHub] [arrow] jorisvandenbossche commented on pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#issuecomment-754822330


   > "setitem" is confusing to me (I would expect something where you pass 
indices and values to set at those indices).
   
   That's https://issues.apache.org/jira/browse/ARROW-9431. 
   There could be a variant of "setitem" that takes integer indices, and one 
that takes a boolean mask?
   
   But it's true that for a "setitem" kernel, you would expect the "values to 
set" (`Replacement` in the example in the top post) to be of the length of the 
number of values that get set, not of the same length as the original array.
   
   So what this PR proposes is actually more of a "where" kernel (using numpy's 
terminology), for which we have 
https://issues.apache.org/jira/browse/ARROW-10640. 
   There is some discussion on that PR that this can be generalized to 
something like numpy's `np.choose`. And the `np.select`  mentioned by @pitrou 
is quite related to that as well (the exact semantics are a bit different 
between both: `select` uses boolean conditions to choose between the arrays, 
`choose` a single array of indices into the choice arrays)
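
   To make those semantic differences concrete, here is a pure-Python sketch of the three variants (toy list-based functions mirroring numpy's `where`/`select`/`choose` semantics; they are illustrations, not Arrow kernels):

   ```python
   # Toy, list-based illustrations of the three related kernels.

   def where(cond, a, b):
       """Element-wise: pick a[i] where cond[i] is true, else b[i]."""
       return [x if c else y for c, x, y in zip(cond, a, b)]

   def select(conditions, choices, default=None):
       """The first condition that holds at i picks choices[k][i]."""
       out = []
       for i in range(len(choices[0])):
           for cond, choice in zip(conditions, choices):
               if cond[i]:
                   out.append(choice[i])
                   break
           else:
               out.append(default)
       return out

   def choose(indices, choices):
       """indices[i] names which choice array supplies element i."""
       return [choices[k][i] for i, k in enumerate(indices)]
   ```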







[GitHub] [arrow] pitrou closed pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


pitrou closed pull request #9023:
URL: https://github.com/apache/arrow/pull/9023


   







[GitHub] [arrow] jorisvandenbossche commented on pull request #8805: ARROW-10725: [Python][Compute] Expose sort options in Python bindings

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #8805:
URL: https://github.com/apache/arrow/pull/8805#issuecomment-754839607


   I didn't yet update it regarding the discussion on the keyword arguments API 
above at https://github.com/apache/arrow/pull/8805#discussion_r542440135
   
   We probably want to support the current "verbose" way of writing the options 
anyway? In which case it might be fine to defer a more ergonomic API to a 
follow-up JIRA (since this needs to get in this week)
   
   Alternatively, I could add an `order` keyword on the python side (as 
mentioned in https://github.com/apache/arrow/pull/8805#discussion_r542510825), 
and then there translate something like `sort_keys=["col1", "col2"], 
order="descending"` to `sort_keys=[("col1", "descending"), ("col2", 
"descending")]` passed to the C++ options. In that case, you only need to full 
verbose API when wanting to sort with multiple columns with a different 
ascending/descending, and thus might already help for many common cases.
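
   The translation described above could look like this hypothetical helper (`expand_sort_keys` is an illustrative name, not part of the PR):

   ```python
   def expand_sort_keys(sort_keys, order="ascending"):
       """Normalize plain column names to (name, order) pairs.

       Illustrative sketch of the Python-side translation discussed above;
       tuples that already carry an explicit order are passed through.
       """
       return [key if isinstance(key, tuple) else (key, order)
               for key in sort_keys]
   ```

   With it, `expand_sort_keys(["col1", "col2"], order="descending")` yields `[("col1", "descending"), ("col2", "descending")]`, the fully verbose form the C++ options expect.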







[GitHub] [arrow] nevi-me closed pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

2021-01-05 Thread GitBox


nevi-me closed pull request #9093:
URL: https://github.com/apache/arrow/pull/9093


   







[GitHub] [arrow] pitrou commented on pull request #9024: ARROW-11044: [C++] Add "replace" kernel

2021-01-05 Thread GitBox


pitrou commented on pull request #9024:
URL: https://github.com/apache/arrow/pull/9024#issuecomment-754836583


   "if_else", "choose", "select" are all fine with me. 







[GitHub] [arrow] jorisvandenbossche commented on pull request #9023: ARROW-11043: [C++] Add "is_nan" kernel

2021-01-05 Thread GitBox


jorisvandenbossche commented on pull request #9023:
URL: https://github.com/apache/arrow/pull/9023#issuecomment-754846677


   Thanks a lot @bu2 !







[GitHub] [arrow] pitrou commented on pull request #9097: ARROW-10881: [C++] Fix EXC_BAD_ACCESS in PutSpaced

2021-01-05 Thread GitBox


pitrou commented on pull request #9097:
URL: https://github.com/apache/arrow/pull/9097#issuecomment-754828459


   CI tests are green on my fork (except for the usual Homebrew dependency 
install failures).






