[GitHub] [arrow] emkornfield commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


emkornfield commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-649303617


   Travis CI seems flaky.  My branch: 
https://travis-ci.org/github/emkornfield/arrow/builds/701926099



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-25 Thread GitBox


rymurr commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r445373657



##
File path: java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
##
@@ -78,6 +77,55 @@ public Object run() {
   Field addressField = java.nio.Buffer.class.getDeclaredField("address");
   addressField.setAccessible(true);
   BYTE_BUFFER_ADDRESS_OFFSET = UNSAFE.objectFieldOffset(addressField);
+
+  Constructor<?> directBufferConstructor;
+  long address = -1;
+  final ByteBuffer direct = ByteBuffer.allocateDirect(1);
+  try {
+
+final Object maybeDirectBufferConstructor =
+AccessController.doPrivileged(new PrivilegedAction<Object>() {
+  @Override
+  public Object run() {
+try {
+  final Constructor<?> constructor =
+  direct.getClass().getDeclaredConstructor(long.class, int.class);
+  constructor.setAccessible(true);
+  return constructor;
+} catch (NoSuchMethodException e) {
+  return e;
+} catch (SecurityException e) {
+  return e;
+}
+  }
+});
+
+if (maybeDirectBufferConstructor instanceof Constructor) {
+  address = UNSAFE.allocateMemory(1);
+  // try to use the constructor now
+  try {
+((Constructor<?>) maybeDirectBufferConstructor).newInstance(address, 1);
+directBufferConstructor = (Constructor<?>) maybeDirectBufferConstructor;
+logger.debug("direct buffer constructor: available");
+  } catch (InstantiationException e) {
+directBufferConstructor = null;

Review comment:
   done

##
File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -31,19 +28,18 @@
 
   public final long chunkSize;
 
+  /**
+   * default chunk size from Netty.
+   */
+  private static final long DEFAULT_CHUNK_SIZE = 16777216L;

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7395: ARROW-9089: [Python] A PyFileSystem handler for fsspec-based filesystems

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7395:
URL: https://github.com/apache/arrow/pull/7395#issuecomment-649350542


   I am going to merge this, as I need it for some follow-ups. But if there are 
more comments, happy to address them in a follow-up!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-25 Thread GitBox


rymurr commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r445373746



##
File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -31,19 +28,18 @@
 
   public final long chunkSize;
 
+  /**
+   * default chunk size from Netty.
+   */
+  private static final long DEFAULT_CHUNK_SIZE = 16777216L;

Review comment:
   I ended up taking the implementation from Netty rather than the static 
value here. I thought it was a bit safer.
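
   For reference, the static value corresponds to Netty's computed default of
`pageSize << maxOrder`; a quick check, assuming Netty's documented defaults of
`pageSize = 8192` and `maxOrder = 11`:

   ```python
   # Netty derives its default chunk size as pageSize << maxOrder.
   # 8192 and 11 are Netty's documented defaults (assumed here).
   page_size = 8192
   max_order = 11
   assert page_size << max_order == 16777216  # the 16 MiB constant quoted above
   ```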





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-25 Thread GitBox


rymurr commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-649372476


   Thanks for the reviews @liyafan82 I have updated based on your comments. 
Please let me know what you think!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] idmitrievsky opened a new issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox


idmitrievsky opened a new issue #7540:
URL: https://github.com/apache/arrow/issues/7540


   Hello!
   
   There is a requirement that, for a `to_numpy()` conversion to be zero-copy,
the arrow array must not contain any nulls. If I understand correctly, Arrow
doesn't specify the values in slots that are marked as `null` in the
corresponding bitmap.
   
   Is it possible to write numpy-specific `nan` values in the places that are
`null` in the arrow buffer, so that no copying is required upon conversion?
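
   For illustration, a minimal sketch of the behavior in question (this assumes
pyarrow's `zero_copy_only` flag on `Array.to_numpy` and its `ArrowInvalid`
error):

   ```python
   import pyarrow as pa

   arr = pa.array([1.0, None, 3.0])

   # With a null present, the zero-copy path refuses the conversion.
   try:
       arr.to_numpy(zero_copy_only=True)
   except pa.ArrowInvalid as exc:
       print(exc)

   # The copying path substitutes NaN for the null slot.
   print(arr.to_numpy(zero_copy_only=False))  # [ 1. nan  3.]
   ```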



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445392036



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,37 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
     // TODO (hatemhelal): Consider refactoring this to share logic with ReadTable as this
     // does not currently honor the use_threads option.
     std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-    for (size_t i = 0; i < field_readers_.size(); ++i) {
-      RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-      if (columns[i]->num_chunks() > 1) {
-        return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+    int64_t num_rows = -1;
+
+    if (columns.empty()) {
+      num_rows = std::min(batch_size_, *row_group_remaining_size_);

Review comment:
   might be good to add a comment here why this is being done

##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1605,3 +1605,21 @@ def test_dataset_schema_metadata(tempdir):
 assert b"pandas" in schema.metadata
 # ensure it is still there in a projected schema (with column selection)
 assert schema.equals(projected_schema, check_metadata=True)
+
+
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_dataset_project_only_partition_columns(tempdir):
+    # ARROW-8729
+    import pyarrow.parquet as pq
+
+    table = pa.table({'part': 'a a b b'.split(), 'col': list(range(4))})
+
+    path = str(tempdir / 'test_dataset')
+    pq.write_to_dataset(table, path, partition_cols=['part'])
+    dataset = ds.dataset(path, partitioning='hive')
+
+    all_cols = dataset.to_table(use_threads=False)
+    part_only = dataset.to_table(columns=['part'], use_threads=False)
+
+    assert all_cols.column('part') == part_only.column('part')

Review comment:
   ```suggestion
    assert all_cols.column('part').equals(part_only.column('part'))
   ```
   
   `==` is now element-wise, and the "truthiness" of a boolean chunked array is
always True, regardless of the values in it (so as long as both tables have a
"part" column, the above check would always return True even if the columns
themselves are not equal).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 opened a new pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox


cyb70289 opened a new pull request #7541:
URL: https://github.com/apache/arrow/pull/7541


   `archery benchmark diff` runs "git clone --local" to clone the source code
to /tmp. If /tmp and the source directory are not on the same disk volume, it
fails with the error message below, because "git clone --local" uses hard
links, which only work within a single volume. This patch retries the copy
with the "--no-hardlinks" option on failure.
   
   Error log:
   fatal: failed to create link 
'/tmp/arrow-archery-lwm7k6wm/origin_master/arrow/.git/objects/2d/80a4d2da431dbfe3c736b6349eb4a0bcb6ece5':
 Invalid cross-device link
   subprocess.CalledProcessError: Command '['git', 'clone', '--local', 
PosixPath('/home/cyb/arrow'), 
'/tmp/arrow-archery-lwm7k6wm/origin_master/arrow']' returned non-zero exit 
status 128.
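
   A minimal sketch of the retry logic described above (hypothetical helper,
not the actual Archery code):

   ```python
   import subprocess

   def clone_local(source, target):
       # Fast path: "git clone --local" hardlinks objects, which only works
       # when source and target live on the same filesystem.
       try:
           subprocess.run(["git", "clone", "--local", str(source), str(target)],
                          check=True)
       except subprocess.CalledProcessError:
           # Hardlinking failed (e.g. cross-device); copy the objects instead.
           subprocess.run(["git", "clone", "--local", "--no-hardlinks",
                           str(source), str(target)], check=True)
   ```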



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7541:
URL: https://github.com/apache/arrow/pull/7541#issuecomment-649403800


   https://issues.apache.org/jira/browse/ARROW-9224



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche closed pull request #7395: ARROW-9089: [Python] A PyFileSystem handler for fsspec-based filesystems

2020-06-25 Thread GitBox


jorisvandenbossche closed pull request #7395:
URL: https://github.com/apache/arrow/pull/7395


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r445417901



##
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##
@@ -52,8 +55,18 @@ static void AsciiUpper(benchmark::State& state) {
   UnaryStringBenchmark(state, "ascii_upper");
 }
 
+static void Utf8Upper(benchmark::State& state) {
+  UnaryStringBenchmark(state, "utf8_upper", true);

Review comment:
   Note that `RandomArrayGenerator::String` generates an array of pure 
Ascii characters between `A` and `z`. 
   Perhaps we need two sets of benchmarks:
   * one with pure Ascii values
   * one with partly non-Ascii values

##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector<uint32_t> lut_upper_codepoint;
+std::vector<uint32_t> lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+    '?';  // the proper replacement char would be the 0xFFFD codepoint, but that
+          // can increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0xFFFF;  // up to this codepoint is in a lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
+  if (codepoint < 0x80) {
+    *str++ = codepoint;
+  } else if (codepoint < 0x800) {
+    *str++ = 0xC0 + (codepoint >> 6);
+    *str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x10000) {
+    *str++ = 0xE0 + (codepoint >> 12);
+    *str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+    *str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x200000) {
+    *str++ = 0xF0 + (codepoint >> 18);
+    *str++ = 0x80 + ((codepoint >> 12) & 0x3F);
+    *str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+    *str++ = 0x80 + (codepoint & 0x3F);
+  } else {
+    *str++ = codepoint;
+  }
+}
+
+static inline bool utf8_is_continuation(const uint8_t codeunit) {
+  return (codeunit & 0xC0) == 0x80;  // upper two bits should be 10
+}
+
+static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) {
+  if (*str < 0x80) {  //
+    length -= 1;
+    return *str++;
+  } else if (*str < 0xC0) {  // invalid non-ascii char
+    length -= 1;
+    str++;
+    return REPLACEMENT_CHAR;

Review comment:
   Hmm, I don't really agree with this... If there's some invalid input, we 
should bail out with `Status::Invalid`, IMO.

##
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##
@@ -41,7 +42,9 @@ static void UnaryStringBenchmark(benchmark::State& state, 
const std::string& fun
 ABORT_NOT_OK(CallFunction(func_name, {values}));
   }
   state.SetItemsProcessed(state.iterations() * array_length);
-  state.SetBytesProcessed(state.iterations() * values->data()->buffers[2]->size());
+  state.SetBytesProcessed(state.iterations() *
+                          ((touches_offsets ? values->data()->buffers[1]->size() : 0) +

Review comment:
   Hmm, this looks a bit pedantic and counter-productive to me. We want 
Ascii and UTF8 numbers to be directly comparable, so let's keep the original 
calculation.

##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector<uint32_t> lut_upper_codepoint;
+std::vector<uint32_t> lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+    '?';  // the proper replacement char would be the 0xFFFD codepoint, but that
+          // can increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0xFFFF;  // up to this codepoint is in a lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {

Review comment:
   By the way, all those helper functions (encoding/decoding) may also go 
into `arrow/util/utf8.h`.

##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TEST(TestStringKernels, Utf8Upper32bitGrowth) {

Review comment:
   I'm not convinced it's a good idea to consume more than 2GB RAM in a 
test. We have something called `LARGE_MEMORY_TEST` for such tests (you can grep 
for it), though I'm not sure they get exercised on a regular basis.

##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -81,5 +147,40 @@ TYPED_TEST(TestStringKernels, 
StrptimeDoesNotProvideDefaultOptions) {
   ASSERT_RAISES(Invalid, CallFunction("strptime", {input}));
 }
 
+TEST(TestStringKernels, UnicodeLibraryAssumptions) {
+  uint8_t output[4];
+  for (utf8proc_int32_t codepoint = 0x100; codepoint < 0x110000; codepoint++) {
+    utf8proc_ssize_t encoded_nbytes = utf8proc_encode_char(codepoint, output);
+    utf8proc_int32_t codepoint_upper = utf8proc_toupper(codepoint);
+    utf8proc_ssize_t encoded_nbytes_upper = utf8proc_encode_char(codepoint_upper, output);

[GitHub] [arrow] cyb70289 opened a new pull request #7542: ARROW-9225: [C++][Compute] Improve counting sort

2020-06-25 Thread GitBox


cyb70289 opened a new pull request #7542:
URL: https://github.com/apache/arrow/pull/7542


   For counting sort, using BitmapReader for null checking performs much
   better than BitBlockCount.
   This patch also drops counting nulls as it's not necessary.
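
   For context, counting sort buckets each value by its offset from the
minimum and prefix-sums the bucket counts to place indices; an illustrative
Python sketch with a validity bitmap (not the C++ kernel itself, and ignoring
the BitmapReader/BitBlockCount distinction this patch is about):

   ```python
   def counting_sort_indices(values, valid, min_val, max_val):
       """Return indices sorting `values`; entries with valid[i] == 0 go last."""
       counts = [0] * (max_val - min_val + 2)
       for i, v in enumerate(values):
           if valid[i]:
               counts[v - min_val + 1] += 1
       for j in range(1, len(counts)):  # prefix sum: start offset per value
           counts[j] += counts[j - 1]
       out, nulls = [0] * len(values), []
       for i, v in enumerate(values):
           if valid[i]:
               out[counts[v - min_val]] = i
               counts[v - min_val] += 1
           else:
               nulls.append(i)
       return out[:len(values) - len(nulls)] + nulls

   # counting_sort_indices([3, 1, 2, 2], [1, 1, 0, 1], 1, 3) -> [1, 3, 0, 2]
   ```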



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Improve counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649426603


   **gcc-7.5, Intel Gold 5218**
   
   ```
       benchmark                                           baseline          contender        change %
   9   SortToIndicesInt64Count/32768/10/min_time:1.000     391.232 MiB/sec   945.147 MiB/sec   141.582  {'null_percent': 10.0}
   3   SortToIndicesInt64Count/32768/2/min_time:1.000      336.469 MiB/sec   617.053 MiB/sec    83.391  {'null_percent': 50.0}
   13  SortToIndicesInt64Count/32768/100/min_time:1.000    539.575 MiB/sec   945.134 MiB/sec    75.163  {'null_percent': 1.0}
   14  SortToIndicesInt64Count/32768/0/min_time:1.000      912.462 MiB/sec     1.061 GiB/sec    19.020  {'null_percent': 0.0}
   12  SortToIndicesInt64Count/32768/1/min_time:1.000      874.597 MiB/sec   948.007 MiB/sec     8.394  {'null_percent': 0.01}
   0   SortToIndicesInt64Compare/32768/10/min_time:1.000   124.299 MiB/sec   133.614 MiB/sec     7.494  {'null_percent': 10.0}
   4   SortToIndicesInt64Count/32768/1/min_time:1.000        4.212 GiB/sec     4.441 GiB/sec     5.455  {'null_percent': 100.0}
   1   SortToIndicesInt64Compare/1048576/1/min_time:1.000    4.168 GiB/sec     4.380 GiB/sec     5.103  {'null_percent': 100.0}
   6   SortToIndicesInt64Compare/32768/1/min_time:1.000      4.263 GiB/sec     4.428 GiB/sec     3.885  {'null_percent': 100.0}
   8   SortToIndicesInt64Compare/32768/100/min_time:1.000  121.535 MiB/sec   126.154 MiB/sec     3.800  {'null_percent': 1.0}
   11  SortToIndicesInt64Compare/8388608/1/min_time:1.000    3.833 GiB/sec     3.939 GiB/sec     2.746  {'null_percent': 100.0}
   15  SortToIndicesInt64Count/1048576/1/min_time:1.000      4.189 GiB/sec     4.280 GiB/sec     2.175  {'null_percent': 100.0}
   10  SortToIndicesInt64Compare/32768/0/min_time:1.000    127.065 MiB/sec   129.514 MiB/sec     1.927  {'null_percent': 0.0}
   2   SortToIndicesInt64Compare/32768/2/min_time:1.000    211.863 MiB/sec   212.384 MiB/sec     0.246  {'null_percent': 50.0}
   7   SortToIndicesInt64Count/8388608/1/min_time:1.000      3.881 GiB/sec     3.877 GiB/sec    -0.117  {'null_percent': 100.0}
   5   SortToIndicesInt64Compare/32768/1/min_time:1.000    124.847 MiB/sec   123.486 MiB/sec    -1.090  {'null_percent': 0.01}
   ```
   
   **clang-9, Intel Gold 5218**
   ```
       benchmark                                           baseline          contender         change %
   4   SortToIndicesInt64Count/32768/10/min_time:1.000     419.179 MiB/sec   1019.007 MiB/sec   143.096  {'null_percent': 10.0}
   8   SortToIndicesInt64Count/32768/2/min_time:1.000      355.481 MiB/sec    658.370 MiB/sec    85.206  {'null_percent': 50.0}
   5   SortToIndicesInt64Count/32768/100/min_time:1.000    574.780 MiB/sec      1.018 GiB/sec    81.451  {'null_percent': 1.0}
   2   SortToIndicesInt64Count/32768/0/min_time:1.000      980.677 MiB/sec      1.137 GiB/sec    18.725  {'null_percent': 0.0}
   15  SortToIndicesInt64Count/32768/1/min_time:1.000      939.165 MiB/sec      1.023 GiB/sec    11.535  {'null_percent': 0.01}
   13  SortToIndicesInt64Compare/32768/10/min_time:1.000   137.920 MiB/sec    152.108 MiB/sec    10.288  {'null_percent': 10.0}
   10  SortToIndicesInt64Compare/32768/2/min_time:1.000    221.024 MiB/sec    243.047 MiB/sec     9.964  {'null_percent': 50.0}
   3   SortToIndicesInt64Compare/32768/100/min_time:1.000  132.173 MiB/sec    141.416 MiB/sec     6.993  {'null_percent': 1.0}
   12  SortToIndicesInt64Compare/32768/0/min_time:1.000    141.528 MiB/sec    148.206 MiB/sec     4.719  {'null_percent': 0.0}
   7   SortToIndicesInt64Compare/32768/1/min_time:1.000    135.794 MiB/sec    140.914 MiB/sec     3.770  {'null_percent': 0.01}
   9   SortToIndicesInt64Count/8388608/1/min_time:1.000      3.683 GiB/sec      3.698 GiB/sec     0.421  {'null_percent': 100.0}
   1   SortToIndicesInt64Compare/8388608/1/min_time:1.000    3.694 GiB/sec      3.651 GiB/sec    -1.159  {'null_percent': 100.0}
   0   SortToIndicesInt64Compare/1048576/1/min_time:1.000    3.999 GiB/sec      3.951 GiB/sec    -1.198  {'null_percent': 100.0}
   6   SortToIndicesInt64Count/1048576/1/min_time:1.000      3.999 GiB/sec      3.949 GiB/sec    -1.236  {'null_percent': 100.0}
   11  SortToIndicesInt64Count/32768/1/min_time:1.000        4.096 GiB/sec      4.042 GiB/sec    -1.324  {'null_percent': 100.0}
   14  SortToIndicesInt64Compare/32768/1/min_time:1.000      4.129 GiB/sec      4.018 GiB/sec    -2.686  {'null_percent': 100.0}
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445435253



##
File path: cpp/src/arrow/python/python_to_arrow.cc
##
@@ -1171,6 +1171,12 @@ Status GetConverterFlat(const std::shared_ptr<DataType>& type, bool strict_conversions
 
 Status GetConverter(const std::shared_ptr<DataType>& type, bool from_pandas,
                     bool strict_conversions, std::unique_ptr<SeqConverter>* out) {
+  if (from_pandas) {
+    // ARROW-842: If pandas is not installed then null checks will be less
+    // comprehensive, but that is okay.
+    internal::InitPandasStaticData();

Review comment:
   Hmm... I think we're doing more general PyArrow initialization somewhere 
else. We should certainly not trust that only this code path will need the 
Pandas data.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #7542: ARROW-9225: [C++][Compute] Improve counting sort

2020-06-25 Thread GitBox


ursabot commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649428644


   ```
   no such option: --cc
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Improve counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649428630


   @ursabot benchmark --suite-filter=arrow-compute-vector-sort-benchmark 
--cc=clang-8 --cxx=clang++-8



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445435638



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}
+
+}  // namespace
+
+void InitPandasStaticData() {
+  std::call_once(pandas_static_initialized, GetPandasStaticSymbols);
+}
+
 bool PandasObjectIsNull(PyObject* obj) {
   if (!MayHaveNaN(obj)) {
     return false;
   }
   if (obj == Py_None) {
     return true;
   }
-  if (PyFloat_IsNaN(obj) ||
+  if (PyFloat_IsNaN(obj) || (pandas_NaTType && PyObject_TypeCheck(obj, pandas_NaTType)) ||

Review comment:
   Is `pd.NaT` a singleton? If so, this could be a fast pointer equality 
check.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-25 Thread GitBox


pitrou commented on pull request #7263:
URL: https://github.com/apache/arrow/pull/7263#issuecomment-649429529


   Ok, I'm gonna merge this then. Thank you!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou closed pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-25 Thread GitBox


pitrou closed pull request #7263:
URL: https://github.com/apache/arrow/pull/7263


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Improve counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649429796


   @ursabot benchmark --suite-filter=arrow-compute-vector-sort-benchmark



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445438326



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}
+
+}  // namespace
+
+void InitPandasStaticData() {
+  std::call_once(pandas_static_initialized, GetPandasStaticSymbols);
+}
+
 bool PandasObjectIsNull(PyObject* obj) {
   if (!MayHaveNaN(obj)) {
     return false;
   }
   if (obj == Py_None) {
     return true;
   }
-  if (PyFloat_IsNaN(obj) ||
+  if (PyFloat_IsNaN(obj) || (pandas_NaTType && PyObject_TypeCheck(obj, pandas_NaTType)) ||
Review comment:
   In practice, I think it is a singleton (there is a single value instantiated
in the pandas code base that is used throughout), but I just checked and it is
not guaranteed in `__new__`. So you can actually create another pd.NaT object
with `type(pd.NaT)()` which is not identical to `pd.NaT` (probably something to
change in pandas).
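
   A quick check of what I mean (pandas ~1.0 behavior, as far as I can tell):

   ```python
   import pandas as pd

   nat2 = type(pd.NaT)()   # construct a second NaT instance
   print(nat2 is pd.NaT)   # False: the singleton is not enforced in __new__
   print(pd.isna(nat2))    # True: it still behaves like NaT
   ```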





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox


pitrou commented on pull request #7541:
URL: https://github.com/apache/arrow/pull/7541#issuecomment-649432233


   Can you try using `git clone --shared` instead? It should avoid the copy.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7542: ARROW-9225: [C++][Compute] Improve counting sort

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649432432


   https://issues.apache.org/jira/browse/ARROW-9225



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445438326



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}
+
+}  // namespace
+
+void InitPandasStaticData() {
+  std::call_once(pandas_static_initialized, GetPandasStaticSymbols);
+}
+
 bool PandasObjectIsNull(PyObject* obj) {
   if (!MayHaveNaN(obj)) {
     return false;
   }
   if (obj == Py_None) {
     return true;
   }
-  if (PyFloat_IsNaN(obj) ||
+  if (PyFloat_IsNaN(obj) || (pandas_NaTType && PyObject_TypeCheck(obj, pandas_NaTType)) ||

Review comment:
   In practice, I think it will behave like a singleton (there is a single
value instantiated in the pandas code base that is used throughout), but I just
checked and it is not guaranteed in `__new__`. So you can actually create
another pd.NaT object with `type(pd.NaT)()` which is not identical to `pd.NaT`
(probably something to change in pandas).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#discussion_r445441117



##
File path: cpp/src/arrow/compute/kernels/vector_sort.cc
##
@@ -88,6 +88,28 @@ struct PartitionIndices {
   }
 };
 
+template 
+inline void VisitRawValueInline(const ArrayType& values,

Review comment:
   "VisitRawValuesInline" (plural) would be better





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


pitrou commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649440284


   I see no significant performance difference here on an AMD Ryzen CPU.
   The weird thing with these performance numbers is that the `null_percent=0` 
case should probably be faster, no?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r445449396



##
File path: docs/source/format/Columnar.rst
##
@@ -566,11 +572,6 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 ::
 
 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-  |Byte 0 (validity bitmap) | Bytes 1-63|
-  |-------------------------|-----------|
-  |00001101                 | 0 (padding)   |
-
 * Types buffer:
 
   |Byte 0   | Byte 1  | Byte 2   | Byte 3   | Bytes 4-63  |

Review comment:
   It seems the types and offsets buffers need to be fixed (the second 
entry isn't unspecified anymore).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r445449521



##
File path: docs/source/format/Columnar.rst
##
@@ -586,13 +587,13 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 * Children arrays:
   * Field-0 array (f: float):
 * Length: 2, nulls: 0
-* Validity bitmap buffer: Not required
+* Validity bitmap buffer: 0101
 
 * Value Buffer:
 
-  | Bytes 0-7 | Bytes 8-63  |
-  |---|-|
-  | 1.2, 3.4  | unspecified |
+  | Bytes 0-11 | Bytes 8-63  |

Review comment:
   Second column should be "bytes 12-63"





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#discussion_r445450752



##
File path: docs/source/python/dataset.rst
##
@@ -325,6 +325,22 @@ The currently available classes are 
:class:`~pyarrow.fs.S3FileSystem` and
 details.
 
 
+Reading from Minio
+------------------
+
+In addition to cloud storage, pyarrow also supports reading from a MinIO object
+storage instance emulating S3 APIs. Paired with toxiproxy, this is useful for
+testing or benchmarking.
+
+.. code-block:: python
+
+from pyarrow import fs
+
+minio = fs.S3FileSystem(scheme="http", endpoint="localhost:9000")

Review comment:
   Add a comment that this assumes MinIO is running unencrypted on local 
port 9000?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#discussion_r445450482



##
File path: docs/source/python/dataset.rst
##
@@ -325,6 +325,22 @@ The currently available classes are 
:class:`~pyarrow.fs.S3FileSystem` and
 details.
 
 
+Reading from Minio
+--
+
+In addition to cloud storage, pyarrow also supports reading from a MinIO object
+storage instance emulating S3 APIs. Paired with toxiproxy, this is useful for

Review comment:
   Can you add hyperlinks to MinIO and toxyproxy?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649457378


   > I see no significant performance difference here on an AMD Ryzen CPU.
   > The weird thing with these performance numbers is that the `null_percent=0` case should probably be faster, no?
   
   Per @wesm's
[comment](https://github.com/apache/arrow/pull/7525#issuecomment-648279829), he
also didn't see a significant performance difference after replacing
BitmapReader with BitBlockCounter.
   
   But I see big differences on AMD EPYC 7251, Intel Gold 5218, and an Arm
server, with both gcc and clang. No idea what's happening.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445470685



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,37 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
     // TODO (hatemhelal): Consider refactoring this to share logic with ReadTable as this
     // does not currently honor the use_threads option.
     std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-    for (size_t i = 0; i < field_readers_.size(); ++i) {
-      RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-      if (columns[i]->num_chunks() > 1) {
-        return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+    int64_t num_rows = -1;
+
+    if (columns.empty()) {
+      num_rows = std::min(batch_size_, *row_group_remaining_size_);

Review comment:
   ```suggestion
      // num_rows cannot be derived from field_readers_ so compute it using
      // row group sizes cached from metadata
      num_rows = std::min(batch_size_, *row_group_remaining_size_);
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445482607



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   we use `std::mutex` in several places in the code base, so this comment
is not fully clear to me

##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   To avoid confusion about this, should we rename ReadPhysicalSchema? Or 
instead of having `ReadPhysicalSchema` which caches and 
`ReadPhysicalSchemaImpl` that always reads, a `GetPhysicalSchema` that caches 
and `ReadPhysicalSchema` that reads?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649482676


   The python dataset tests are crashing on Mac: 
https://github.com/apache/arrow/runs/806974457



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445495780



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}

Review comment:
   Do you also want to add `pd.NA` here, or leave that for a follow-up JIRA?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649500017


   Currently for the ParquetDataset, it also simply uses int32 for the indices. 
   
   Now, there is a more fundamental issue I had not thought of: the actual 
dictionary of the DictionaryArray. Right now, you create a DictionaryArray with 
only the (single) value of the partition field for that specific fragment 
(because we don't keep track of all unique values of a certain partition 
level?). 
   In the python ParquetDataset, however, we create a DictionaryArray with all 
occurring values of that partition field (also for the other fragments/pieces).
   
   To illustrate with a small dataset with `part=A` and `part=B` directories:
   
   ```python
   In [1]: import pyarrow.dataset as ds
   
   In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   
   In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part)
   
   In [10]: fragment = list(dataset.get_fragments())[0] 
   
   In [11]: fragment.to_table(schema=dataset.schema) 
   Out[11]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int32, ordered=0>
   
   # only A included
   In [13]: fragment.to_table(schema=dataset.schema).column("part")  
   Out[13]: 
   
   [
   
 -- dictionary:
   [
 "A"
   ]
 -- indices:
   [
 0,
 0
   ]
   ]
   
   In [15]: import pyarrow.parquet as pq  
   
   In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")  
   
   In [19]: piece = dataset2.pieces[0] 
   
   In [25]: piece.read(partitions=dataset2.partitions) 
   Out[25]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int32, ordered=0>
   
   # both A and B included
   In [26]: piece.read(partitions=dataset2.partitions).column("part")   
   Out[26]: 
   
   [
   
 -- dictionary:
   [
 "A",
 "B"
   ]
 -- indices:
   [
 0,
 0
   ]
   ]
   ```
   
   I think for this to be valuable (e.g. in the context of dask, or for pandas
when reading only a part of the parquet dataset), it's important to get all
values of the partition field. But I am not sure to what extent that fits in
the Dataset design (although I think that during the discovery in the Factory,
we could keep track of all unique values of a partition field?)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


bkietz commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649500754


   @jorisvandenbossche that build comes from the first of two `suggestion` 
commits and it doesn't seem to have crashed with both commits in place. Maybe 
it was ephemeral?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649501921


   Hmm, it failed on the last commit as well, but I restarted that one. And so
it now appears to be green indeed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445512267



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   We cannot use `std::mutex` in public headers due to CPP/CLI forbidding 
use of that header 
https://docs.microsoft.com/en-us/cpp/standard-library/mutex#remarks
   We don't support compilation of Arrow with the `/clr` option so it's 
acceptable to use `std::mutex` in source files and private headers (including 
`arrow/util/mutex.cc`). However arrow headers may be included by projects which 
*are* using `/clr` and usage of `std::mutex` in those public header files would 
break those projects.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649506352


   > I suppose we don't / don't want to support files being changed after the dataset object has been constructed anyways?
   
   No, we don't support this at all. IMHO if fragments are modified that calls 
for the viewing `Dataset` to be reconstructed. @fsaintjacques 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445516148



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   Thanks for the explanation! maybe add: "can't use it directly *in public 
headers*"





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox


bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649514981


   @jorisvandenbossche okay, I'll extend the key value `Partitioning`s to 
maintain dictionaries of all unique values of a field.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox


pitrou commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-649522260


   Ok, I added some tests for error propagation. I'm going to merge if CI stays 
green.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] lidavidm opened a new pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-25 Thread GitBox


lidavidm opened a new pull request #7543:
URL: https://github.com/apache/arrow/pull/7543


   `ArrowBuf.setBytes` has an override that uses an 8-byte-at-a-time copy loop
if the byte buffer does not provide an array and is not direct. Unfortunately,
this means it will mangle data when the byte buffer is big-endian, as it then
writes the data into the little-endian ArrowBuf. This fixes it by setting the
byte order before copying, and then restoring it afterwards.
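
   To illustrate the failure mode (a Python sketch of the byte-order mangling,
not the Java fix): reading a long from a big-endian buffer and then storing it
through a little-endian view reverses its bytes.

   ```python
   import struct

   data = bytes(range(8))            # 00 01 02 03 04 05 06 07

   # Read 8 bytes as a big-endian long, as the copy loop effectively did...
   value = struct.unpack(">q", data)[0]

   # ...then store the long into little-endian memory: bytes come out reversed.
   print(struct.pack("<q", value).hex())   # 0706050403020100
   ```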



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#issuecomment-649534666


   https://issues.apache.org/jira/browse/ARROW-9221



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445523294



##
File path: python/pyarrow/_dataset.pyx
##
@@ -216,22 +216,18 @@ cdef class Expression:
 @staticmethod
 def _scalar(value):
 cdef:
-shared_ptr[CScalar] scalar
-
-if value is None:
-scalar.reset(new CNullScalar())
-elif isinstance(value, bool):
-scalar = MakeScalar(value)
-elif isinstance(value, float):
-scalar = MakeScalar(value)
-elif isinstance(value, int):
-scalar = MakeScalar(value)
-elif isinstance(value, (bytes, str)):
-scalar = MakeStringScalar(tobytes(value))

Review comment:
   I think the `MakeStringScalar` included in libarrow.pxd can then be 
removed (I don't see any other usage of it)

##
File path: python/pyarrow/tests/test_parquet.py
##
@@ -2028,7 +2028,7 @@ def test_filters_invalid_pred_op(tempdir, 
use_legacy_dataset):
 use_legacy_dataset=use_legacy_dataset)
 assert dataset.read().num_rows == 0
 
-with pytest.raises(ValueError if use_legacy_dataset else TypeError):
+with pytest.raises(ValueError if use_legacy_dataset else pa.ArrowInvalid):
 # dataset API returns TypeError when trying to create an invalid comparison

Review comment:
   ```suggestion
   with pytest.raises(ValueError):
   ```
   
   ArrowInvalid is a ValueError, I think, so if this is changed, the above should 
be sufficient.
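   A quick way to confirm the subclass relationship this suggestion relies on 
(a hypothetical sanity check, not part of the diff):
   
   ```python
   import pyarrow as pa

   # ArrowInvalid derives from ValueError, so pytest.raises(ValueError)
   # covers both the legacy and the new dataset code paths.
   assert issubclass(pa.ArrowInvalid, ValueError)
   ```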

##
File path: python/pyarrow/util.py
##
@@ -41,6 +41,24 @@ def wrapper(*args, **kwargs):
 return wrapper
 
 
+def _deprecate_class(old_name, new_class, next_version,
+ instancecheck=True):
+"""
+Raise warning if a deprecated class is used in an isinstance check.

Review comment:
   Shouldn't it also raise when instantiated?

##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,704 @@
 # under the License.
 
 
-_NULL = NA = None
-
-
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-self.type = null()
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
 
 def __eq__(self, other):
-return NA
+# TODO(kszucs): use c++ Equals
+if isinstance(other, Scalar):
+other = other.as_py()
+return self.as_py() == other
 
+def __hash__(self):
+# TODO(kszucs): use C++ hash if implemented for the type
+return hash(self.as_py())
+
+def as_py(self):
+raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
-
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __init__(self):
+pass

[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649548830


   @jianxind I think the bot is broken right now because of the changes I 
recently made in ARROW-9201. @kszucs is going to update it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445569265



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,704 @@
 # under the License.
 
 
-_NULL = NA = None
-
-
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-self.type = null()
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
 
 def __eq__(self, other):
-return NA
+# TODO(kszucs): use c++ Equals
+if isinstance(other, Scalar):
+other = other.as_py()
+return self.as_py() == other
 
+def __hash__(self):
+# TODO(kszucs): use C++ hash if implemented for the type
+return hash(self.as_py())
+
+def as_py(self):
+raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
-
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __init__(self):
+pass
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __eq__(self, other):
+return NA
 
-def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+def as_py(self):
+"""
+Return this value as a Python None.
+"""
+return None
 
-def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
 
-def __hash__(self):
-return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
 """
-Concrete class for boolean array elements.
+Concrete class for boolean scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python bool.
 """
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
 """
-Concrete class for int8 array elements.
+Concrete class for uint8 scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python int.
 """
-cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CUInt8Scalar* sp = <CUInt8Scalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None

[GitHub] [arrow] github-actions[bot] commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-649551591


   https://issues.apache.org/jira/browse/ARROW-9017



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445569869



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,200 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/align_util.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+/// \brief Compress the buffer to spaced, excluding the null entries.
+///
+/// \param[in] src the source buffer
+/// \param[in] num_values the size of source buffer
+/// \param[in] valid_bits bitmap data indicating position of valid slots
+/// \param[in] valid_bits_offset offset into valid_bits
+/// \param[out] output the output buffer spaced
+/// \return The size of spaced buffer.
+template <typename T>
+inline int SpacedCompress(const T* src, int num_values, const uint8_t* 
valid_bits,
+  int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  int idx_src = 0;
+
+  const auto p =
+  arrow::internal::BitmapWordAlign<1>(valid_bits, valid_bits_offset, 
num_values);
+  // First handle the leading bits
+  const int leading_bits = static_cast<int>(p.leading_bits);
+  while (idx_src < leading_bits) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset)) {
+  output[num_valid_values] = src[idx_src];
+  num_valid_values++;
+}
+idx_src++;
+valid_bits_offset++;
+  }
+
+  // The aligned parts scanned with BitBlockCounter
+  arrow::internal::BitBlockCounter data_counter(valid_bits, valid_bits_offset,
+num_values - leading_bits);
+  auto current_block = data_counter.NextWord();
+  while (idx_src < num_values) {
+if (current_block.AllSet()) {  // All true values
+  int run_length = 0;
+  // Scan forward until a block that has some false values (or the end)
+  while (current_block.length > 0 && current_block.AllSet()) {
+run_length += current_block.length;
+current_block = data_counter.NextWord();
+  }
+  // Fill all valid values of this scan
+  std::memcpy(&output[num_valid_values], &src[idx_src], run_length * 
sizeof(T));
+  num_valid_values += run_length;
+  idx_src += run_length;
+  valid_bits_offset += run_length;
+  // The current_block already computed, advance to next loop
+  continue;
+} else if (!current_block.NoneSet()) {  // Some values are null
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,

Review comment:
   One question: is it actually beneficial to use `BitmapReader` here to 
process just a run of 64 bits? In other places where I have used 
BitBlockCounter, I have observed it does not always help





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r445574693



##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TEST(TestStringKernels, Utf8Upper32bitGrowth) {

Review comment:
   I think it's fine to have this test but it definitely should be marked 
as LARGE_MEMORY_TEST





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on a change in pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


cyb70289 commented on a change in pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#discussion_r445575078



##
File path: cpp/src/arrow/compute/kernels/vector_sort.cc
##
@@ -88,6 +88,28 @@ struct PartitionIndices {
   }
 };
 
+template 
+inline void VisitRawValueInline(const ArrayType& values,

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rladeira commented on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox


rladeira commented on issue #1771:
URL: https://github.com/apache/arrow/issues/1771#issuecomment-649554677


   I found the same issue here, using pyarrow version '0.17.1'. I could not 
select columns from a MultiIndex dataframe saved as a parquet file. Is there 
some way to accomplish this? Read just some columns from a MultiIndex parquet 
file? Something like:
   
   `pd.read_parquet('file.parquet', engine='pyarrow', columns=[('a', 'b')])`
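   A workaround sketch until column selection supports MultiIndex tuples 
(assumes the file fits in memory; `'file.parquet'` and the `('a', 'b')` column 
are placeholders): read everything, then select in pandas:
   
   ```python
   import pandas as pd

   df = pd.read_parquet('file.parquet', engine='pyarrow')  # read all columns
   sub = df[[('a', 'b')]]  # select the MultiIndex column after the fact
   ```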



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rladeira edited a comment on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox


rladeira edited a comment on issue #1771:
URL: https://github.com/apache/arrow/issues/1771#issuecomment-649554677


   I found the same issue here, using pyarrow version '0.17.1'. I could not 
select columns from a MultiIndex dataframe saved as a parquet file. Is there 
some way to accomplish this? Read just some columns from a MultiIndex parquet 
file? Something like:
   
   `pd.read_parquet('file.parquet', engine='pyarrow', columns=[('level_0_key', 
'level_1_key')])`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-25 Thread GitBox


wesm commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-649561148


   This already removes more than 2 MB of code from libarrow.so on Linux: great. 
I'll keep an eye on this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649561938


   About the `null_percent=0` case, the benchmark result I attached shows about 
a 19% improvement. Looks reasonable.
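   For readers unfamiliar with the kernel, the idea behind the counting-sort 
path is a stable counting sort over a bounded value range, roughly like this 
Python sketch (simplified: no nulls, value range assumed known):
   
   ```python
   def counting_sort_indices(values, lo, hi):
       """Return indices that stably sort `values`, all within [lo, hi]."""
       counts = [0] * (hi - lo + 2)
       for v in values:
           counts[v - lo + 1] += 1      # histogram, shifted by one slot
       for i in range(1, len(counts)):
           counts[i] += counts[i - 1]   # prefix sums = output offsets
       out = [0] * len(values)
       for idx, v in enumerate(values):
           out[counts[v - lo]] = idx    # emit index, preserving stability
           counts[v - lo] += 1
       return out

   assert counting_sort_indices([3, 1, 2], 1, 3) == [1, 2, 0]
   ```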



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


fsaintjacques commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445096637



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <mutex>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   We use it everywhere; why is it a problem here?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445583091



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <mutex>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   https://github.com/apache/arrow/pull/7526#discussion_r445512267





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


wesm commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649563024


   I'm interested to see if we could add a `BitmapReader::Advance` method so 
that we could employ a hybrid approach to use both BitBlockCounter and 
BitmapReader to get the best of both worlds



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox


wesm commented on issue #7540:
URL: https://github.com/apache/arrow/issues/7540#issuecomment-649571447


   This came up in ARROW-3263 for R. If you'd like, you can open a corresponding 
Python issue. On the face of it, this seems rather difficult to do safely.
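   A minimal sketch of the behavior in question (as far as I can tell from the 
0.17-era API):
   
   ```python
   import pyarrow as pa

   arr = pa.array([1.0, 2.0, None])
   arr.to_numpy(zero_copy_only=False)  # copies; the null becomes nan
   arr.to_numpy()  # zero_copy_only=True by default: raises, because the
                   # validity bitmap cannot be viewed as part of a plain
                   # float64 ndarray
   ```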



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox


wesm closed issue #7540:
URL: https://github.com/apache/arrow/issues/7540


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445593243



##
File path: cpp/src/arrow/python/python_to_arrow.cc
##
@@ -1171,6 +1171,12 @@ Status GetConverterFlat(const std::shared_ptr<DataType>& 
type, bool strict_conve
 
 Status GetConverter(const std::shared_ptr<DataType>& type, bool from_pandas,
 bool strict_conversions, std::unique_ptr<SeqConverter>* 
out) {
+  if (from_pandas) {
+// ARROW-842: If pandas is not installed then null checks will be less
+// comprehensive, but that is okay.
+internal::InitPandasStaticData();

Review comment:
   I see. Well, the problem with putting the pandas initialization 
elsewhere is that we've already arranged for pandas to only be imported when 
it's needed, so if we did `import pandas` always when doing `import pyarrow` 
we'd be breaking that contract. I suggest we address this as a follow-up as needed.
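   A small usage sketch of what the change enables (illustrative only; 
`from_pandas=True` opts into the pandas null conventions):
   
   ```python
   import pandas as pd
   import pyarrow as pa

   # With pandas importable, NaT in an object sequence is treated as null;
   # without pandas, the check is simply skipped.
   arr = pa.array([pd.NaT, pd.Timestamp('2020-01-01')], from_pandas=True)
   assert arr.null_count == 1
   ```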





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445593841



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}
+
+}  // namespace
+
+void InitPandasStaticData() {
+  std::call_once(pandas_static_initialized, GetPandasStaticSymbols);
+}
+
 bool PandasObjectIsNull(PyObject* obj) {
   if (!MayHaveNaN(obj)) {
 return false;
   }
   if (obj == Py_None) {
 return true;
   }
-  if (PyFloat_IsNaN(obj) ||
+  if (PyFloat_IsNaN(obj) || (pandas_NaTType && PyObject_TypeCheck(obj, 
pandas_NaTType)) ||

Review comment:
   Right, unfortunately you can have multiple instances of `pandas.NaT`... 
not great.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445594104



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}

Review comment:
   Sure, I'll add it





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


wesm edited a comment on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649579503


   That crash is ARROW-8999. I'm fairly confident it's a real bug given that it 
happens about 1-5% of the time



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


wesm commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649579503


   That crash is ARROW-8999. I'm fairly confident it's a real bug given that it 
happens about 1-5% of the time



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox


wesm closed pull request #7456:
URL: https://github.com/apache/arrow/pull/7456


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox


wesm commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-649580251


   Looks good here. Merging



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox


wesm commented on issue #1771:
URL: https://github.com/apache/arrow/issues/1771#issuecomment-649581184


   Please direct questions to u...@arrow.apache.org or d...@arrow.apache.org. 
Thank you



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445603562



##
File path: python/pyarrow/util.py
##
@@ -41,6 +41,24 @@ def wrapper(*args, **kwargs):
 return wrapper
 
 
+def _deprecate_class(old_name, new_class, next_version,
+ instancecheck=True):
+"""
+Raise warning if a deprecated class is used in an isinstance check.

Review comment:
   > Really nice!
   > 
   > Could we add a `is_valid` attribute to the python scalar as well? Now the 
only way to check for a null value is to do `.as_py() is None` ?
   
   Definitely!
   
   > 
   > Might not be related to the changes in this PR, but reviewing this 
triggered me to test scalar casting:
   > 
   > ```
   > In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   >...:  
   >...: import pyarrow.compute as pc 
   >...: pc.cast(s, pa.timestamp('ns'))  
   > ../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: 
(batch[0].kind()) == (Datum::ARRAY) 
   > ...
   > Aborted (core dumped)
   > ```
   I think the cast kernel doesn't support scalars yet; on the other hand, the 
scalars have a custom `CastTo` implementation, which we may want to remove in 
favor of `compute.Cast`?
   
   cc @bkietz 
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585468


   FWIW gcc benchmarks (sse4.2) on my machine (ubuntu 18.04 on i9-9960X)
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 emkornfield/ARROW-8504 
master --suite-filter=parquet-arrow
   benchmark  baseline  contender  change %  counters
   35 BM_ReadColumn/75/1  267.930 MiB/sec  275.083 MiB/sec  
   2.670   {'iterations': 2}
   6  BM_ReadColumn/99/0  238.381 MiB/sec  244.286 MiB/sec  
   2.477   {'iterations': 3}
   10   BM_ReadColumn/10/50  270.775 MiB/sec  275.249 MiB/sec  
   1.652   {'iterations': 2}
   31 BM_ReadColumn/25/5  171.549 MiB/sec  174.237 MiB/sec  
   1.567   {'iterations': 2}
   29BM_ReadColumn/-1/1  856.194 MiB/sec  866.945 MiB/sec  
   1.256  {'iterations': 11}
   7BM_WriteColumn   31.593 MiB/sec   31.845 MiB/sec  
   0.797   {'iterations': 1}
   44 BM_ReadIndividualRowGroups  849.469 MiB/sec  856.209 MiB/sec  
   0.793   {'iterations': 6}
   36BM_ReadColumn/-1/0  408.388 MiB/sec  411.490 MiB/sec  
   0.759   {'iterations': 3}
   3  BM_ReadColumn/50/0  151.600 MiB/sec  152.399 MiB/sec  
   0.527   {'iterations': 2}
   32BM_ReadColumn/30/10  268.512 MiB/sec  269.849 MiB/sec  
   0.498   {'iterations': 2}
   8   BM_ReadColumn/-1/20  772.878 MiB/sec  776.716 MiB/sec  
   0.497   {'iterations': 5}
   17   BM_ReadColumn/25/25  277.705 MiB/sec  278.957 MiB/sec  
   0.451   {'iterations': 2}
   23BM_ReadColumn/-1/11.472 GiB/sec1.478 GiB/sec  
   0.414  {'iterations': 13}
   30  BM_ReadColumn/1/1  235.809 MiB/sec  236.730 MiB/sec  
   0.391   {'iterations': 3}
   45 BM_WriteColumn   45.067 MiB/sec   45.232 MiB/sec  
   0.366   {'iterations': 1}
   41BM_ReadColumn/45/25  241.545 MiB/sec  242.427 MiB/sec  
   0.365   {'iterations': 2}
   40   BM_ReadColumn/-1/10  710.485 MiB/sec  712.181 MiB/sec  
   0.239   {'iterations': 5}
   22  BM_ReadColumn/-1/0  211.939 MiB/sec  212.433 MiB/sec  
   0.233  {'iterations': 15}
   39   BM_ReadColumn/-1/50  629.632 MiB/sec  630.955 MiB/sec  
   0.210   {'iterations': 9}
   15  BM_ReadColumn/1/20  145.490 MiB/sec  145.701 MiB/sec  
   0.145  {'iterations': 10}
   9BM_ReadColumn/-1/1  162.373 MiB/sec  162.497 MiB/sec  
   0.077   {'iterations': 4}
   11 BM_ReadColumn/-1/0  257.190 MiB/sec  257.278 MiB/sec  
   0.034   {'iterations': 3}
   13BM_WriteColumn   32.378 MiB/sec   32.386 MiB/sec  
   0.025   {'iterations': 1}
   37BM_ReadColumn/99/50  238.918 MiB/sec  238.975 MiB/sec  
   0.024   {'iterations': 3}
   20   BM_WriteColumn   54.200 MiB/sec   54.209 MiB/sec  
   0.017   {'iterations': 1}
   19  BM_ReadColumn/1/1  369.522 MiB/sec  369.199 MiB/sec  
  -0.087   {'iterations': 2}
   0BM_ReadColumn/-1/501.144 GiB/sec1.143 GiB/sec  
  -0.090  {'iterations': 10}
   34BM_ReadColumn/35/10  262.271 MiB/sec  261.890 MiB/sec  
  -0.145   {'iterations': 2}
   21 BM_ReadColumn/50/1  242.510 MiB/sec  242.067 MiB/sec  
  -0.183   {'iterations': 2}
   5 BM_ReadColumn/50/50  152.344 MiB/sec  152.000 MiB/sec  
  -0.225   {'iterations': 2}
   1 BM_ReadColumn/10/10  179.063 MiB/sec  178.583 MiB/sec  
  -0.268   {'iterations': 2}
   28BM_WriteColumn   64.162 MiB/sec   63.990 MiB/sec  
  -0.269   {'iterations': 1}
   38BM_WriteColumn   69.641 MiB/sec   69.446 MiB/sec  
  -0.280   {'iterations': 1}
   18  BM_WriteColumn   26.643 MiB/sec   26.557 MiB/sec  
  -0.324   {'iterations': 2}
   16   BM_ReadColumn/-1/02.291 GiB/sec2.281 GiB/sec  
  -0.397  {'iterations': 20}
   43 BM_WriteColumn   75.453 MiB/sec   75.050 MiB/sec  
  -0.534   {'iterations': 1}
   33   BM_ReadColumn/5/10  115.534 MiB/sec  114.574 MiB/sec  
  -0.832   {'iterations': 3}
   14BM_ReadColumn/50/50  242.071 MiB/sec  240.005 MiB/sec  
  -0.854   {'iterations': 2}
   4BM_ReadMultipleRowGroups  826.026 MiB/sec  818.709 MiB/sec  
  -0.886   {'iterations': 5}
   24   BM_ReadColumn/-1/10  386.383 MiB/sec  382.213 MiB/sec  
  -1.079   {'iterations': 6}
   12 BM_ReadColumn/-1/0  410.877 MiB/sec  404.700 MiB/sec  
  -1.503   {'iterations': 3}
   25 BM_ReadColumn/10/5  289.612 MiB/sec  284.777 MiB/sec  
  -1.670   {'iterations': 2}
   2 BM_ReadColumn/25/10  277.045 MiB/sec  271.925 MiB/sec  
  -1.848   {'iterations': 2}
   26  BM_ReadColumn/5/5  311.137 MiB/sec  295.823 MiB/sec  
  -4.922   {'iterations': 2}
   42 BM_ReadColumn/99/0  371.803 MiB/sec  347.872 MiB/sec  
  -6.436   {'iterations': 2}
   27BM_ReadColumn/99/50  381.029 MiB/sec  351.329 MiB/sec  
  -7.795   {'iterations': 3}
   ```
   
   Here's clang-11
   
   ```
   $ archery benchmark diff --cc=clang-11 --cxx=clang++-11 
emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
   benchmark  baseline  contender  change %

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585804


   +1, thanks for fixing this @emkornfield!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


wesm closed pull request #7143:
URL: https://github.com/apache/arrow/pull/7143


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445604627



##
File path: python/pyarrow/tests/test_scalars.py
##
@@ -17,426 +17,395 @@
 
 import datetime
 import pytest
-import unittest
 
 import numpy as np
 
 import pyarrow as pa
 
 
-class TestScalars(unittest.TestCase):
-
-def test_null_singleton(self):
-with pytest.raises(Exception):
-pa.NAType()
+@pytest.mark.parametrize(['value', 'ty', 'klass', 'deprecated'], [
+(None, None, pa.NullScalar, pa.NullType),
+(False, None, pa.BooleanScalar, pa.BooleanValue),
+(True, None, pa.BooleanScalar, pa.BooleanValue),
+(1, None, pa.Int64Scalar, pa.Int64Value),
+(-1, None, pa.Int64Scalar, pa.Int64Value),
+(1, pa.int8(), pa.Int8Scalar, pa.Int8Value),
+(1, pa.uint8(), pa.UInt8Scalar, pa.UInt8Value),
+(1, pa.int16(), pa.Int16Scalar, pa.Int16Value),
+(1, pa.uint16(), pa.UInt16Scalar, pa.UInt16Value),
+(1, pa.int32(), pa.Int32Scalar, pa.Int32Value),
+(1, pa.uint32(), pa.UInt32Scalar, pa.UInt32Value),
+(1, pa.int64(), pa.Int64Scalar, pa.Int64Value),
+(1, pa.uint64(), pa.UInt64Scalar, pa.UInt64Value),
+(1.0, None, pa.DoubleScalar, pa.DoubleValue),
+(np.float16(1.0), pa.float16(), pa.HalfFloatScalar, pa.HalfFloatValue),
+(1.0, pa.float32(), pa.FloatScalar, pa.FloatValue),
+("string", None, pa.StringScalar, pa.StringValue),
+(b"bytes", None, pa.BinaryScalar, pa.BinaryValue),
+([1, 2, 3], None, pa.ListScalar, pa.ListValue),
+([1, 2, 3, 4], pa.large_list(pa.int8()), pa.LargeListScalar,
+ pa.LargeListValue),
+# date
+# time
+])
+def test_type_inference(value, ty, klass, deprecated):
+s = pa.scalar(value, type=ty)
+assert isinstance(s, klass)
+assert s == value
+with pytest.warns(FutureWarning):
+isinstance(s, deprecated)
+
+
+def test_null_singleton():
+with pytest.raises(Exception):
+pa.NullScalar()
+
+
+def test_nulls():
+arr = pa.array([None, None])
+for v in arr:
+assert v is pa.NA
+assert v.as_py() is None
+
+
+def test_null_equality():
+assert (pa.NA == pa.NA) is pa.NA
+assert (pa.NA == 1) is pa.NA

Review comment:
   Great question, I assume it should follow the same semantics as `Null` 
does.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-649590873


   > Really nice!
   > 
   > Could we add a `is_valid` attribute to the python scalar as well? Now the 
only way to check for a null value is to do `.as_py() is None` ?
   
   Definitely!
   
   > 
   > Might not be related to the changes in this PR, but reviewing this 
triggered me to test scalar casting:
   > 
   > ```
   > In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   >...:  
   >...: import pyarrow.compute as pc 
   >...: pc.cast(s, pa.timestamp('ns'))  
   > ../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: 
(batch[0].kind()) == (Datum::ARRAY) 
   > ...
   > Aborted (core dumped)
   > ```
   I think the cast kernel doesn't support scalars yet; on the other hand, the 
scalars have a custom `CastTo` implementation, which we may want to remove in 
favor of `compute.Cast`?
   
   cc @bkietz 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445603562



##
File path: python/pyarrow/util.py
##
@@ -41,6 +41,24 @@ def wrapper(*args, **kwargs):
 return wrapper
 
 
+def _deprecate_class(old_name, new_class, next_version,
+ instancecheck=True):
+"""
+Raise warning if a deprecated class is used in an isinstance check.

Review comment:
   The deprecated classes cannot be instantiated, so we don't need to worry 
about it for now - although we can add support for overriding `__init__` for 
later reuse.
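   A minimal sketch of how such a shim can work (my reading, not necessarily 
the exact pyarrow implementation): a dynamically created subclass whose 
metaclass warns on `isinstance` checks:
   
   ```python
   import warnings

   def _deprecate_class(old_name, new_class, next_version):
       """Create a placeholder class that warns in isinstance() checks."""
       class _DeprecatedMeta(type):
           def __instancecheck__(cls, other):
               warnings.warn(
                   "pyarrow.{} is deprecated, use {} instead; it will be "
                   "removed in {}".format(old_name, new_class.__name__,
                                          next_version),
                   FutureWarning, stacklevel=2)
               return isinstance(other, new_class)

       # The placeholder subclasses new_class, so attribute access still
       # works, but it is never instantiated directly.
       return _DeprecatedMeta(old_name, (new_class,), {})
   ```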





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


fsaintjacques commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445605965



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,39 @@ class RowGroupRecordBatchReader : public 
::arrow::RecordBatchReader {
 // TODO (hatemhelal): Consider refactoring this to share logic with 
ReadTable as this
 // does not currently honor the use_threads option.
 std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-for (size_t i = 0; i < field_readers_.size(); ++i) {
-  RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-  if (columns[i]->num_chunks() > 1) {
-return Status::NotImplemented("This class cannot yet iterate chunked 
arrays");
+int64_t num_rows = -1;
+
+if (columns.empty()) {
+  // num_rows cannot be derived from field_readers_ so compute it using
+  // row group sizes cached from metadata
+  num_rows = std::min(batch_size_, *row_group_remaining_size_);
+  *row_group_remaining_size_ -= num_rows;
+  if (*row_group_remaining_size_ == 0) {
+++row_group_remaining_size_;
   }
+} else {
+  for (size_t i = 0; i < field_readers_.size(); ++i) {
+RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
+if (columns[i]->num_chunks() > 1) {
+  return Status::NotImplemented("This class cannot yet iterate chunked 
arrays");
+}
+  }
+  num_rows = columns[0]->length();
 }
 
 // Create an intermediate table and use TableBatchReader as an adaptor to a
 // RecordBatch
-std::shared_ptr<Table> table = Table::Make(schema_, columns);
+std::shared_ptr<Table> table = Table::Make(schema_, columns, num_rows);
+
 RETURN_NOT_OK(table->Validate());
 ::arrow::TableBatchReader table_batch_reader(*table);
 return table_batch_reader.ReadNext(out);
   }
 
  private:
+  std::shared_ptr<FileMetaData> metadata_;

Review comment:
   `metadata_` is not used.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7541:
URL: https://github.com/apache/arrow/pull/7541#issuecomment-649595034


   > Can you try using `git clone --shared` instead? It should avoid the copy.
   
   `--shared` works okay per my test. What about simply replacing `git clone 
--local` with `git clone --shared`?
   From the man page, the only **danger** of `--shared` is if I delete and prune 
the test branch while the related test is running, which does not look like a 
real risk.
   
   ```
  --shared, -s
  When the repository to clone is on the local machine, instead of 
using hard links, automatically setup .git/objects/info/alternates to share the
  objects with the source repository. The resulting repository 
starts out without any object of its own.
   
  NOTE: this is a possibly dangerous operation; do not use it 
unless you understand what it does. If you clone your repository using this 
option
  and then delete branches (or use any other Git command that makes 
any existing commit unreferenced) in the source repository, some objects may
  become unreferenced (or dangling). These objects may be removed 
by normal Git operations (such as git commit) which automatically call git gc
  --auto. (See git-gc(1).) If these objects are removed and were 
referenced by the cloned repository, then the cloned repository will become
  corrupt.
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


fsaintjacques commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600179


   Correct, we don't support the file changing underneath. For now we can 
preach YAGNI on this as it is a rare use case. If this is required, the 
consumer can create a custom dataset/fragment.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


fsaintjacques commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600834


   The test is failing on Windows. 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33736283/job/mshkl837y3b5v5u6



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


wesm commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649601444


   Here's my benchmarks on i9-9960X
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 cyb70289/sort master 
--suite-filter=vector-sort
   benchmark  baseline  contender  change %  counters
   12   SortToIndicesInt64Count/32768/10/min_time:1.0001.233 GiB/sec
2.062 GiB/sec67.239{'iterations': 56417, 'null_percent': 10.0}
   1   SortToIndicesInt64Count/32768/100/min_time:1.0001.605 GiB/sec
2.033 GiB/sec26.628 {'iterations': 73247, 'null_percent': 1.0}
   3 SortToIndicesInt64Count/32768/0/min_time:1.0002.438 GiB/sec
2.850 GiB/sec16.911{'iterations': 111533, 'null_percent': 0.0}
   7 SortToIndicesInt64Count/32768/2/min_time:1.000  805.132 MiB/sec  
854.275 MiB/sec 6.104{'iterations': 36039, 'null_percent': 50.0}
   11SortToIndicesInt64Compare/32768/100/min_time:1.000  147.729 MiB/sec  
149.201 MiB/sec 0.996  {'iterations': 6588, 'null_percent': 1.0}
   2   SortToIndicesInt64Count/1048576/1/min_time:1.0004.559 GiB/sec
4.580 GiB/sec 0.458{'iterations': 6503, 'null_percent': 100.0}
   5 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.554 GiB/sec
4.560 GiB/sec 0.136{'iterations': 6577, 'null_percent': 100.0}
   0   SortToIndicesInt64Compare/32768/1/min_time:1.000  148.181 MiB/sec  
148.368 MiB/sec 0.126 {'iterations': 6659, 'null_percent': 0.01}
   9   SortToIndicesInt64Compare/32768/2/min_time:1.000  245.628 MiB/sec  
245.887 MiB/sec 0.106{'iterations': 11171, 'null_percent': 50.0}
   6  SortToIndicesInt64Compare/32768/10/min_time:1.000  156.791 MiB/sec  
156.841 MiB/sec 0.032 {'iterations': 7073, 'null_percent': 10.0}
   4   SortToIndicesInt64Count/8388608/1/min_time:1.0004.276 GiB/sec
4.269 GiB/sec-0.155 {'iterations': 772, 'null_percent': 100.0}
   10SortToIndicesInt64Compare/8388608/1/min_time:1.0004.286 GiB/sec
4.273 GiB/sec-0.305 {'iterations': 769, 'null_percent': 100.0}
   8   SortToIndicesInt64Compare/32768/1/min_time:1.0004.677 GiB/sec
4.648 GiB/sec-0.634  {'iterations': 209517, 'null_percent': 100.0}
   14SortToIndicesInt64Count/32768/1/min_time:1.0004.661 GiB/sec
4.617 GiB/sec-0.945  {'iterations': 214256, 'null_percent': 100.0}
   15  SortToIndicesInt64Compare/32768/0/min_time:1.000  151.334 MiB/sec  
147.025 MiB/sec-2.847  {'iterations': 6660, 'null_percent': 0.0}
   13SortToIndicesInt64Count/32768/1/min_time:1.0002.312 GiB/sec
2.131 GiB/sec-7.844{'iterations': 91492, 'null_percent': 0.01}
   ```
   
   and clang-11
   
   ```
   benchmark  baseline  contender  change %  counters
   15SortToIndicesInt64Count/32768/0/min_time:1.0002.069 GiB/sec
3.490 GiB/sec68.665 {'iterations': 95577, 'null_percent': 0.0}
   13   SortToIndicesInt64Count/32768/10/min_time:1.0001.288 GiB/sec
1.895 GiB/sec47.099{'iterations': 59504, 'null_percent': 10.0}
   10  SortToIndicesInt64Count/32768/100/min_time:1.0001.551 GiB/sec
1.993 GiB/sec28.456 {'iterations': 71755, 'null_percent': 1.0}
   2 SortToIndicesInt64Count/32768/1/min_time:1.0001.975 GiB/sec
2.069 GiB/sec 4.771{'iterations': 87376, 'null_percent': 0.01}
   3   SortToIndicesInt64Compare/32768/0/min_time:1.000  155.058 MiB/sec  
158.211 MiB/sec 2.033  {'iterations': 6935, 'null_percent': 0.0}
   9   SortToIndicesInt64Compare/32768/2/min_time:1.000  248.515 MiB/sec  
252.036 MiB/sec 1.417{'iterations': 11676, 'null_percent': 50.0}
   0   SortToIndicesInt64Compare/32768/1/min_time:1.0004.678 GiB/sec
4.735 GiB/sec 1.222  {'iterations': 214204, 'null_percent': 100.0}
   8 SortToIndicesInt64Count/32768/1/min_time:1.0004.677 GiB/sec
4.729 GiB/sec 1.110  {'iterations': 213435, 'null_percent': 100.0}
   5   SortToIndicesInt64Compare/32768/1/min_time:1.000  151.573 MiB/sec  
151.744 MiB/sec 0.113 {'iterations': 6769, 'null_percent': 0.01}
   1  SortToIndicesInt64Compare/32768/10/min_time:1.000  162.339 MiB/sec  
162.226 MiB/sec-0.069 {'iterations': 7284, 'null_percent': 10.0}
   14SortToIndicesInt64Compare/32768/100/min_time:1.000  151.879 MiB/sec  
151.589 MiB/sec-0.191  {'iterations': 6837, 'null_percent': 1.0}
   7 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.570 GiB/sec
4.553 GiB/sec-0.383{'iterations': 6579, 'null_percent': 100.0}
   6   SortToIndicesInt64Count/1048576/1/min_time:1.0004.569 GiB/sec
4.550 GiB/sec-0.407{'iterations': 6459, 'null_percent
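   
   For context, a minimal Python sketch of counting sort producing sort
   indices, the technique the `SortToIndicesInt64Count` benchmarks above
   exercise (an illustration only, not the Arrow C++ implementation):
   
   ```python
   def sort_to_indices_count(values, lo, hi):
       # Counting sort runs in O(n + k), k = hi - lo + 1, so it beats a
       # comparison sort when the value range is small relative to n.
       counts = [0] * (hi - lo + 2)
       for v in values:
           counts[v - lo + 1] += 1
       for i in range(1, len(counts)):   # prefix sum -> start offsets
           counts[i] += counts[i - 1]
       indices = [0] * len(values)
       for i, v in enumerate(values):    # stable scatter of positions
           indices[counts[v - lo]] = i
           counts[v - lo] += 1
       return indices
   
   # ties between the two 1s keep their original order (stable)
   assert sort_to_indices_count([3, 1, 2, 1], 1, 3) == [1, 3, 2, 0]
   ```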

[GitHub] [arrow] wesm closed pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


wesm closed pull request #7542:
URL: https://github.com/apache/arrow/pull/7542


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445625710



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,39 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
     // TODO (hatemhelal): Consider refactoring this to share logic with ReadTable as this
     // does not currently honor the use_threads option.
     std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-    for (size_t i = 0; i < field_readers_.size(); ++i) {
-      RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-      if (columns[i]->num_chunks() > 1) {
-        return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+    int64_t num_rows = -1;
+
+    if (columns.empty()) {
+      // num_rows cannot be derived from field_readers_ so compute it using
+      // row group sizes cached from metadata
+      num_rows = std::min(batch_size_, *row_group_remaining_size_);
+      *row_group_remaining_size_ -= num_rows;
+      if (*row_group_remaining_size_ == 0) {
+        ++row_group_remaining_size_;
       }
+    } else {
+      for (size_t i = 0; i < field_readers_.size(); ++i) {
+        RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
+        if (columns[i]->num_chunks() > 1) {
+          return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+        }
+      }
+      num_rows = columns[0]->length();
     }
 
     // Create an intermediate table and use TableBatchReader as an adaptor to a
     // RecordBatch
-    std::shared_ptr<Table> table = Table::Make(schema_, columns);
+    std::shared_ptr<Table> table = Table::Make(schema_, columns, num_rows);
+
     RETURN_NOT_OK(table->Validate());
     ::arrow::TableBatchReader table_batch_reader(*table);
     return table_batch_reader.ReadNext(out);
   }
 
  private:
+  std::shared_ptr<FileMetaData> metadata_;

Review comment:
   I'll remove it and rebase





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox


fsaintjacques commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649607214


   I'm also of the opinion that we should stick with int32_t. That's what parquet uses for dictionary columns, that's what R uses for factor columns, that's what we use by default in DictionaryType, etc. I suspect the short-term net effect of this will be uncovering index-type issues we have lying around.
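   
   For reference, a quick pyarrow illustration of that convention (a sketch,
   not part of this PR; `pa.dictionary` takes the index type explicitly):
   
   ```python
   import pyarrow as pa
   
   # int32 indices, the conventional choice argued for above
   dict_type = pa.dictionary(pa.int32(), pa.string())
   print(dict_type.index_type)  # int32
   ```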



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#issuecomment-649612663


   +1, will merge this once build is green



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 opened a new pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-25 Thread GitBox


liyafan82 opened a new pull request #7544:
URL: https://github.com/apache/arrow/pull/7544


   https://issues.apache.org/jira/browse/ARROW-7285



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#issuecomment-649627454


   https://issues.apache.org/jira/browse/ARROW-7285



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche opened a new pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


jorisvandenbossche opened a new pull request #7545:
URL: https://github.com/apache/arrow/pull/7545


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649631989


   Still some work:
   
   Need to add tests for the different filesystems that can be passed.
   
   There are still some skipped tests:
   
   * `ARROW:schema` is not yet removed from the metadata -> ARROW-9009
   * Partition fields as dictionary keys
   * Specifying `metadata` object (not very important IMO)
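   
   For illustration, a minimal sketch of the code path this PR switches on by
   default (`use_legacy_dataset=False` already forces it today; with the
   datasets API underneath, filters can reference any column, not just
   partition keys):
   
   ```python
   import pyarrow.parquet as pq
   
   # row-group level filtering on a regular (non-partition) column
   table = pq.read_table("dataset_root/", use_legacy_dataset=False,
                         filters=[("value", ">", 0)])
   ```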
   
   One of the `large_memory` tests is also failing 
(`test_binary_array_overflow_to_chunked`):
   
   ```
   $ pytest python/pyarrow/tests/test_parquet.py -v -r s -m large_memory --enable-large_memory
   ============================= test session starts ==============================
   platform linux -- Python 3.7.3, pytest-5.2.1, py-1.8.0, pluggy-0.12.0 -- /home/joris/miniconda3/envs/arrow-dev/bin/python
   cachedir: .pytest_cache
   hypothesis profile 'dev' -> max_examples=10, database=DirectoryBasedExampleDatabase('/home/joris/scipy/repos/arrow/.hypothesis/examples')
   rootdir: /home/joris/scipy/repos/arrow/python, inifile: setup.cfg
   plugins: hypothesis-4.47.5, lazy-fixture-0.6.1
   collected 277 items / 273 deselected / 4 selected

   python/pyarrow/tests/test_parquet.py::test_large_table_int32_overflow PASSED        [ 25%]
   python/pyarrow/tests/test_parquet.py::test_byte_array_exactly_2gb PASSED            [ 50%]
   python/pyarrow/tests/test_parquet.py::test_binary_array_overflow_to_chunked FAILED  [ 75%]
   python/pyarrow/tests/test_parquet.py::test_list_of_binary_large_cell PASSED         [100%]

   =================================== FAILURES ===================================
   ____________________ test_binary_array_overflow_to_chunked ____________________

   assert t.equals(result)

       @pytest.mark.pandas
       @pytest.mark.large_memory
       def test_binary_array_overflow_to_chunked():
           # ARROW-3762

           # 2^31 + 1 bytes
           values = [b'x'] + [
               b'x' * (1 << 20)
           ] * 2 * (1 << 10)
   >       df = pd.DataFrame({'byte_col': values})

   python/pyarrow/tests/test_parquet.py:3043:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   python/pyarrow/tests/test_parquet.py:3010: in _simple_table_roundtrip
       stream = pa.BufferOutputStream()
   python/pyarrow/tests/test_parquet.py:82: in _read_table
       return pq.read_table(*args, **kwargs)
   python/pyarrow/parquet.py:1555: in read_table
       raise ValueError(
   python/pyarrow/parquet.py:1468: in read
       use_threads=use_threads
   pyarrow/_dataset.pyx:403: in pyarrow._dataset.Dataset.to_table
       ???
   pyarrow/_dataset.pyx:1893: in pyarrow._dataset.Scanner.to_table
       ???
   pyarrow/error.pxi:122: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

   >   ???
   E   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays

   pyarrow/error.pxi:105: ArrowNotImplementedError
   ========== 1 failed, 3 passed, 273 deselected in 512.87s (0:08:32) ===========
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] bkietz opened a new pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


bkietz opened a new pull request #7546:
URL: https://github.com/apache/arrow/pull/7546


   ```python
   stats = parquet_fragment.row_groups[0].statistics
   assert stats == {
 'normal_column': {'min': 1, 'max': 2},
 'all_null_column': {'min': None, 'max': None},
 'column_without_stats': None,
   }
   ```
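   
   For a fuller picture, a hedged sketch of reaching these statistics through
   the datasets API (assuming the parquet fragments carry row group metadata;
   `row_groups` may be None depending on how the fragment was created):
   
   ```python
   import pyarrow.dataset as ds
   
   dataset = ds.dataset("data/", format="parquet")
   for fragment in dataset.get_fragments():
       # row_groups is specific to ParquetFileFragment
       for row_group in fragment.row_groups or []:
           print(fragment.path, row_group.num_rows, row_group.statistics)
   ```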



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-649634658


   @bkietz Would like your input on the dataset changes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou opened a new pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou opened a new pull request #7547:
URL: https://github.com/apache/arrow/pull/7547


   Add FileSystem::OpenInput{Stream,File} overrides that accept a FileInfo parameter.
   This can be used to optimize file opening when the file size and existence are already known.  Concretely, this avoids a HEAD request on S3.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-649636223


   https://issues.apache.org/jira/browse/ARROW-8733



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649636228


   https://issues.apache.org/jira/browse/ARROW-9139



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-649636318


   Note to self: instead of basing the default `OpenInputStream(const 
FileInfo&)` on `OpenInputStream(const std::string&)`, it would be more logical 
to do the reverse, so that no subclass needs to override both.
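   
   A minimal Python sketch of that inversion (names here are hypothetical, not
   the actual Arrow C++ API): make the FileInfo-taking method the primitive and
   build the path-taking convenience on top of it, so a subclass overrides only
   one method and pays the stat (a HEAD on S3) only when no FileInfo is in hand.
   
   ```python
   class FileSystem:
       def get_file_info(self, path):
           """Stat the file; on S3 this costs a HEAD request."""
           raise NotImplementedError
   
       def open_input_stream_from_info(self, info):
           """Primitive: open using already-known size/existence."""
           raise NotImplementedError
   
       def open_input_stream(self, path):
           # Convenience default built on the primitive above
           return self.open_input_stream_from_info(self.get_file_info(path))
   ```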



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-649636224


   https://issues.apache.org/jira/browse/ARROW-8950



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#discussion_r445661743



##
File path: python/pyarrow/_dataset.pyx
##
@@ -845,6 +845,28 @@ cdef class RowGroupInfo:
     def num_rows(self):
         return self.info.num_rows()
 
+    @property
+    def statistics(self):
+        if not self.info.HasStatistics():
+            return None
+
+        cdef:
+            CStructScalar* c_statistics
+            CStructScalar* c_minmax
+
+        statistics = dict()
+        c_statistics = self.info.statistics().get()
+        for i in range(c_statistics.value.size()):
+            name = frombytes(c_statistics.type.get().field(i).get().name())
+            c_minmax = <CStructScalar*> c_statistics.value[i].get()
+
+            statistics[name] = {
+                'min': pyarrow_wrap_scalar(c_minmax.value[0]).as_py(),
+                'max': pyarrow_wrap_scalar(c_minmax.value[1]).as_py(),
+            }
+
+        return statistics
+

Review comment:
   Do we want to expose the `statistics_expression` as well? (not fully 
sure if it would have a use case, so maybe we should only do that if we have 
one)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche closed pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-25 Thread GitBox


jorisvandenbossche closed pull request #7523:
URL: https://github.com/apache/arrow/pull/7523


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-649644745


   @rjzamora indeed, something like that. I am not sure you need to keep track of the path as well, unless perhaps to make it work with existing functions that determine sorted columns from this (but that's more something to discuss on the dask issue/PR).
   
   In the meantime, @bkietz has done the more complete implementation -> https://github.com/apache/arrow/pull/7546



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



