[GitHub] [arrow] ovr commented on pull request #9449: ARROW-11563: [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))

2021-02-12 Thread GitBox


ovr commented on pull request #9449:
URL: https://github.com/apache/arrow/pull/9449#issuecomment-778066965


   Just a notice: Rebased



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-02-12 Thread GitBox


liyafan82 commented on a change in pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#discussion_r575091316



##
File path: java/vector/pom.xml
##
@@ -74,6 +74,11 @@
   org.slf4j
   slf4j-api
 
+

Review comment:
   @emkornfield Sounds reasonable. I will try to revise the PR accordingly. 
Thanks for the good suggestion. 









[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-02-12 Thread GitBox


liyafan82 commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-778084083


   > @liyafan82 could you enable the java integration test to confirm that 
reading the files generated by C++ works before we merge (once we verify it is 
working I can take a final look)
   
   Sure. I will do some tests for that. 







[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

2021-02-12 Thread GitBox


alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778126649


   @seddonm1  -- what do you think about merge order of this PR and #9243 ? 
(which will conflict)







[GitHub] [arrow] alamb commented on pull request #9449: ARROW-11563: [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))

2021-02-12 Thread GitBox


alamb commented on pull request #9449:
URL: https://github.com/apache/arrow/pull/9449#issuecomment-778131728


   Thanks again @ovr !







[GitHub] [arrow] alamb closed pull request #9449: ARROW-11563: [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))

2021-02-12 Thread GitBox


alamb closed pull request #9449:
URL: https://github.com/apache/arrow/pull/9449


   







[GitHub] [arrow] seddonm1 commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

2021-02-12 Thread GitBox


seddonm1 commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778133305


   Unfortunately (for me), this one logically does go first, as being able to 
identify a ScalarValue would give a huge performance advantage. 
   
   I am happy to rework the other one after this is merged.







[GitHub] [arrow] alamb commented on pull request #9402: ARROW-11481: [Rust] More cast implementations

2021-02-12 Thread GitBox


alamb commented on pull request #9402:
URL: https://github.com/apache/arrow/pull/9402#issuecomment-778133435


   @jorgecarleitao  do you think this one is ready to go ?







[GitHub] [arrow] alamb closed pull request #9445: ARROW-11557: [Rust][Datafusion] Add deregister_table

2021-02-12 Thread GitBox


alamb closed pull request #9445:
URL: https://github.com/apache/arrow/pull/9445


   







[GitHub] [arrow] alamb commented on pull request #9445: ARROW-11557: [Rust][Datafusion] Add deregister_table

2021-02-12 Thread GitBox


alamb commented on pull request #9445:
URL: https://github.com/apache/arrow/pull/9445#issuecomment-778134379


   Thanks @marcprux !







[GitHub] [arrow] alamb commented on pull request #9412: ARROW-11491: [Rust] support JSON schema inference for nested list and struct

2021-02-12 Thread GitBox


alamb commented on pull request #9412:
URL: https://github.com/apache/arrow/pull/9412#issuecomment-778134781


   @houqp  -- I think this one is ready to go other than 
   
   > The only question I think should be answered / explained before this is 
merged is why the checked in file test/data/mixed_arrays.json ends up with one 
fewer column in the inferred schema after this PR.
   
   







[GitHub] [arrow] alamb commented on pull request #9353: ARROW-11420: [Rust] Added support to length of Binary and List.

2021-02-12 Thread GitBox


alamb commented on pull request #9353:
URL: https://github.com/apache/arrow/pull/9353#issuecomment-778135054


   @jorgecarleitao  -- this one needs a rebase and then I think it is ready to 
go







[GitHub] [arrow] aitor94 commented on pull request #8491: ARROW-10349: [Python] linux aarch64 wheels

2021-02-12 Thread GitBox


aitor94 commented on pull request #8491:
URL: https://github.com/apache/arrow/pull/8491#issuecomment-778135205


   I'm trying to install pyarrow on an AWS m6g instance and I can't. This PR 
may help solve my issue: 
[Stack Overflow question](https://stackoverflow.com/questions/64928357/how-to-install-pyarrow-in-amazon-linux-2-m6g-instance?noredirect=1#comment116973612_64928357)
   
   Thanks







[GitHub] [arrow] alamb commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

2021-02-12 Thread GitBox


alamb commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-778135913


   This one is now ready for review. cc @andygrove @seddonm1  @jhorstmann and 
@Dandandan 







[GitHub] [arrow] alamb commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

2021-02-12 Thread GitBox


alamb commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575154628



##
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##
 @@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
-/// Create a key `Vec<u8>` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary indices themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This approach would likely work (very) well for the common case,
+/// but it also has to handle the case where the dictionary itself
+/// is not the same across all record batches (and thus indexes in one
+/// record batch may not correspond to the same index in another)
+fn dictionary_create_key_for_col<K: ArrowDictionaryKeyType>(
+    col: &ArrayRef,
     row: usize,
     vec: &mut Vec<u8>,
 ) -> Result<()> {
-    vec.clear();
-    for col in group_by_keys {
-        match col.data_type() {
-            DataType::Boolean => {
-                let array = col.as_any().downcast_ref::<BooleanArray>().unwrap();
-                vec.extend_from_slice(&[array.value(row) as u8]);
-            }
-            DataType::Float32 => {
-                let array = col.as_any().downcast_ref::<Float32Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::Float64 => {
-                let array = col.as_any().downcast_ref::<Float64Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt8 => {
-                let array = col.as_any().downcast_ref::<UInt8Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt16 => {
-                let array = col.as_any().downcast_ref::<UInt16Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt32 => {
-                let array = col.as_any().downcast_ref::<UInt32Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt64 => {
-                let array = col.as_any().downcast_ref::<UInt64Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
+    let dict_col = col.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
+
+    // look up the index in the values dictionary
+    let keys_col = dict_col.keys_array();
+    let values_index = keys_col.value(row).to_usize().ok_or_else(|| {
+        DataFusionError::Internal(format!(
+            "Can not convert index to usize in dictionary of type creating group by value {:?}",
+            keys_col.data_type()
+        ))
+    })?;
+
+    create_key_for_col(&dict_col.values(), values_index, vec)
+}
+
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map
+fn create_key_for_col(col: &ArrayRef, row: usize, vec: &mut Vec<u8>) -> Result<()> {

Review comment:
    This PR looks larger than it really is -- all it does is lift the 
per-column treatment of arrays into its own function (so it can be called 
recursively) and then adds handling for dictionary support. 
    
    So while GitHub renders the diff as a large change, it is a very small 
logic change. 
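   The caveat called out in the new doc comments -- that the dictionary itself may differ between record batches, so raw dictionary indexes are not stable group keys -- can be illustrated with a small std-only sketch (hypothetical data, not the arrow crate):

   ```rust
   use std::collections::HashMap;

   fn main() {
       // batch 1: dictionary ["a", "b"], keys index into it
       let dict1 = vec!["a", "b"];
       let keys1 = vec![0usize, 1, 0]; // logical values: a, b, a

       // batch 2: the same logical values, but the dictionary order differs,
       // so index 0 now means "b" -- grouping on raw indexes would conflate
       // distinct values across batches
       let dict2 = vec!["b", "a"];
       let keys2 = vec![1usize, 0, 1]; // logical values: a, b, a

       let mut counts: HashMap<&str, usize> = HashMap::new();
       for (dict, keys) in [(&dict1, &keys1), (&dict2, &keys2)] {
           for &k in keys.iter() {
               // look the value up in this batch's dictionary before using
               // it as the group key
               *counts.entry(dict[k]).or_insert(0) += 1;
           }
       }
       assert_eq!(counts["a"], 4);
       assert_eq!(counts["b"], 2);
   }
   ```

   Grouping on `k` directly would have produced different totals for the two batches, which is why the PR looks up values rather than hashing the indexes.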









[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

2021-02-12 Thread GitBox


alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778137269


   Sounds good -- @jorgecarleitao  let's get it merged ! It looks like it needs 
another rebase and then we'll get it in







[GitHub] [arrow] nevi-me commented on a change in pull request #9353: ARROW-11420: [Rust] Added support to length of Binary and List.

2021-02-12 Thread GitBox


nevi-me commented on a change in pull request #9353:
URL: https://github.com/apache/arrow/pull/9353#discussion_r575168065



##
File path: rust/arrow/src/compute/kernels/length.rs
##
 @@ -24,42 +24,77 @@ use crate::{
 };
 use std::sync::Arc;
 
-fn length_string<OffsetSize>(array: &Array, data_type: DataType) -> Result<ArrayRef>
-where
-    OffsetSize: OffsetSizeTrait,
-{
-    // note: offsets are stored as u8, but they can be interpreted as OffsetSize
-    let offsets = &array.data_ref().buffers()[0];
-    // this is a 30% improvement over iterating over u8s and building OffsetSize, which
-    // justifies the usage of `unsafe`.
-    let slice: &[OffsetSize] =
-        &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..];
+fn clone_null_buffer(array: &impl Array) -> Option<Buffer> {
+    array
+        .data_ref()
+        .null_bitmap()

Review comment:
   `.null_buffer().cloned()` ?
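   For context on the suggestion: `Option::cloned` turns an `Option<&T>` into an `Option<T>` in one call, reading more directly than a manual `map(|b| b.clone())`. A minimal std-only sketch (the `Buffer` type here is a stand-in, not arrow's):

   ```rust
   #[derive(Clone, Debug, PartialEq)]
   struct Buffer(Vec<u8>);

   // a stand-in accessor that hands out a borrowed optional buffer
   fn null_buffer(opt: &Option<Buffer>) -> Option<&Buffer> {
       opt.as_ref()
   }

   fn main() {
       let stored = Some(Buffer(vec![0b1010_1010]));
       // equivalent to .map(|b| b.clone()), but shorter
       let cloned: Option<Buffer> = null_buffer(&stored).cloned();
       assert_eq!(cloned, stored);
   }
   ```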









[GitHub] [arrow] lidavidm closed pull request #9433: ARROW-11539: [Developer][Archery] Change items_per_seconds units

2021-02-12 Thread GitBox


lidavidm closed pull request #9433:
URL: https://github.com/apache/arrow/pull/9433


   







[GitHub] [arrow] marcprux opened a new pull request #9479: ARROW-11586: [Rust][Datafusion][WIP] Remove force unwrap

2021-02-12 Thread GitBox


marcprux opened a new pull request #9479:
URL: https://github.com/apache/arrow/pull/9479


   Fix for https://issues.apache.org/jira/browse/ARROW-11586







[GitHub] [arrow] github-actions[bot] commented on pull request #9479: ARROW-11586: [Rust][Datafusion][WIP] Remove force unwrap

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9479:
URL: https://github.com/apache/arrow/pull/9479#issuecomment-778196690


   https://issues.apache.org/jira/browse/ARROW-11586







[GitHub] [arrow] codecov-io commented on pull request #9479: ARROW-11586: [Rust][Datafusion][WIP] Remove force unwrap

2021-02-12 Thread GitBox


codecov-io commented on pull request #9479:
URL: https://github.com/apache/arrow/pull/9479#issuecomment-778207974


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=h1) Report
   > Merging 
[#9479](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=desc) (ed800b7) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc)
 (5e3fcfa) will **decrease** coverage by `0.01%`.
   > The diff coverage is `88.57%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9479/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9479      +/-   ##
   ==========================================
   - Coverage   82.32%   82.31%   -0.02%
   ==========================================
     Files         233      234       +1
     Lines       54446    54461      +15
   ==========================================
   + Hits        44823    44828       +5
   - Misses       9623     9633      +10
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==)
 | `84.93% <ø> (ø)` | |
   | 
[rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=)
 | `93.56% <ø> (ø)` | |
   | 
[rust/arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3N1YnN0cmluZy5ycw==)
 | `98.29% <ø> (ø)` | |
   | 
[rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz)
 | `98.92% <ø> (ø)` | |
   | 
[rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=)
 | `82.82% <ø> (ø)` | |
   | 
[rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==)
 | `38.33% <0.00%> (+0.18%)` | :arrow_up: |
   | 
[rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==)
 | `80.00% <ø> (ø)` | |
   | 
[rust/datafusion/src/logical\_plan/dfschema.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGZzY2hlbWEucnM=)
 | `85.52% <ø> (ø)` | |
   | 
[...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz)
 | `93.28% <ø> (+0.24%)` | :arrow_up: |
   | 
[...atafusion/src/physical\_plan/expressions/in\_list.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2luX2xpc3QucnM=)
 | `75.37% <ø> (ø)` | |
   | ... and [41 
more](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=footer). Last 
update 
[34e7671...ed800b7](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] abreis commented on pull request #9454: ARROW-11572: [Rust] Add a kernel for division by single scalar

2021-02-12 Thread GitBox


abreis commented on pull request #9454:
URL: https://github.com/apache/arrow/pull/9454#issuecomment-778208765


   > @abreis , I am not convinced that that is sufficient, unfortunately, 
because it excludes all types that are not Numeric (i.e. all dates and times 
for primitives, as well as all other logical types).
   > 
   > We could of course offer a `DatumTemporal` and a `DatumList` and so on 
and so forth, but IMO that is introducing more complexity, as we now start 
having `struct`s for all kinds of logical types.
   
   I agree. I just realized that `ArrowPrimitiveType` exists though, so Datum 
can be generic over that instead.
   
   Here is a more fleshed-out mock of a Datum type. The signature of 
`math_divide` enforces a consistent `T` over array/array and array/scalar 
combinations, and the method's output.
   
   Base type:
   ```rust
   #[derive(Debug)]
   pub enum Datum<'a, T>
   where
   T: ArrowPrimitiveType,
   {
        Array(&'a PrimitiveArray<T>),
        Scalar(Option<T::Native>),
   }
   ```
   
   `From` impls for both arrays and nullable scalars:
   ```rust
    impl<'a, T> From<&'a PrimitiveArray<T>> for Datum<'a, T>
    where
        T: ArrowPrimitiveType,
    {
        fn from(array: &'a PrimitiveArray<T>) -> Self {
            Datum::Array(array)
        }
    }
    
    impl<'a, T> From<Option<T::Native>> for Datum<'a, T>
    where
        T: ArrowPrimitiveType,
    {
        fn from(scalar: Option<T::Native>) -> Self {
            Datum::Scalar(scalar)
        }
    }
   ```
   
   A (user-facing) method for math division:
   ```rust
    pub fn math_divide<'a1, 'a2, T, DL, DR>(
        left: DL,
        right: DR,
    ) -> Result<PrimitiveArray<T>>
    where
        T: ArrowNumericType,
        T::Native: Div<Output = T::Native> + Zero,
        DL: Into<Datum<'a1, T>>, // left and right may have different lifetimes
        DR: Into<Datum<'a2, T>>, // but `T` must be the same
    {
        use Datum::*;
        match (left.into(), right.into()) {
            (Array(left), Array(right)) => todo!(),     // array/array
            (Array(array), Scalar(divisor)) => todo!(), // array/scalar
            _ => todo!(),
        }
    }
   ```
   
   Test code:
   ```rust
   fn test_datum_divide() {
   let array1 = Int32Array::from(vec![15, 15, 8, 1, 9]);
   let array2 = Int32Array::from(vec![5, 6, 8, 9, 1]);
   let scalar = Some(8i32);
   
   let a_over_a = math_divide(&array1, &array2).unwrap(); // works
   let a_over_s = math_divide(&array1, scalar).unwrap(); // also works
   }
   ```
   
   @jorgecarleitao I'm aware this doesn't address the second half of your 
comment. I'm just exploring this idea for the current version of arrow.







[GitHub] [arrow] abreis edited a comment on pull request #9454: ARROW-11572: [Rust] Add a kernel for division by single scalar

2021-02-12 Thread GitBox


abreis edited a comment on pull request #9454:
URL: https://github.com/apache/arrow/pull/9454#issuecomment-778208765


   > @abreis , I am not convinced that that is sufficient, unfortunately, 
because it excludes all types that are not Numeric (i.e. all dates and times 
for primitives, as well as all other logical types).
   > 
   > We could of course offer a `DatumTemporal` and a `DatumList` and so on 
and so forth, but IMO that is introducing more complexity, as we now start 
having `struct`s for all kinds of logical types.
   
   I agree. I just realized that `ArrowPrimitiveType` exists though, so Datum 
can be generic over that instead.
   
   Here is a more fleshed-out mock of a Datum type. The signature of 
`math_divide` enforces a consistent `T` over array/array and array/scalar 
combinations, and the method's output.
   
   Base type:
   ```rust
   #[derive(Debug)]
   pub enum Datum<'a, T>
   where
   T: ArrowPrimitiveType,
   {
        Array(&'a PrimitiveArray<T>),
        Scalar(Option<T::Native>),
   }
   ```
   
   `From` impls for both arrays and nullable scalars:
   ```rust
    impl<'a, T> From<&'a PrimitiveArray<T>> for Datum<'a, T>
    where
        T: ArrowPrimitiveType,
    {
        fn from(array: &'a PrimitiveArray<T>) -> Self {
            Datum::Array(array)
        }
    }
    
    impl<'a, T> From<Option<T::Native>> for Datum<'a, T>
    where
        T: ArrowPrimitiveType,
    {
        fn from(scalar: Option<T::Native>) -> Self {
            Datum::Scalar(scalar)
        }
    }
   ```
   
   A (user-facing) method for math division:
   ```rust
    pub fn math_divide<'a1, 'a2, T, DL, DR>(
        left: DL,
        right: DR,
    ) -> Result<PrimitiveArray<T>>
    where
        T: ArrowNumericType,
        T::Native: Div<Output = T::Native> + Zero,
        DL: Into<Datum<'a1, T>>, // left and right may have different lifetimes
        DR: Into<Datum<'a2, T>>, // but `T` must be the same
    {
        use Datum::*;
        match (left.into(), right.into()) {
            (Array(left), Array(right)) => todo!(),     // array/array
            (Array(array), Scalar(divisor)) => todo!(), // array/scalar
            _ => todo!(),
        }
    }
   ```
   
   Test code:
   ```rust
   fn test_datum_divide() {
   let array1 = Int32Array::from(vec![15, 15, 8, 1, 9]);
   let array2 = Int32Array::from(vec![5, 6, 8, 9, 1]);
   let scalar = Some(8i32);
   
   let a_over_a = math_divide(&array1, &array2).unwrap(); // works
   let a_over_s = math_divide(&array1, scalar).unwrap(); // also works
   }
   ```
   
   @jorgecarleitao I'm aware this doesn't address the second half of your 
comment. I'm just exploring this idea for the current version of arrow.







[GitHub] [arrow] lidavidm opened a new pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager

2021-02-12 Thread GitBox


lidavidm opened a new pull request #9480:
URL: https://github.com/apache/arrow/pull/9480


   This changes 2 things:
   - ScanTask.execute() is now eagerly evaluated, so any work involved in 
creating a record batch reader is done up front. For example, a Parquet file 
will actually begin reading (but not decoding) data. This mirrors the behavior 
in C++. This also makes it easier to separate stages of work when pipelining 
reads (by manually dispatching scan tasks).
   - This has the side effect of working around a SIGSEGV in the Cython 
generator, caused because Cython raises StopIteration without holding the GIL 
when you have a generator that uses "with [no]gil". In my tests this is fixed 
on Cython master but not any stable or development release.
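   The eager-vs-generator distinction the first bullet describes can be sketched in plain Python (illustrative only -- these are not the pyarrow names):

   ```python
   work_log = []

   def lazy_execute():
       # Nothing in this body runs until the generator is first iterated,
       # so "opening the file" is deferred to the consumer.
       work_log.append("opened file")
       yield "batch-1"
       yield "batch-2"

   def eager_execute():
       # Setup (e.g. opening the file and starting the read) happens now,
       # up front; only iteration of the results is deferred.
       work_log.append("opened file")
       return iter(["batch-1", "batch-2"])

   gen = lazy_execute()
   assert work_log == []               # generator: no work done yet

   it = eager_execute()
   assert work_log == ["opened file"]  # eager: setup already happened

   assert list(it) == ["batch-1", "batch-2"]
   ```

   Doing the setup up front is what lets callers pipeline reads by dispatching scan tasks before consuming any batches.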







[GitHub] [arrow] github-actions[bot] commented on pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9480:
URL: https://github.com/apache/arrow/pull/9480#issuecomment-778212915


   https://issues.apache.org/jira/browse/ARROW-11596







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager

2021-02-12 Thread GitBox


jorisvandenbossche commented on a change in pull request #9480:
URL: https://github.com/apache/arrow/pull/9480#discussion_r575250046



##
File path: python/pyarrow/_dataset.pyx
##
@@ -2125,12 +2125,44 @@ cdef class ScanTask(_Weakrefable):
 ---
 record_batches : iterator of RecordBatch
 """
-cdef shared_ptr[CRecordBatch] record_batch
+# Return an explicit iterator object instead of using a
+# generator so that this method is eagerly evaluated (a
+# generator would mean no work gets done until the first
+# iteration). This also works around a bug in Cython's
+# generator.
+cdef CRecordBatchIterator iterator
 with nogil:
-for maybe_batch in GetResultValue(self.task.Execute()):
-record_batch = GetResultValue(move(maybe_batch))
-with gil:
-yield pyarrow_wrap_batch(record_batch)
+iterator = move(GetResultValue(self.task.Execute()))
+with gil:
+return RecordBatchIterator.wrap(self, move(iterator))

Review comment:
    This can also be put after the nogil context (so you don't need a `with 
gil`)? (Or does that do something different in Cython?)

##
File path: python/pyarrow/_dataset.pyx
##
@@ -2125,12 +2125,44 @@ cdef class ScanTask(_Weakrefable):
 ---
 record_batches : iterator of RecordBatch
 """
-cdef shared_ptr[CRecordBatch] record_batch
+# Return an explicit iterator object instead of using a
+# generator so that this method is eagerly evaluated (a
+# generator would mean no work gets done until the first
+# iteration). This also works around a bug in Cython's
+# generator.
+cdef CRecordBatchIterator iterator
 with nogil:
-for maybe_batch in GetResultValue(self.task.Execute()):
-record_batch = GetResultValue(move(maybe_batch))
-with gil:
-yield pyarrow_wrap_batch(record_batch)
+iterator = move(GetResultValue(self.task.Execute()))
+with gil:
+return RecordBatchIterator.wrap(self, move(iterator))
+
+
+cdef class RecordBatchIterator(_Weakrefable):
+"""An iterator over a sequence of record batches."""
+cdef:
+ScanTask task
+CRecordBatchIterator iterator
+
+def __init__(self):
+_forbid_instantiation(self.__class__, subclasses_instead=False)
+
+@staticmethod
+cdef wrap(ScanTask task, CRecordBatchIterator iterator):
+cdef RecordBatchIterator self = \
+RecordBatchIterator.__new__(RecordBatchIterator)
+self.task = task
+self.iterator = move(iterator)
+return self
+
+def __iter__(self):
+return self
+
+def __next__(self):
+cdef shared_ptr[CRecordBatch] record_batch
+record_batch = GetResultValue(move(self.iterator.Next()))

Review comment:
   Does this need a nogil context?









[GitHub] [arrow] lidavidm commented on a change in pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager

2021-02-12 Thread GitBox


lidavidm commented on a change in pull request #9480:
URL: https://github.com/apache/arrow/pull/9480#discussion_r575252412



##
File path: python/pyarrow/_dataset.pyx
##
@@ -2125,12 +2125,44 @@ cdef class ScanTask(_Weakrefable):
 ---
 record_batches : iterator of RecordBatch
 """
-cdef shared_ptr[CRecordBatch] record_batch
+# Return an explicit iterator object instead of using a
+# generator so that this method is eagerly evaluated (a
+# generator would mean no work gets done until the first
+# iteration). This also works around a bug in Cython's
+# generator.
+cdef CRecordBatchIterator iterator
 with nogil:
-for maybe_batch in GetResultValue(self.task.Execute()):
-record_batch = GetResultValue(move(maybe_batch))
-with gil:
-yield pyarrow_wrap_batch(record_batch)
+iterator = move(GetResultValue(self.task.Execute()))
+with gil:
+return RecordBatchIterator.wrap(self, move(iterator))
+
+
+cdef class RecordBatchIterator(_Weakrefable):
+"""An iterator over a sequence of record batches."""
+cdef:
+ScanTask task
+CRecordBatchIterator iterator
+
+def __init__(self):
+_forbid_instantiation(self.__class__, subclasses_instead=False)
+
+@staticmethod
+cdef wrap(ScanTask task, CRecordBatchIterator iterator):
+cdef RecordBatchIterator self = \
+RecordBatchIterator.__new__(RecordBatchIterator)
+self.task = task
+self.iterator = move(iterator)
+return self
+
+def __iter__(self):
+return self
+
+def __next__(self):
+cdef shared_ptr[CRecordBatch] record_batch
+record_batch = GetResultValue(move(self.iterator.Next()))

Review comment:
   You're right on both counts, I pushed a fixed version. (The `with gil` 
was because I was mimicking the original code but that doesn't make any sense 
anymore.)
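
For context on the eager-vs-lazy distinction discussed above, here is a minimal standalone Python sketch (illustrative only, not code from this PR): a generator runs none of its body until the first `next()` call, so errors surface late, while an explicit iterator object can do its setup work, and fail, immediately:

```python
def lazy_scan(fail):
    # Generator: nothing below runs until the first next() call.
    if fail:
        raise RuntimeError("scan failed")
    yield "batch"

class EagerScan:
    """Explicit iterator: setup work (and errors) happen in __init__."""
    def __init__(self, fail):
        if fail:
            raise RuntimeError("scan failed")
        self._batches = iter(["batch"])

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._batches)

# Calling the generator never raises; the error is deferred.
gen = lazy_scan(fail=True)      # no error yet
try:
    next(gen)                   # error only surfaces here
    deferred = False
except RuntimeError:
    deferred = True

# The explicit iterator raises immediately at construction.
try:
    EagerScan(fail=True)
    eager = False
except RuntimeError:
    eager = True
```

This mirrors why `execute()` above returns a wrapper object instead of using a generator: the `Execute()` call happens eagerly rather than on first iteration.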









[GitHub] [arrow] emkornfield commented on pull request #9474: ARROW-10420: [C++] Refactor io and filesystem APIs to take an IOContext

2021-02-12 Thread GitBox


emkornfield commented on pull request #9474:
URL: https://github.com/apache/arrow/pull/9474#issuecomment-778245429


   Drive by naming nit: IoContext







[GitHub] [arrow] emkornfield edited a comment on pull request #9474: ARROW-10420: [C++] Refactor io and filesystem APIs to take an IOContext

2021-02-12 Thread GitBox


emkornfield edited a comment on pull request #9474:
URL: https://github.com/apache/arrow/pull/9474#issuecomment-778245429


naming nit: IoContext?







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9386: ARROW-11373: [Python][Docs] Add example of specifying type for a column when reading csv file

2021-02-12 Thread GitBox


jorisvandenbossche commented on a change in pull request #9386:
URL: https://github.com/apache/arrow/pull/9386#discussion_r575303498



##
File path: docs/source/python/csv.rst
##
@@ -75,7 +75,22 @@ Customized conversion
 -
 
 To alter how CSV data is converted to Arrow types and data, you should create
-a :class:`ConvertOptions` instance and pass it to :func:`read_csv`.
+a :class:`ConvertOptions` instance and pass it to :func:`read_csv`::
+   import pyarrow as pa
+   import pyarrow.csv as csv
+
+   table = csv.read_csv('tips.csv.gz', convert_options=pa.csv.ConvertOptions(
+column_types={
+'total_bill': pa.decimal128(precision=10, scale=2),
+'tip': pa.decimal128(precision=10, scale=2),
+'sex': pa.string(),
+'smoker': pa.string(),
+'day': pa.string(),
+'time': pa.string(),
+'size': pa.int64()

Review comment:
   I think `column_types` allows specifying the type for only a subset of 
the columns, so that might actually be useful to show in the example (e.g. only 
for total_bill/tip); otherwise readers might get the impression that they need 
to specify all columns.
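
The subset behavior being discussed can be sketched in plain Python (an illustrative merge of inferred and user-supplied types, not pyarrow internals):

```python
def resolve_column_types(inferred, overrides):
    """Start from the inferred types and apply user overrides for a
    subset of columns; unlisted columns keep their inferred type."""
    resolved = dict(inferred)
    resolved.update(overrides)
    return resolved

# Hypothetical inferred schema for the tips.csv example above.
inferred = {"total_bill": "float64", "tip": "float64", "size": "int64"}
overrides = {"total_bill": "decimal128(10, 2)", "tip": "decimal128(10, 2)"}

resolved = resolve_column_types(inferred, overrides)
# 'size' keeps its inferred type; only the listed columns change.
```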

##
File path: docs/source/python/csv.rst
##
@@ -75,7 +75,22 @@ Customized conversion
 -
 
 To alter how CSV data is converted to Arrow types and data, you should create
-a :class:`ConvertOptions` instance and pass it to :func:`read_csv`.
+a :class:`ConvertOptions` instance and pass it to :func:`read_csv`::
+   import pyarrow as pa

Review comment:
   Can you leave a blank line between the code block and the paragraph 
above? (reStructuredText formatting is quite strict about those details...)









[GitHub] [arrow] jorisvandenbossche commented on pull request #9466: ARROW-11379: [C++][Dataset] Better formatting for timestamp scalars

2021-02-12 Thread GitBox


jorisvandenbossche commented on pull request #9466:
URL: https://github.com/apache/arrow/pull/9466#issuecomment-778278059


   The diff shows a change in the testing submodule, is that expected?







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9470: ARROW-11480: [Python] Test filtering on INT96 timestamps

2021-02-12 Thread GitBox


jorisvandenbossche commented on a change in pull request #9470:
URL: https://github.com/apache/arrow/pull/9470#discussion_r575324170



##
File path: python/pyarrow/tests/parquet/test_dataset.py
##
@@ -352,6 +352,25 @@ def test_filters_cutoff_exclusive_datetime(tempdir, 
use_legacy_dataset):
 assert result_df['dates'].values == expected
 
 
+@pytest.mark.pandas
+def test_filters_inclusive_datetime(tempdir):
+# ARROW-11480
+path = tempdir / 'timestamps.parquet'
+
+pd.DataFrame({
+"dates": pd.date_range("2020-01-01", periods=10, freq="D"),
+"id": range(10)
+}).to_parquet(path, use_deprecated_int96_timestamps=True)
+
+import datetime

Review comment:
   Small nit: datetime is already imported at the top, so can be removed 
here









[GitHub] [arrow] bkietz commented on pull request #9466: ARROW-11379: [C++][Dataset] Better formatting for timestamp scalars

2021-02-12 Thread GitBox


bkietz commented on pull request #9466:
URL: https://github.com/apache/arrow/pull/9466#issuecomment-778279411


   No, there should be no change to `testing/`. I'll fix that.







[GitHub] [arrow] bkietz commented on a change in pull request #9470: ARROW-11480: [Python] Test filtering on INT96 timestamps

2021-02-12 Thread GitBox


bkietz commented on a change in pull request #9470:
URL: https://github.com/apache/arrow/pull/9470#discussion_r575325063



##
File path: python/pyarrow/tests/parquet/test_dataset.py
##
@@ -352,6 +352,25 @@ def test_filters_cutoff_exclusive_datetime(tempdir, 
use_legacy_dataset):
 assert result_df['dates'].values == expected
 
 
+@pytest.mark.pandas
+def test_filters_inclusive_datetime(tempdir):
+# ARROW-11480
+path = tempdir / 'timestamps.parquet'
+
+pd.DataFrame({
+"dates": pd.date_range("2020-01-01", periods=10, freq="D"),
+"id": range(10)
+}).to_parquet(path, use_deprecated_int96_timestamps=True)
+
+import datetime

Review comment:
   will do









[GitHub] [arrow] andygrove opened a new pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]

2021-02-12 Thread GitBox


andygrove opened a new pull request #9481:
URL: https://github.com/apache/arrow/pull/9481


   To make it easier to implement serde for `HashAggregateExec` we need access 
to the schema that the aggregate expressions are compiled against. For 
`Partial` aggregates this is the same as the schema of the input. However, for 
`Final` aggregates it is not, which is why we need to add this additional 
schema information.
   
   TODO: this is untested, and I need to add tests in this PR.
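
The schema mismatch described above can be sketched in plain Python (illustrative only; the dict-based "schemas" and helper name are not DataFusion types): a partial aggregate consumes the raw input schema but emits an intermediate-state schema, and that intermediate schema is what the final aggregate receives as input:

```python
def partial_count_schema(input_schema, group_col, agg_col):
    """Partial aggregate: reads input_schema, emits group key + partial state."""
    assert group_col in input_schema and agg_col in input_schema
    return {
        group_col: input_schema[group_col],
        f"COUNT({agg_col})[state]": "uint64",  # intermediate accumulator state
    }

raw = {"city": "utf8", "price": "float64"}
partial_out = partial_count_schema(raw, "city", "price")
# The Final aggregate's *input* schema is partial_out, not raw --
# which is why serde for the Final node must carry that schema explicitly.
```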







[GitHub] [arrow] github-actions[bot] commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9481:
URL: https://github.com/apache/arrow/pull/9481#issuecomment-778280588


   https://issues.apache.org/jira/browse/ARROW-11606







[GitHub] [arrow] andygrove commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]

2021-02-12 Thread GitBox


andygrove commented on pull request #9481:
URL: https://github.com/apache/arrow/pull/9481#issuecomment-778280783


   @jorgecarleitao Does this make sense?







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

2021-02-12 Thread GitBox


jorisvandenbossche commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575328856



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
 (["A", "B", "C"], [1, 2, 3]),
 ([1, 2, 3], ["A", "B", "C"]),
 (["A", "B", "C"], ["D", "E", "F"]),
 ([1, 2, 3], [4, 5, 6]),
+([1, None, 3], ["A", "B", "C"]),
+([1, 2, 3], ["A", None, "C"]),
+([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(
+tempdir, partitioning, null_fallback, partition_keys
+):
 # ARROW-9288 / ARROW-9476
 import pyarrow.parquet as pq
-table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
+
+table = pa.table({'a': range(9), 'b': [0.0] * 4 + [1.0] * 5})
+
+if None in partition_keys[0] or None in partition_keys[1]:
+# Directory partitioning can't handle the first part being null
+return

Review comment:
   only return here if `partitioning == "directory"` ?

##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])

Review comment:
   What does `null_fallback=None` mean? (based on the docstring above it 
seems it can only be a string?)

##
File path: cpp/src/arrow/dataset/partition.cc
##
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const 
Expression& expr,
   RecordBatchProjector* 
projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-if (!ref_value.second.is_scalar()) {
-  return Status::Invalid("non-scalar partition key ", 
ref_value.second.ToString());
+const auto& known_value = ref_value.second;
+if (known_value.concrete() && !known_value.datum.is_scalar()) {
+  return Status::Invalid("non-scalar partition key ", 
known_value.datum.ToString());
 }
 
 ARROW_ASSIGN_OR_RAISE(auto match,
   ref_value.first.FindOneOrNone(*projector->schema()));
 
 if (match.empty()) continue;
-RETURN_NOT_OK(projector->SetDefaultValue(match, 
ref_value.second.scalar()));
+
+const auto& field = projector->schema()->field(match[0]);
+if (known_value.concrete()) {
+  RETURN_NOT_OK(projector->SetDefaultValue(match, 
known_value.datum.scalar()));
+} else if (known_value.valid) {
+  return Status::Invalid(
+  "Partition expression not defined enough to set default value for ",

Review comment:
   What does "not defined enough" in practice mean? (or what would be an 
example?)

##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
 (["A", "B", "C"], [1, 2, 3]),
 ([1, 2, 3], ["A", "B", "C"]),
 (["A", "B", "C"], ["D", "E", "F"]),
 ([1, 2, 3], [4, 5, 6]),
+([1, None, 3], ["A", "B", "C"]),
+([1, 2, 3], ["A", None, "C"]),
+([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(

Review comment:
   you added this to a test that is specifically about reading partitioned 
datasets while inferring the partition fields as dictionary. Which is fine (as 
this case also needs to be able to handle that), but this should also work (and 
so be tested) in the default case, not inferring dictionary type? 
   And should we also have a test for the writing part? (this one only tests 
reading)









[GitHub] [arrow] lidavidm opened a new pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option

2021-02-12 Thread GitBox


lidavidm opened a new pull request #9482:
URL: https://github.com/apache/arrow/pull/9482


   This exposes the pre-buffering option that was implemented for the base 
Parquet reader in Datasets.
   
   To summarize, the option coalesces and buffers ranges of a file based on the 
columns and row groups read, issuing a single read operation for multiple read 
requests, by combining adjacent and "nearby" ranges into a single request. This 
means it handles both cases where the entire file is being read, as well as a 
subset of columns and/or row groups.
   
   There's not a benchmark for Datasets in the repo, but here's a quick 
comparison for loading data from S3:
   
   ```
   Without buffering:
   Data read: 1300.00 MiB
   Mean     : 7.68 s
   Median   : 7.48 s
   Stdev    : 0.76 s
   Mean rate: 170.57 MiB/s
   
   With buffering:
   Data read: 1300.00 MiB
   Mean     : 3.88 s
   Median   : 3.89 s
   Stdev    : 0.21 s
   Mean rate: 335.95 MiB/s
   ```
   
   The code being benchmarked is essentially:
   
   ```python
   dataset = pyarrow.dataset.FileSystemDataset.from_paths(
   paths,
   schema=schema,
   format=format,
   filesystem=fs,
   )
   table = dataset.to_table()
   ```
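
The coalescing step described above can be sketched in a few lines of Python (an illustration of the general technique; the parameter names `hole_size_limit` and `range_size_limit` are assumptions, not the exact Arrow internals):

```python
def coalesce_ranges(ranges, hole_size_limit=8192, range_size_limit=32 * 1024 * 1024):
    """Merge (offset, length) read requests whose gaps are small,
    so many small reads become one larger request."""
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_off, prev_len = merged[-1]
            gap = offset - (prev_off + prev_len)
            new_len = offset + length - prev_off
            # Absorb the hole if it is cheap to read through and the
            # combined request stays under the size limit.
            if gap <= hole_size_limit and new_len <= range_size_limit:
                merged[-1] = (prev_off, new_len)
                continue
        merged.append((offset, length))
    return merged

# Two nearby column chunks collapse into one request; a distant one stays separate.
reqs = [(0, 100), (150, 100), (1_000_000, 100)]
```

Reading through small holes wastes a few bytes but avoids per-request latency, which is why this pays off most on high-latency stores like S3.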







[GitHub] [arrow] github-actions[bot] commented on pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9482:
URL: https://github.com/apache/arrow/pull/9482#issuecomment-778300420


   https://issues.apache.org/jira/browse/ARROW-11601







[GitHub] [arrow] jorgecarleitao commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]

2021-02-12 Thread GitBox


jorgecarleitao commented on pull request #9481:
URL: https://github.com/apache/arrow/pull/9481#issuecomment-778320725


   Makes a lot of sense to me. 👍  Does it address the issue on Ballista?







[GitHub] [arrow] andygrove commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]

2021-02-12 Thread GitBox


andygrove commented on pull request #9481:
URL: https://github.com/apache/arrow/pull/9481#issuecomment-778323450


   Yes, I think so. See https://github.com/ballista-compute/ballista/pull/505 
... I am going to finish testing this over the weekend.







[GitHub] [arrow] marcprux commented on pull request #9479: ARROW-11586: [Rust][Datafusion] Remove force unwrap

2021-02-12 Thread GitBox


marcprux commented on pull request #9479:
URL: https://github.com/apache/arrow/pull/9479#issuecomment-778369173


   I've un-drafted it.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

2021-02-12 Thread GitBox


westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575451640



##
File path: cpp/src/arrow/dataset/partition.cc
##
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const 
Expression& expr,
   RecordBatchProjector* 
projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-if (!ref_value.second.is_scalar()) {
-  return Status::Invalid("non-scalar partition key ", 
ref_value.second.ToString());
+const auto& known_value = ref_value.second;
+if (known_value.concrete() && !known_value.datum.is_scalar()) {
+  return Status::Invalid("non-scalar partition key ", 
known_value.datum.ToString());
 }
 
 ARROW_ASSIGN_OR_RAISE(auto match,
   ref_value.first.FindOneOrNone(*projector->schema()));
 
 if (match.empty()) continue;
-RETURN_NOT_OK(projector->SetDefaultValue(match, 
ref_value.second.scalar()));
+
+const auto& field = projector->schema()->field(match[0]);
+if (known_value.concrete()) {
+  RETURN_NOT_OK(projector->SetDefaultValue(match, 
known_value.datum.scalar()));
+} else if (known_value.valid) {
+  return Status::Invalid(
+  "Partition expression not defined enough to set default value for ",

Review comment:
   Today we only get in this state if the expression is something like 
`pa.field("a").is_valid()` although technically this could be inferred from 
expressions like `pa.field("a") > 7` as well.  Any expression that only 
evaluates true if the field is non-null gives us some hint at least that the 
result will be non-null.
   
   Although, on second review, it's probably better to just `continue` in this 
case rather than return an error.









[GitHub] [arrow] codecov-io commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

2021-02-12 Thread GitBox


codecov-io commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778382279


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=h1) Report
   > Merging 
[#9376](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=desc) (41e8f26) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc)
 (5e3fcfa) will **decrease** coverage by `0.17%`.
   > The diff coverage is `74.50%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9376/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9376      +/-   ##
   ==========================================
   - Coverage   82.32%   82.14%   -0.18%     
   ==========================================
     Files         233      235       +2     
     Lines       54446    54582     +136     
   ==========================================
   + Hits        44823    44838      +15     
   - Misses       9623     9744     +121     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==)
 | `84.93% <ø> (ø)` | |
   | 
[rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=)
 | `93.56% <ø> (ø)` | |
   | 
[rust/arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3N1YnN0cmluZy5ycw==)
 | `98.29% <ø> (ø)` | |
   | 
[rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz)
 | `98.92% <ø> (ø)` | |
   | 
[rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=)
 | `82.82% <ø> (ø)` | |
   | 
[rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==)
 | `38.33% <0.00%> (+0.18%)` | :arrow_up: |
   | 
[rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==)
 | `80.00% <ø> (ø)` | |
   | 
[rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=)
 | `93.83% <ø> (ø)` | |
   | 
[rust/datafusion/src/logical\_plan/dfschema.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGZzY2hlbWEucnM=)
 | `85.52% <ø> (ø)` | |
   | 
[...datafusion/src/physical\_plan/expressions/binary.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2JpbmFyeS5ycw==)
 | `85.53% <ø> (-0.32%)` | :arrow_down: |
   | ... and [55 
more](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=footer). Last 
update 
[34e7671...cd79e6e](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

2021-02-12 Thread GitBox


westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575453002



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])

Review comment:
   In that case it does not pass in anything when it creates the 
partitioning, and verifies that the correct default value is used.









[GitHub] [arrow] dianaclarke commented on pull request #9272: [WIP] Benchmark placebo

2021-02-12 Thread GitBox


dianaclarke commented on pull request #9272:
URL: https://github.com/apache/arrow/pull/9272#issuecomment-778424894


   @ursabot please benchmark







[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

2021-02-12 Thread GitBox


seddonm1 commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575535244



##
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
 Ok(accumulators)
 }
 
-/// Create a key `Vec` that is used as key for the hashmap
-pub(crate) fn create_key(
-group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to

Review comment:
   Bikeshedding: `idicies` -> `indices`









[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

2021-02-12 Thread GitBox


seddonm1 commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575535374



##
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
 Ok(accumulators)
 }
 
-/// Create a key `Vec` that is used as key for the hashmap
-pub(crate) fn create_key(
-group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,

Review comment:
   `aproach` -> `approach`









[GitHub] [arrow] nealrichardson commented on pull request #9272: [WIP] Benchmark placebo

2021-02-12 Thread GitBox


nealrichardson commented on pull request #9272:
URL: https://github.com/apache/arrow/pull/9272#issuecomment-778471641


   @github-actions rebase







[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

2021-02-12 Thread GitBox


seddonm1 commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575535635



##
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
 Ok(accumulators)
 }
 
-/// Create a key `Vec` that is used as key for the hashmap
-pub(crate) fn create_key(
-group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,
+/// but it also has to to handle the case where the dictionary itself
+/// is not the same across all record batches (and thus indexes in one
+/// record batch may not correspond to the same index in another)
+fn dictionary_create_key_for_col(

Review comment:
   this makes sense and the comment really helps so :+1:
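
The hazard the comment above describes can be sketched in plain Python (illustrative only; the dict-based "batches" are not the Arrow representation): the same dictionary index can decode to different values in different record batches, so group keys must come from the looked-up values, not the raw indices:

```python
# Each "batch" carries its own dictionary plus indices into it.
batch1 = {"dict": ["red", "blue"], "indices": [0, 1, 0]}
batch2 = {"dict": ["blue", "red"], "indices": [0, 1, 0]}  # same indices, swapped dictionary

def decode(batch):
    """Look each index up in the batch's own dictionary."""
    return [batch["dict"][i] for i in batch["indices"]]

# Grouping on raw indices would wrongly merge index 0 across batches;
# decoding first gives the correct group keys.
keys1 = decode(batch1)  # ['red', 'blue', 'red']
keys2 = decode(batch2)  # ['blue', 'red', 'blue']
```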









[GitHub] [arrow] nealrichardson opened a new pull request #9483: ARROW-11610: [C++] Download boost from sourceforge instead of bintray

2021-02-12 Thread GitBox


nealrichardson opened a new pull request #9483:
URL: https://github.com/apache/arrow/pull/9483


   This also removes some bintray URLs where we had mirrored dependency 
versions for redundancy (cf. ARROW-11611); they're not active anymore anyway 
because we've bumped the dependency versions and didn't update our mirroring. 
   
   For what it's worth, our builds won't touch the updated boost on sourceforge 
because we still have our trimmed boost bundle in front of it (updating that is 
ARROW-11612), so we'll have to confirm this by other means. 







[GitHub] [arrow] seddonm1 commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

2021-02-12 Thread GitBox


seddonm1 commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-778472259


   @alamb looks good to me 👍 







[GitHub] [arrow] github-actions[bot] commented on pull request #9483: ARROW-11610: [C++] Download boost from sourceforge instead of bintray

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9483:
URL: https://github.com/apache/arrow/pull/9483#issuecomment-778472380


   https://issues.apache.org/jira/browse/ARROW-11610







[GitHub] [arrow] github-actions[bot] commented on pull request #9484: ARROW-11614: Fix round() logic to return positive zero when argument is zero

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9484:
URL: https://github.com/apache/arrow/pull/9484#issuecomment-778482095


   https://issues.apache.org/jira/browse/ARROW-11614







[GitHub] [arrow] sagnikc-dremio opened a new pull request #9484: ARROW-11614: Fix round() logic to return positive zero when argument is zero

2021-02-12 Thread GitBox


sagnikc-dremio opened a new pull request #9484:
URL: https://github.com/apache/arrow/pull/9484


   Previously, round(0.0) and round(0.0, out_scale) were returning -0.0; with 
this patch, round() returns +0.0.
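   The signed-zero issue being fixed can be illustrated with a short sketch (a 
conceptual Rust illustration, not Gandiva's C++ implementation; the 
`round_nonneg_zero` helper name is hypothetical):

   ```rust
   // IEEE 754 distinguishes +0.0 from -0.0, and rounding a small negative
   // value can produce -0.0 even though it prints as "0".
   fn round_nonneg_zero(x: f64) -> f64 {
       let r = x.round();
       // -0.0 == 0.0 compares true, so this normalizes both zeros to +0.0.
       if r == 0.0 { 0.0 } else { r }
   }

   fn main() {
       // Naive rounding yields negative zero for -0.3:
       assert_eq!((-0.3f64).round().to_bits(), (-0.0f64).to_bits());
       // The normalized version yields positive zero:
       assert_eq!(round_nonneg_zero(-0.3).to_bits(), 0.0f64.to_bits());
       println!("{}", round_nonneg_zero(-0.3));
   }
   ```

   The bit-pattern comparison (`to_bits`) is needed because `-0.0 == 0.0` is 
true under IEEE comparison, which is also why the normalization branch works.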







[GitHub] [arrow] alamb closed pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec

2021-02-12 Thread GitBox


alamb closed pull request #9481:
URL: https://github.com/apache/arrow/pull/9481


   







[GitHub] [arrow] alamb closed pull request #9479: ARROW-11586: [Rust][Datafusion] Remove force unwrap

2021-02-12 Thread GitBox


alamb closed pull request #9479:
URL: https://github.com/apache/arrow/pull/9479


   







[GitHub] [arrow] alamb commented on pull request #9479: ARROW-11586: [Rust][Datafusion] Remove force unwrap

2021-02-12 Thread GitBox


alamb commented on pull request #9479:
URL: https://github.com/apache/arrow/pull/9479#issuecomment-778491382


   Thanks @marcprux 







[GitHub] [arrow] alamb commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec

2021-02-12 Thread GitBox


alamb commented on pull request #9481:
URL: https://github.com/apache/arrow/pull/9481#issuecomment-778492158


   FWIW @jorgecarleitao  and @andygrove  -- I have been using DataFusion quite 
heavily in IOx (e.g. https://github.com/influxdata/influxdb_iox/pull/795) and 
it is doing great. DataFusion is a really cool piece of software







[GitHub] [arrow] andygrove commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec

2021-02-12 Thread GitBox


andygrove commented on pull request #9481:
URL: https://github.com/apache/arrow/pull/9481#issuecomment-778503743


   Thanks for the quick review and merge @jorgecarleitao and @alamb. I think 
distributed query execution will be working well enough in Ballista to support 
some TPC-H queries this weekend! The great news is that Ballista is now a tiny 
codebase. DataFusion is working out really well thanks to all the great work 
that went into 3.0.0 :heart: 







[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overf

2021-02-12 Thread GitBox


BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-778506839


   @WeichenXu123 we would just need to add the following API to 
`BaseVariableWidthVector` (and possibly `BaseRepeatedValueVector`)
   
   ```Java
   public int getBufferSizeFor(final int valueCount, double density)
   ```
   The argument `density` is used elsewhere, so we should stick with that to be 
consistent. Also, the method `public double getDensity()` already exists in 
`BaseVariableWidthVector`.
   
   As for it being static, it could be, but as with the related methods it 
might be a good idea not to make it static in case a subclass wants to override 
it in the future. Is it a requirement in your use case that it be static?
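   As a rough sketch of the arithmetic such a density-based estimate could use 
(written here in Rust purely for illustration — this is not the actual Arrow 
Java implementation, and it assumes i32 offsets with one extra slot plus one 
validity bit per value):

   ```rust
   // Estimate the total buffer size for a variable-width vector holding
   // `value_count` values averaging `density` data bytes each.
   fn buffer_size_for(value_count: usize, density: f64) -> usize {
       if value_count == 0 {
           return 0;
       }
       let offsets = (value_count + 1) * 4; // i32 offsets, one extra slot
       let validity = (value_count + 7) / 8; // one validity bit per value
       let data = (value_count as f64 * density).ceil() as usize; // avg bytes/value
       offsets + validity + data
   }

   fn main() {
       assert_eq!(buffer_size_for(0, 4.0), 0);
       // 8 values averaging 4 bytes each:
       // 36 offset bytes + 1 validity byte + 32 data bytes
       assert_eq!(buffer_size_for(8, 4.0), 69);
       println!("{}", buffer_size_for(8, 4.0));
   }
   ```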







[GitHub] [arrow] ursabot commented on pull request #9272: [WIP] Benchmark placebo

2021-02-12 Thread GitBox


ursabot commented on pull request #9272:
URL: https://github.com/apache/arrow/pull/9272#issuecomment-778520842


   ubuntu-20.04-x86_64: 
https://conbench.ursa.dev/compare/runs/139e0ea9-58ba-4b90-9eb7-c1e91f4daebd...223ea2f6-3e1f-4bc2-86d9-36113b073403/
   dgx-ubuntu-18.04-x86_64: 
https://conbench.ursa.dev/compare/runs/0c8b2a20-56a1-462d-aeea-ca76785728d2...064f63db-c6f6-480a-b9ae-200342921fb1/







[GitHub] [arrow] nealrichardson commented on a change in pull request #9423: ARROW-9856: [R] Add bindings for string compute functions

2021-02-12 Thread GitBox


nealrichardson commented on a change in pull request #9423:
URL: https://github.com/apache/arrow/pull/9423#discussion_r575584575



##
File path: r/tests/testthat/test-dplyr.R
##
@@ -256,6 +262,29 @@ test_that("filter() with %in%", {
   )
 })
 
+test_that("filter() with string ops", {
+  expect_dplyr_equal(
+input %>%
+  filter(dbl > 2, toupper(chr) %in% c("D", "F")) %>%
+  collect(),
+tbl
+  )

Review comment:
   I looked into this. You can't just define `as.character` in the test 
because you need to replace the one that `toupper` calls. You can't use 
`testthat::with_mock` to mock base functions anymore, and my other hack of 
using `trace` to insert a crash seems to crash too many things (presumably 
legitimate places where `as.character` gets called elsewhere in the dplyr 
code). `mockery` might allow this in a more targeted way, but I haven't had 
great success setting it up before. There are other ways we could instrument 
our code to make sure the Arrow function is called, but I'm not sure it's worth 
it. It _definitely_ won't work if you call `toupper` on a Dataset, so we get 
some related test coverage there.













[GitHub] [arrow] houqp commented on pull request #9412: ARROW-11491: [Rust] support JSON schema inference for nested list and struct

2021-02-12 Thread GitBox


houqp commented on pull request #9412:
URL: https://github.com/apache/arrow/pull/9412#issuecomment-778534603


   Sorry for the delay, I have been busy lately. Will update the PR this 
weekend to address all the feedback commented so far.







[GitHub] [arrow] seddonm1 opened a new pull request #9485: ARROW-11616: [Rust][DataFusion] Add collect_partitioned on DataFrame

2021-02-12 Thread GitBox


seddonm1 opened a new pull request #9485:
URL: https://github.com/apache/arrow/pull/9485


   The DataFrame API has a `collect` method which invokes the `collect(plan: 
Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects 
records into a single vector of RecordBatches, removing any partitioning via 
`MergeExec`.
   
   This PR adds the DataFrame `collect_partitioned` method so that partitioning 
can be maintained. This allows easy passing into a new `MemTable`.
   
   @andygrove 
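   Conceptually, the difference is only whether the per-partition structure is 
flattened; a minimal stand-alone sketch (plain Rust, with strings standing in 
for `RecordBatch`es, not the DataFusion API itself):

   ```rust
   fn main() {
       // Hypothetical partitioned output: two partitions of "record batches".
       let partitioned: Vec<Vec<&str>> = vec![vec!["batch0"], vec!["batch1", "batch2"]];

       // collect() conceptually merges the partitions (like MergeExec does),
       // producing a single flat vector of batches:
       let merged: Vec<&str> = partitioned.iter().flatten().copied().collect();
       assert_eq!(merged.len(), 3);

       // collect_partitioned() would instead preserve the partition structure,
       // which is what makes it easy to feed into a partitioned MemTable:
       assert_eq!(partitioned.len(), 2);
       println!("{} partitions, {} batches total", partitioned.len(), merged.len());
   }
   ```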







[GitHub] [arrow] github-actions[bot] commented on pull request #9485: ARROW-11616: [Rust][DataFusion] Add collect_partitioned on DataFrame

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9485:
URL: https://github.com/apache/arrow/pull/9485#issuecomment-778550493


   https://issues.apache.org/jira/browse/ARROW-11616







[GitHub] [arrow] codecov-io commented on pull request #9485: ARROW-11616: [Rust][DataFusion] Add collect_partitioned on DataFrame

2021-02-12 Thread GitBox


codecov-io commented on pull request #9485:
URL: https://github.com/apache/arrow/pull/9485#issuecomment-778552128


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=h1) Report
   > Merging 
[#9485](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=desc) (5ed4321) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/7660a22090fcf1d0230ca1e700a4b98f647b0c48?el=desc)
 (7660a22) will **decrease** coverage by `0.00%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9485/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9485      +/-   ##
   ==========================================
   - Coverage   82.31%   82.30%   -0.01%
   ==========================================
     Files         234      234
     Lines       54482    54488       +6
   ==========================================
     Hits        44849    44849
   - Misses       9633     9639       +6
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9485/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=)
 | `90.13% <0.00%> (-3.71%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=footer). Last 
update 
[7660a22...5ed4321](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4

2021-02-12 Thread GitBox


emkornfield commented on pull request #8949:
URL: https://github.com/apache/arrow/pull/8949#issuecomment-778562980


   > > before we merge (once we verify it is working I can take a final look)
   > 
   > Sure. I will do some tests for that.
   
   To run tests it should be sufficient to unskip the Java implementation in 
archery.







[GitHub] [arrow] projjal opened a new pull request #9486: ARROW-11617: [C++][Gandiva] Fix nested if-else optimisation in gandiva

2021-02-12 Thread GitBox


projjal opened a new pull request #9486:
URL: https://github.com/apache/arrow/pull/9486


   In Gandiva, when we have nested if-else statements we reuse the local 
bitmap and treat them as a single logical if - elseif - ... - else condition. 
However, when we have, say, another function between them, like:
   
   ```
   IF
   THEN
   ELSE
     function(
       IF
       THEN
       ELSE
     )
   ```
   
   in such cases we currently do the same thing, which can lead to incorrect 
results.
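   A toy model of the hazard (plain Rust, not Gandiva's LLVM codegen): if the 
nested if/else inside the function argument reuses the outer chain's local 
bitmap, the inner evaluation overwrites the outer chain's bookkeeping:

   ```rust
   // Toy model: a local bitmap records, per row, which rows already took a
   // then-branch in the current if/elseif chain. Logically independent nested
   // if/else expressions must each get their own bitmap.
   fn main() {
       let rows = 4;

       // Outer IF: rows 0 and 2 take the then-branch.
       let mut outer_bitmap = vec![false; rows];
       outer_bitmap[0] = true;
       outer_bitmap[2] = true;

       // Buggy sharing: the inner IF (inside function(...)) writes into the
       // same bitmap, so the outer chain now wrongly believes row 1 was
       // handled by one of its own branches.
       let mut shared = outer_bitmap.clone();
       shared[1] = true; // inner IF's then-branch for row 1
       assert_ne!(shared, outer_bitmap); // outer state was clobbered

       // Fix: give the nested if/else its own local bitmap.
       let mut inner_bitmap = vec![false; rows];
       inner_bitmap[1] = true;
       assert!(outer_bitmap[0] && !outer_bitmap[1]); // outer state intact
       println!("outer: {:?}, inner: {:?}", outer_bitmap, inner_bitmap);
   }
   ```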







[GitHub] [arrow] github-actions[bot] commented on pull request #9486: ARROW-11617: [C++][Gandiva] Fix nested if-else optimisation in gandiva

2021-02-12 Thread GitBox


github-actions[bot] commented on pull request #9486:
URL: https://github.com/apache/arrow/pull/9486#issuecomment-778565617


   https://issues.apache.org/jira/browse/ARROW-11617







[GitHub] [arrow] nevi-me commented on a change in pull request #9425: ARROW-11504: [Rust] Added checks to List DataType.

2021-02-12 Thread GitBox


nevi-me commented on a change in pull request #9425:
URL: https://github.com/apache/arrow/pull/9425#discussion_r575625396



##
File path: rust/arrow/src/array/array_list.rs
##
@@ -31,12 +31,19 @@ use crate::datatypes::*;
 
 /// trait declaring an offset size, relevant for i32 vs i64 array types.
 pub trait OffsetSizeTrait: ArrowNativeType + Num + Ord + std::ops::AddAssign {
+fn is_large() -> bool;
+
 fn prefix() -> &'static str;
 
 fn to_isize(&self) -> isize;
 }
 
 impl OffsetSizeTrait for i32 {
+#[inline]
+fn is_large() -> bool {
+false
+}
+
 fn prefix() -> &'static str {

Review comment:
   We might still need it, we also use it for formatting in `Display`
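   The shape of the change under discussion can be sketched in isolation (a 
simplified stand-in for arrow's `OffsetSizeTrait`, keeping `prefix()` for 
`Display`-style formatting; `type_name` is a hypothetical helper):

   ```rust
   // Simplified offset-size trait: is_large() discriminates i32 vs i64
   // offsets, while prefix() is still used to build display names.
   trait OffsetSizeTrait {
       fn is_large() -> bool;
       fn prefix() -> &'static str;
   }

   impl OffsetSizeTrait for i32 {
       fn is_large() -> bool { false }
       fn prefix() -> &'static str { "" }
   }

   impl OffsetSizeTrait for i64 {
       fn is_large() -> bool { true }
       fn prefix() -> &'static str { "Large" }
   }

   // Example of why prefix() remains useful: formatting a type name.
   fn type_name<T: OffsetSizeTrait>() -> String {
       format!("{}List", T::prefix())
   }

   fn main() {
       assert!(!<i32 as OffsetSizeTrait>::is_large());
       assert!(<i64 as OffsetSizeTrait>::is_large());
       assert_eq!(type_name::<i32>(), "List");
       assert_eq!(type_name::<i64>(), "LargeList");
       println!("{}", type_name::<i64>());
   }
   ```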









[GitHub] [arrow] codecov-io commented on pull request #9469: ARROW-11599: [Rust] Add function to create array with all nulls

2021-02-12 Thread GitBox


codecov-io commented on pull request #9469:
URL: https://github.com/apache/arrow/pull/9469#issuecomment-778569205


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=h1) Report
   > Merging 
[#9469](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=desc) (8c45e47) 
into 
[master](https://codecov.io/gh/apache/arrow/commit/7660a22090fcf1d0230ca1e700a4b98f647b0c48?el=desc)
 (7660a22) will **decrease** coverage by `0.03%`.
   > The diff coverage is `56.86%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow/pull/9469/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9469      +/-   ##
   ==========================================
   - Coverage   82.31%   82.28%   -0.04%
   ==========================================
     Files         234      234
     Lines       54482    54594     +112
   ==========================================
   + Hits        44849    44920      +71
   - Misses       9633     9674      +41
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=tree) | Coverage 
Δ | |
   |---|---|---|
   | 
[rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=)
 | `56.73% <15.00%> (+1.04%)` | :arrow_up: |
   | 
[rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=)
 | `75.50% <62.99%> (-13.03%)` | :arrow_down: |
   | 
[rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==)
 | `87.50% <66.66%> (-5.10%)` | :arrow_down: |
   | 
[rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=)
 | `88.05% <66.66%> (-0.06%)` | :arrow_down: |
   | 
[rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=)
 | `76.47% <0.00%> (+0.98%)` | :arrow_up: |
   | 
[rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==)
 | `84.21% <0.00%> (+5.26%)` | :arrow_up: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=footer). Last 
update 
[7660a22...8c45e47](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [arrow] projjal closed pull request #9486: ARROW-11617: [C++][Gandiva] Fix nested if-else optimisation in gandiva

2021-02-12 Thread GitBox


projjal closed pull request #9486:
URL: https://github.com/apache/arrow/pull/9486


   


