[GitHub] [arrow] ovr commented on pull request #9449: ARROW-11563: [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))
ovr commented on pull request #9449: URL: https://github.com/apache/arrow/pull/9449#issuecomment-778066965 Just a note: rebased. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on a change in pull request #8949: URL: https://github.com/apache/arrow/pull/8949#discussion_r575091316
## File path: java/vector/pom.xml ##
@@ -74,6 +74,11 @@ org.slf4j slf4j-api +
Review comment: @emkornfield Sounds reasonable. I will try to revise the PR accordingly. Thanks for your good suggestion.
[GitHub] [arrow] liyafan82 commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
liyafan82 commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-778084083
> @liyafan82 could you enable the java integration test to confirm that reading the files generated by C++ works before we merge (once we verify it is working I can take a final look)

Sure. I will do some tests for that.
[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.
alamb commented on pull request #9376: URL: https://github.com/apache/arrow/pull/9376#issuecomment-778126649 @seddonm1 -- what do you think about merge order of this PR and #9243 ? (which will conflict)
[GitHub] [arrow] alamb commented on pull request #9449: ARROW-11563: [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))
alamb commented on pull request #9449: URL: https://github.com/apache/arrow/pull/9449#issuecomment-778131728 Thanks again @ovr !
[GitHub] [arrow] alamb closed pull request #9449: ARROW-11563: [Rust] Support Cast(Utf8, TimeStamp(Nanoseconds, None))
alamb closed pull request #9449: URL: https://github.com/apache/arrow/pull/9449
[GitHub] [arrow] seddonm1 commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.
seddonm1 commented on pull request #9376: URL: https://github.com/apache/arrow/pull/9376#issuecomment-778133305 Unfortunately (for me) this logically does go first as being able to identify ScalarValue would give a huge performance advantage. I am happy to rework the other one after this is merged.
[GitHub] [arrow] alamb commented on pull request #9402: ARROW-11481: [Rust] More cast implementations
alamb commented on pull request #9402: URL: https://github.com/apache/arrow/pull/9402#issuecomment-778133435 @jorgecarleitao do you think this one is ready to go?
[GitHub] [arrow] alamb closed pull request #9445: ARROW-11557: [Rust][Datafusion] Add deregister_table
alamb closed pull request #9445: URL: https://github.com/apache/arrow/pull/9445
[GitHub] [arrow] alamb commented on pull request #9445: ARROW-11557: [Rust][Datafusion] Add deregister_table
alamb commented on pull request #9445: URL: https://github.com/apache/arrow/pull/9445#issuecomment-778134379 Thanks @marcprux !
[GitHub] [arrow] alamb commented on pull request #9412: ARROW-11491: [Rust] support JSON schema inference for nested list and struct
alamb commented on pull request #9412: URL: https://github.com/apache/arrow/pull/9412#issuecomment-778134781
@houqp -- I think this one is ready to go other than
> The only question I think should be answered / explained before this is merged is why the checked in file test/data/mixed_arrays.json ends up with one fewer column in the inferred schema after this PR.
[GitHub] [arrow] alamb commented on pull request #9353: ARROW-11420: [Rust] Added support to length of Binary and List.
alamb commented on pull request #9353: URL: https://github.com/apache/arrow/pull/9353#issuecomment-778135054 @jorgecarleitao -- this one needs a rebase and then I think it is ready to go
[GitHub] [arrow] aitor94 commented on pull request #8491: ARROW-10349: [Python] linux aarch64 wheels
aitor94 commented on pull request #8491: URL: https://github.com/apache/arrow/pull/8491#issuecomment-778135205 I'm trying to install pyarrow on an m6g instance on AWS and I can't. This PR may help to solve my issue: [Stack Overflow question](https://stackoverflow.com/questions/64928357/how-to-install-pyarrow-in-amazon-linux-2-m6g-instance?noredirect=1#comment116973612_64928357). Thanks
[GitHub] [arrow] alamb commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns
alamb commented on pull request #9233: URL: https://github.com/apache/arrow/pull/9233#issuecomment-778135913 This one is now ready for review. cc @andygrove @seddonm1 @jhorstmann and @Dandandan
[GitHub] [arrow] alamb commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns
alamb commented on a change in pull request #9233: URL: https://github.com/apache/arrow/pull/9233#discussion_r575154628
## File path: rust/datafusion/src/physical_plan/hash_aggregate.rs ##

```diff
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }

-/// Create a key `Vec<u8>` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary indices themselves (no need to
+/// look up values) or possibly build the hash table entirely on the
+/// dictionary indexes.
+///
+/// This approach would likely work (very) well for the common case,
+/// but it also has to handle the case where the dictionary itself
+/// is not the same across all record batches (and thus indexes in one
+/// record batch may not correspond to the same index in another)
+fn dictionary_create_key_for_col<K: ArrowDictionaryKeyType>(
+    col: &ArrayRef,
     row: usize,
     vec: &mut Vec<u8>,
 ) -> Result<()> {
-    vec.clear();
-    for col in group_by_keys {
-        match col.data_type() {
-            DataType::Boolean => {
-                let array = col.as_any().downcast_ref::<BooleanArray>().unwrap();
-                vec.extend_from_slice(&[array.value(row) as u8]);
-            }
-            DataType::Float32 => {
-                let array = col.as_any().downcast_ref::<Float32Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::Float64 => {
-                let array = col.as_any().downcast_ref::<Float64Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt8 => {
-                let array = col.as_any().downcast_ref::<UInt8Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt16 => {
-                let array = col.as_any().downcast_ref::<UInt16Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt32 => {
-                let array = col.as_any().downcast_ref::<UInt32Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt64 => {
-                let array = col.as_any().downcast_ref::<UInt64Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
+    let dict_col = col.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
+
+    // look up the index in the values dictionary
+    let keys_col = dict_col.keys_array();
+    let values_index = keys_col.value(row).to_usize().ok_or_else(|| {
+        DataFusionError::Internal(format!(
+            "Can not convert index to usize in dictionary of type creating group by value {:?}",
+            keys_col.data_type()
+        ))
+    })?;
+
+    create_key_for_col(&dict_col.values(), values_index, vec)
+}
+
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map
+fn create_key_for_col(col: &ArrayRef, row: usize, vec: &mut Vec<u8>) -> Result<()> {
```

Review comment: This PR looks larger than it really is -- all it does is lift the per-column treatment of arrays into its own function (so it can be called recursively) and then adds handling for dictionary support. So while GitHub renders the diff as a large change, the logic change is very small.
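The key idea in the comment above — serialize each row's group-by values into a byte key, and resolve dictionary indices to their values first so keys stay comparable across batches with different dictionaries — can be sketched outside of Rust. This toy Python sketch uses illustrative names and byte layout, not DataFusion's actual encoding:

```python
def create_key(row_values):
    """Build a byte key for one row, mimicking the hash-aggregate key scheme."""
    key = bytearray()
    for value in row_values:
        if isinstance(value, bool):        # bool before int: bool is an int subclass
            key += bytes([value])
        elif isinstance(value, int):
            key += value.to_bytes(8, "little", signed=True)
        elif isinstance(value, str):
            key += value.encode("utf-8")
        else:
            raise TypeError(f"unsupported group-by type: {type(value)}")
    return bytes(key)


def resolve_dictionary(indices, values):
    """Resolve a dictionary-encoded column to its values before keying, so
    batches whose dictionaries differ still produce comparable keys."""
    return [values[i] for i in indices]


# Two logical rows with the same value "a" hash to the same key,
# regardless of which dictionary index encoded them.
col = resolve_dictionary([0, 1, 0], ["a", "b"])
keys = [create_key([v]) for v in col]
assert keys[0] == keys[2]
```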
[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.
alamb commented on pull request #9376: URL: https://github.com/apache/arrow/pull/9376#issuecomment-778137269 Sounds good -- @jorgecarleitao let's get it merged! It looks like it needs another rebase and then we'll get it in
[GitHub] [arrow] nevi-me commented on a change in pull request #9353: ARROW-11420: [Rust] Added support to length of Binary and List.
nevi-me commented on a change in pull request #9353: URL: https://github.com/apache/arrow/pull/9353#discussion_r575168065
## File path: rust/arrow/src/compute/kernels/length.rs ##

```diff
@@ -24,42 +24,77 @@ use crate::{
 };
 use std::sync::Arc;

-fn length_string<OffsetSize>(array: &Array, data_type: DataType) -> Result<ArrayRef>
-where
-    OffsetSize: OffsetSizeTrait,
-{
-    // note: offsets are stored as u8, but they can be interpreted as OffsetSize
-    let offsets = &array.data_ref().buffers()[0];
-    // this is a 30% improvement over iterating over u8s and building OffsetSize, which
-    // justifies the usage of `unsafe`.
-    let slice: &[OffsetSize] =
-        &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..];
+fn clone_null_buffer(array: &impl Array) -> Option<Buffer> {
+    array
+        .data_ref()
+        .null_bitmap()
```

Review comment: `.null_buffer().cloned()` ?
[GitHub] [arrow] lidavidm closed pull request #9433: ARROW-11539: [Developer][Archery] Change items_per_seconds units
lidavidm closed pull request #9433: URL: https://github.com/apache/arrow/pull/9433
[GitHub] [arrow] marcprux opened a new pull request #9479: ARROW-11586: [Rust][Datafusion][WIP] Remove force unwrap
marcprux opened a new pull request #9479: URL: https://github.com/apache/arrow/pull/9479 Fix for https://issues.apache.org/jira/browse/ARROW-11586
[GitHub] [arrow] github-actions[bot] commented on pull request #9479: ARROW-11586: [Rust][Datafusion][WIP] Remove force unwrap
github-actions[bot] commented on pull request #9479: URL: https://github.com/apache/arrow/pull/9479#issuecomment-778196690 https://issues.apache.org/jira/browse/ARROW-11586
[GitHub] [arrow] codecov-io commented on pull request #9479: ARROW-11586: [Rust][Datafusion][WIP] Remove force unwrap
codecov-io commented on pull request #9479: URL: https://github.com/apache/arrow/pull/9479#issuecomment-778207974

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=h1) Report
> Merging [#9479](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=desc) (ed800b7) into [master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc) (5e3fcfa) will **decrease** coverage by `0.01%`.
> The diff coverage is `88.57%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9479/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=tree)

```diff
@@            Coverage Diff             @@
##           master    #9479      +/-   ##
==========================================
- Coverage   82.32%   82.31%   -0.02%
==========================================
  Files         233      234       +1
  Lines       54446    54461      +15
==========================================
+ Hits        44823    44828       +5
- Misses       9623     9633      +10
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `84.93% <ø> (ø)` | |
| [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <ø> (ø)` | |
| [rust/arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3N1YnN0cmluZy5ycw==) | `98.29% <ø> (ø)` | |
| [rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz) | `98.92% <ø> (ø)` | |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.82% <ø> (ø)` | |
| [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `38.33% <0.00%> (+0.18%)` | :arrow_up: |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <ø> (ø)` | |
| [rust/datafusion/src/logical\_plan/dfschema.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGZzY2hlbWEucnM=) | `85.52% <ø> (ø)` | |
| [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz) | `93.28% <ø> (+0.24%)` | :arrow_up: |
| [...atafusion/src/physical\_plan/expressions/in\_list.rs](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2luX2xpc3QucnM=) | `75.37% <ø> (ø)` | |
| ... and [41 more](https://codecov.io/gh/apache/arrow/pull/9479/diff?src=pr&el=tree-more) | | |

--

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=footer). Last update [34e7671...ed800b7](https://codecov.io/gh/apache/arrow/pull/9479?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] abreis commented on pull request #9454: ARROW-11572: [Rust] Add a kernel for division by single scalar
abreis commented on pull request #9454: URL: https://github.com/apache/arrow/pull/9454#issuecomment-778208765
> @abreis , I am not convinced that that is sufficient, unfortunately, because it excludes all types that are not Numeric (i.e. all dates and times for primitives, as well as all other logical types).
>
> We could of course offer a `DatumTemporal` and a `DatumList` and so on and so forth, but IMO that is introducing more complexity, as we now start having `struct`s for all kinds of logical types.

I agree. I just realized that `ArrowPrimitiveType` exists though, so Datum can be generic over that instead.

Here is a more fleshed-out mock of a Datum type. The signature of `datum_math_divide` enforces a consistent `T` over array/array and array/scalar combinations, and the method's output.

Base type:
```rust
#[derive(Debug)]
pub enum Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    Array(&'a PrimitiveArray<T>),
    Scalar(Option<T::Native>),
}
```

`From` impls for both arrays and nullable scalars:
```rust
impl<'a, T> From<&'a PrimitiveArray<T>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(array: &'a PrimitiveArray<T>) -> Self {
        Datum::Array(array)
    }
}

impl<'a, T> From<Option<T::Native>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(scalar: Option<T::Native>) -> Self {
        Datum::Scalar(scalar)
    }
}
```

A (user-facing) method for math division:
```rust
pub fn math_divide<'a1, 'a2, T, DL, DR>(
    left: DL,
    right: DR,
) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
    DL: Into<Datum<'a1, T>>, // left and right may have different lifetimes
    DR: Into<Datum<'a2, T>>, // but `T` must be the same
{
    use Datum::*;
    match (left.into(), right.into()) {
        (Array(left), Array(right)) => todo!(),     // array/array
        (Array(array), Scalar(divisor)) => todo!(), // array/scalar
        _ => todo!(),
    }
}
```

Test code:
```rust
fn test_datum_divide() {
    let array1 = Int32Array::from(vec![15, 15, 8, 1, 9]);
    let array2 = Int32Array::from(vec![5, 6, 8, 9, 1]);
    let scalar = Some(8i32);

    let a_over_a = math_divide(&array1, &array2).unwrap(); // works
    let a_over_s = math_divide(&array1, scalar).unwrap(); // also works
}
```

@jorgecarleitao I'm aware this doesn't address the second half of your comment. I'm just exploring this idea for the current version of arrow.
[GitHub] [arrow] abreis edited a comment on pull request #9454: ARROW-11572: [Rust] Add a kernel for division by single scalar
abreis edited a comment on pull request #9454: URL: https://github.com/apache/arrow/pull/9454#issuecomment-778208765
> @abreis , I am not convinced that that is sufficient, unfortunately, because it excludes all types that are not Numeric (i.e. all dates and times for primitives, as well as all other logical types).
>
> We could of course offer a `DatumTemporal` and a `DatumList` and so on and so forth, but IMO that is introducing more complexity, as we now start having `struct`s for all kinds of logical types.

I agree. I just realized that `ArrowPrimitiveType` exists though, so Datum can be generic over that instead.

Here is a more fleshed-out mock of a Datum type. The signature of `math_divide` enforces a consistent `T` over array/array and array/scalar combinations, and the method's output.

Base type:
```rust
#[derive(Debug)]
pub enum Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    Array(&'a PrimitiveArray<T>),
    Scalar(Option<T::Native>),
}
```

`From` impls for both arrays and nullable scalars:
```rust
impl<'a, T> From<&'a PrimitiveArray<T>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(array: &'a PrimitiveArray<T>) -> Self {
        Datum::Array(array)
    }
}

impl<'a, T> From<Option<T::Native>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(scalar: Option<T::Native>) -> Self {
        Datum::Scalar(scalar)
    }
}
```

A (user-facing) method for math division:
```rust
pub fn math_divide<'a1, 'a2, T, DL, DR>(
    left: DL,
    right: DR,
) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
    DL: Into<Datum<'a1, T>>, // left and right may have different lifetimes
    DR: Into<Datum<'a2, T>>, // but `T` must be the same
{
    use Datum::*;
    match (left.into(), right.into()) {
        (Array(left), Array(right)) => todo!(),     // array/array
        (Array(array), Scalar(divisor)) => todo!(), // array/scalar
        _ => todo!(),
    }
}
```

Test code:
```rust
fn test_datum_divide() {
    let array1 = Int32Array::from(vec![15, 15, 8, 1, 9]);
    let array2 = Int32Array::from(vec![5, 6, 8, 9, 1]);
    let scalar = Some(8i32);

    let a_over_a = math_divide(&array1, &array2).unwrap(); // works
    let a_over_s = math_divide(&array1, scalar).unwrap(); // also works
}
```

@jorgecarleitao I'm aware this doesn't address the second half of your comment. I'm just exploring this idea for the current version of arrow.
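The dispatch that the Datum enum buys — one user-facing division function that accepts either an array or a scalar divisor — can be cross-checked with a plain-Python analogue (list-based, hypothetical names; Arrow's null semantics are only approximated here, with `None` standing in for null):

```python
def math_divide(left, right):
    """Divide `left` (a list of ints) by `right`, which may be a list or a scalar.

    Mirrors the Datum::Array / Datum::Scalar match: one entry point,
    two code paths. Division by zero yields None (a null).
    """
    if isinstance(right, list):  # array / array
        if len(left) != len(right):
            raise ValueError("length mismatch")
        return [None if b == 0 else a // b for a, b in zip(left, right)]
    # array / scalar: broadcast one divisor over the whole array
    if right is None or right == 0:
        return [None] * len(left)
    return [a // right for a in left]


print(math_divide([15, 15, 8, 1, 9], [5, 6, 8, 9, 1]))  # [3, 2, 1, 0, 9]
print(math_divide([15, 15, 8, 1, 9], 8))                # [1, 1, 1, 0, 1]
```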
[GitHub] [arrow] lidavidm opened a new pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager
lidavidm opened a new pull request #9480: URL: https://github.com/apache/arrow/pull/9480

This changes two things:
- ScanTask.execute() is now eagerly evaluated, so any work involved in creating a record batch reader is done up front. For example, a Parquet file will actually begin reading (but not decoding) data. This mirrors the behavior in C++. This also makes it easier to separate stages of work when pipelining reads (by manually dispatching scan tasks).
- This has the side effect of working around a SIGSEGV in the Cython generator, caused because Cython raises StopIteration without holding the GIL when you have a generator that uses "with [no]gil". In my tests this is fixed on Cython master but not in any stable or development release.
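The eager-vs-lazy distinction in the first bullet is plain Python semantics: a generator body does not run until the first `next()`, while a function that returns an explicit iterator does its setup at the call site. A minimal illustration (nothing pyarrow-specific):

```python
def lazy_execute(log):
    """Generator version: the body, including setup, is deferred."""
    log.append("setup")  # not reached until the first iteration
    yield "batch"


def eager_execute(log):
    """Eager version: setup runs immediately; only iteration is deferred."""
    log.append("setup")  # runs at call time
    return iter(["batch"])


log = []
gen = lazy_execute(log)
assert log == []            # generator body has not started
next(gen)
assert log == ["setup"]     # setup only happened on first next()

log = []
it = eager_execute(log)
assert log == ["setup"]     # setup happened at the call itself
assert next(it) == "batch"
```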
[GitHub] [arrow] github-actions[bot] commented on pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager
github-actions[bot] commented on pull request #9480: URL: https://github.com/apache/arrow/pull/9480#issuecomment-778212915 https://issues.apache.org/jira/browse/ARROW-11596
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager
jorisvandenbossche commented on a change in pull request #9480: URL: https://github.com/apache/arrow/pull/9480#discussion_r575250046
## File path: python/pyarrow/_dataset.pyx ##

```diff
@@ -2125,12 +2125,44 @@ cdef class ScanTask(_Weakrefable):
         ---
         record_batches : iterator of RecordBatch
         """
-        cdef shared_ptr[CRecordBatch] record_batch
+        # Return an explicit iterator object instead of using a
+        # generator so that this method is eagerly evaluated (a
+        # generator would mean no work gets done until the first
+        # iteration). This also works around a bug in Cython's
+        # generator.
+        cdef CRecordBatchIterator iterator
         with nogil:
-            for maybe_batch in GetResultValue(self.task.Execute()):
-                record_batch = GetResultValue(move(maybe_batch))
-                with gil:
-                    yield pyarrow_wrap_batch(record_batch)
+            iterator = move(GetResultValue(self.task.Execute()))
+        with gil:
+            return RecordBatchIterator.wrap(self, move(iterator))
```

Review comment: This can also be put after the nogil context (so you don't need a `with gil`)? (or does that do something different in cython?)

## File path: python/pyarrow/_dataset.pyx ##

```diff
@@ -2125,12 +2125,44 @@ cdef class ScanTask(_Weakrefable):
+cdef class RecordBatchIterator(_Weakrefable):
+    """An iterator over a sequence of record batches."""
+    cdef:
+        ScanTask task
+        CRecordBatchIterator iterator
+
+    def __init__(self):
+        _forbid_instantiation(self.__class__, subclasses_instead=False)
+
+    @staticmethod
+    cdef wrap(ScanTask task, CRecordBatchIterator iterator):
+        cdef RecordBatchIterator self = \
+            RecordBatchIterator.__new__(RecordBatchIterator)
+        self.task = task
+        self.iterator = move(iterator)
+        return self
+
+    def __iter__(self):
+        return self
+
+    def __next__(self):
+        cdef shared_ptr[CRecordBatch] record_batch
+        record_batch = GetResultValue(move(self.iterator.Next()))
```

Review comment: Does this need a nogil context?
[GitHub] [arrow] lidavidm commented on a change in pull request #9480: ARROW-11596: [Python][Dataset] make ScanTask.execute() eager
lidavidm commented on a change in pull request #9480: URL: https://github.com/apache/arrow/pull/9480#discussion_r575252412
## File path: python/pyarrow/_dataset.pyx ##

Review comment: You're right on both counts, I pushed a fixed version. (The `with gil` was because I was mimicking the original code but that doesn't make any sense anymore.)
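The wrapper pattern under review — an object with `__iter__`/`__next__` that keeps a reference to its owner so the underlying resource stays alive while iteration is in flight — looks like this in plain Python (a sketch, not the Cython class itself):

```python
class RecordBatchIterator:
    """Plain-Python analogue of the Cython wrapper above: holds a
    reference to its owning task and adapts an inner sequence to the
    iterator protocol."""

    def __init__(self, task, batches):
        self.task = task              # kept so the scan task outlives the iterator
        self._inner = iter(batches)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration propagates from the inner iterator when exhausted
        return next(self._inner)


it = RecordBatchIterator("task", ["batch-1", "batch-2"])
assert it.task == "task"
assert list(it) == ["batch-1", "batch-2"]
```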
[GitHub] [arrow] emkornfield commented on pull request #9474: ARROW-10420: [C++] Refactor io and filesystem APIs to take an IOContext
emkornfield commented on pull request #9474: URL: https://github.com/apache/arrow/pull/9474#issuecomment-778245429 Drive by naming nit: IoContext
[GitHub] [arrow] emkornfield edited a comment on pull request #9474: ARROW-10420: [C++] Refactor io and filesystem APIs to take an IOContext
emkornfield edited a comment on pull request #9474: URL: https://github.com/apache/arrow/pull/9474#issuecomment-778245429 naming nit: IoContext?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9386: ARROW-11373: [Python][Docs] Add example of specifying type for a column when reading csv file
jorisvandenbossche commented on a change in pull request #9386: URL: https://github.com/apache/arrow/pull/9386#discussion_r575303498

## File path: docs/source/python/csv.rst

```diff
@@ -75,7 +75,22 @@ Customized conversion
 
 To alter how CSV data is converted to Arrow types and data, you should create
-a :class:`ConvertOptions` instance and pass it to :func:`read_csv`.
+a :class:`ConvertOptions` instance and pass it to :func:`read_csv`::
+
+    import pyarrow as pa
+    import pyarrow.csv as csv
+
+    table = csv.read_csv('tips.csv.gz', convert_options=pa.csv.ConvertOptions(
+        column_types={
+            'total_bill': pa.decimal128(precision=10, scale=2),
+            'tip': pa.decimal128(precision=10, scale=2),
+            'sex': pa.string(),
+            'smoker': pa.string(),
+            'day': pa.string(),
+            'time': pa.string(),
+            'size': pa.int64()
```

Review comment: I think `column_types` allows specifying the type for only a subset of the columns. So it might actually be useful to show that in the example (e.g. only for total_bill/tip), otherwise readers might get the impression that they need to specify all columns.

## File path: docs/source/python/csv.rst

```diff
-a :class:`ConvertOptions` instance and pass it to :func:`read_csv`.
+a :class:`ConvertOptions` instance and pass it to :func:`read_csv`::
+
+    import pyarrow as pa
```

Review comment: Can you leave a blank line between the code block and the paragraph above? (restructuredtext formatting is quite strict about those details ..)
[GitHub] [arrow] jorisvandenbossche commented on pull request #9466: ARROW-11379: [C++][Dataset] Better formatting for timestamp scalars
jorisvandenbossche commented on pull request #9466: URL: https://github.com/apache/arrow/pull/9466#issuecomment-778278059 The diff shows a change in the testing submodule, is that expected?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9470: ARROW-11480: [Python] Test filtering on INT96 timestamps
jorisvandenbossche commented on a change in pull request #9470: URL: https://github.com/apache/arrow/pull/9470#discussion_r575324170

## File path: python/pyarrow/tests/parquet/test_dataset.py

```diff
@@ -352,6 +352,25 @@ def test_filters_cutoff_exclusive_datetime(tempdir, use_legacy_dataset):
     assert result_df['dates'].values == expected
 
 
+@pytest.mark.pandas
+def test_filters_inclusive_datetime(tempdir):
+    # ARROW-11480
+    path = tempdir / 'timestamps.parquet'
+
+    pd.DataFrame({
+        "dates": pd.date_range("2020-01-01", periods=10, freq="D"),
+        "id": range(10)
+    }).to_parquet(path, use_deprecated_int96_timestamps=True)
+
+    import datetime
```

Review comment: Small nit: datetime is already imported at the top, so it can be removed here
[GitHub] [arrow] bkietz commented on pull request #9466: ARROW-11379: [C++][Dataset] Better formatting for timestamp scalars
bkietz commented on pull request #9466: URL: https://github.com/apache/arrow/pull/9466#issuecomment-778279411 No, there should be no change to `testing/`. I'll fix that.
[GitHub] [arrow] bkietz commented on a change in pull request #9470: ARROW-11480: [Python] Test filtering on INT96 timestamps
bkietz commented on a change in pull request #9470: URL: https://github.com/apache/arrow/pull/9470#discussion_r575325063

## File path: python/pyarrow/tests/parquet/test_dataset.py

```diff
+@pytest.mark.pandas
+def test_filters_inclusive_datetime(tempdir):
+    # ARROW-11480
+    path = tempdir / 'timestamps.parquet'
+
+    pd.DataFrame({
+        "dates": pd.date_range("2020-01-01", periods=10, freq="D"),
+        "id": range(10)
+    }).to_parquet(path, use_deprecated_int96_timestamps=True)
+
+    import datetime
```

Review comment: will do
[GitHub] [arrow] andygrove opened a new pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]
andygrove opened a new pull request #9481: URL: https://github.com/apache/arrow/pull/9481 To make it easier to implement serde for `HashAggregateExec` we need access to the schema that the aggregate expressions are compiled against. For `Partial` aggregates this is the same as the schema of the input. However, for `Final` aggregates it is not, which is why we need to add this additional schema information. TODO: this is untested, and I need to add tests in this PR.
[GitHub] [arrow] github-actions[bot] commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]
github-actions[bot] commented on pull request #9481: URL: https://github.com/apache/arrow/pull/9481#issuecomment-778280588 https://issues.apache.org/jira/browse/ARROW-11606
[GitHub] [arrow] andygrove commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]
andygrove commented on pull request #9481: URL: https://github.com/apache/arrow/pull/9481#issuecomment-778280783 @jorgecarleitao Does this make sense?
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls
jorisvandenbossche commented on a change in pull request #9323: URL: https://github.com/apache/arrow/pull/9323#discussion_r575328856

## File path: python/pyarrow/tests/test_dataset.py

```diff
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(
+    tempdir, partitioning, null_fallback, partition_keys
+):
     # ARROW-9288 / ARROW-9476
     import pyarrow.parquet as pq
-    table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
+
+    table = pa.table({'a': range(9), 'b': [0.0] * 4 + [1.0] * 5})
+
+    if None in partition_keys[0] or None in partition_keys[1]:
+        # Directory partitioning can't handle the first part being null
+        return
```

Review comment: only return here if `partitioning == "directory"`?

## File path: python/pyarrow/tests/test_dataset.py

```diff
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
```

Review comment: What does `null_fallback=None` mean? (based on the docstring above it seems it can only be a string?)

## File path: cpp/src/arrow/dataset/partition.cc

```diff
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const Expression& expr,
                                                       RecordBatchProjector* projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-    if (!ref_value.second.is_scalar()) {
-      return Status::Invalid("non-scalar partition key ", ref_value.second.ToString());
+    const auto& known_value = ref_value.second;
+    if (known_value.concrete() && !known_value.datum.is_scalar()) {
+      return Status::Invalid("non-scalar partition key ", known_value.datum.ToString());
     }
 
     ARROW_ASSIGN_OR_RAISE(auto match, ref_value.first.FindOneOrNone(*projector->schema()));
     if (match.empty()) continue;
-    RETURN_NOT_OK(projector->SetDefaultValue(match, ref_value.second.scalar()));
+
+    const auto& field = projector->schema()->field(match[0]);
+    if (known_value.concrete()) {
+      RETURN_NOT_OK(projector->SetDefaultValue(match, known_value.datum.scalar()));
+    } else if (known_value.valid) {
+      return Status::Invalid(
+          "Partition expression not defined enough to set default value for ",
```

Review comment: What does "not defined enough" in practice mean? (or what would be an example?)

## File path: python/pyarrow/tests/test_dataset.py

```diff
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(
```

Review comment: you added this to a test that is specifically about reading partitioned datasets while inferring the partition fields as dictionary. Which is fine (as this case also needs to be able to handle that), but this should also work (and so be tested) in the default case of not inferring dictionary type? And should we also have a test for the writing part? (this one only tests reading)
[GitHub] [arrow] lidavidm opened a new pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option
lidavidm opened a new pull request #9482: URL: https://github.com/apache/arrow/pull/9482

This exposes the pre-buffering option that was implemented for the base Parquet reader in Datasets. To summarize, the option coalesces and buffers ranges of a file based on the columns and row groups read, issuing a single read operation for multiple read requests by combining adjacent and "nearby" ranges into a single request. This means it handles both the case where the entire file is being read and the case where only a subset of columns and/or row groups is read.

There's no benchmark for Datasets in the repo, but here's a quick comparison for loading data from S3:

```
Without buffering:
Data read: 1300.00 MiB
Mean : 7.68 s
Median : 7.48 s
Stdev: 0.76 s
Mean rate: 170.57 MiB/s

With buffering:
Data read: 1300.00 MiB
Mean : 3.88 s
Median : 3.89 s
Stdev: 0.21 s
Mean rate: 335.95 MiB/s
```

The code being benchmarked is essentially:

```python
dataset = pyarrow.dataset.FileSystemDataset.from_paths(
    paths, schema=schema, format=format, filesystem=fs,
)
table = dataset.to_table()
```
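The coalescing the PR describes — merging adjacent and "nearby" byte ranges so several small reads become one larger request — can be sketched in plain Python. This is an illustrative sketch, not Arrow's actual implementation; the `hole_size_limit` threshold is an assumed parameter name:

```python
def coalesce_ranges(ranges, hole_size_limit=8192):
    """Merge byte ranges whose gaps are smaller than hole_size_limit.

    ranges: list of (offset, length) tuples, in any order.
    Returns a list of merged (offset, length) tuples, sorted by offset.
    """
    if not ranges:
        return []
    # Sort by offset so adjacent/nearby ranges become neighbors.
    ranges = sorted(ranges)
    merged = []
    cur_start, cur_len = ranges[0]
    for start, length in ranges[1:]:
        cur_end = cur_start + cur_len
        if start <= cur_end + hole_size_limit:
            # Close enough: extend the current request over the hole,
            # reading a few wasted bytes in exchange for fewer requests.
            cur_len = max(cur_end, start + length) - cur_start
        else:
            merged.append((cur_start, cur_len))
            cur_start, cur_len = start, length
    merged.append((cur_start, cur_len))
    return merged


# Two column chunks 50 bytes apart collapse into one S3-style request:
print(coalesce_ranges([(0, 100), (150, 100)], hole_size_limit=64))
```

On high-latency filesystems like S3, trading a small amount of over-read for far fewer round trips is what produces the roughly 2x throughput gain quoted above.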
[GitHub] [arrow] github-actions[bot] commented on pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option
github-actions[bot] commented on pull request #9482: URL: https://github.com/apache/arrow/pull/9482#issuecomment-778300420 https://issues.apache.org/jira/browse/ARROW-11601
[GitHub] [arrow] jorgecarleitao commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]
jorgecarleitao commented on pull request #9481: URL: https://github.com/apache/arrow/pull/9481#issuecomment-778320725 Makes a lot of sense to me. 👍 Does it address the issue on Ballista?
[GitHub] [arrow] andygrove commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec [WIP]
andygrove commented on pull request #9481: URL: https://github.com/apache/arrow/pull/9481#issuecomment-778323450 Yes, I think so. See https://github.com/ballista-compute/ballista/pull/505 ... I am going to finish testing this over the weekend.
[GitHub] [arrow] marcprux commented on pull request #9479: ARROW-11586: [Rust][Datafusion] Remove force unwrap
marcprux commented on pull request #9479: URL: https://github.com/apache/arrow/pull/9479#issuecomment-778369173 I've un-drafted it.
[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls
westonpace commented on a change in pull request #9323: URL: https://github.com/apache/arrow/pull/9323#discussion_r575451640

## File path: cpp/src/arrow/dataset/partition.cc

```diff
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const Expression& expr,
                                                       RecordBatchProjector* projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-    if (!ref_value.second.is_scalar()) {
-      return Status::Invalid("non-scalar partition key ", ref_value.second.ToString());
+    const auto& known_value = ref_value.second;
+    if (known_value.concrete() && !known_value.datum.is_scalar()) {
+      return Status::Invalid("non-scalar partition key ", known_value.datum.ToString());
    }
 
     ARROW_ASSIGN_OR_RAISE(auto match, ref_value.first.FindOneOrNone(*projector->schema()));
     if (match.empty()) continue;
-    RETURN_NOT_OK(projector->SetDefaultValue(match, ref_value.second.scalar()));
+
+    const auto& field = projector->schema()->field(match[0]);
+    if (known_value.concrete()) {
+      RETURN_NOT_OK(projector->SetDefaultValue(match, known_value.datum.scalar()));
+    } else if (known_value.valid) {
+      return Status::Invalid(
+          "Partition expression not defined enough to set default value for ",
```

Review comment: Today we only get in this state if the expression is something like `pa.field("a").is_valid()`, although technically this could be inferred from expressions like `pa.field("a") > 7` as well. Any expression that only evaluates true if the field is non-null gives us some hint at least that the result will be non-null. Although, on second review, it's probably better to just `continue` in this case rather than return an error.
[GitHub] [arrow] codecov-io commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.
codecov-io commented on pull request #9376: URL: https://github.com/apache/arrow/pull/9376#issuecomment-778382279

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=h1) Report

> Merging [#9376](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=desc) (41e8f26) into [master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc) (5e3fcfa) will **decrease** coverage by `0.17%`.
> The diff coverage is `74.50%`.

```diff
@@            Coverage Diff             @@
##           master    #9376      +/-   ##
==========================================
- Coverage   82.32%   82.14%   -0.18%
==========================================
  Files         233      235       +2
  Lines       54446    54582     +136
==========================================
+ Hits        44823    44838      +15
- Misses       9623     9744     +121
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| rust/arrow/src/array/builder.rs | `84.93% <ø> (ø)` | |
| rust/arrow/src/compute/kernels/sort.rs | `93.56% <ø> (ø)` | |
| rust/arrow/src/compute/kernels/substring.rs | `98.29% <ø> (ø)` | |
| rust/arrow/src/compute/util.rs | `98.92% <ø> (ø)` | |
| rust/arrow/src/json/reader.rs | `82.82% <ø> (ø)` | |
| rust/benchmarks/src/bin/tpch.rs | `38.33% <0.00%> (+0.18%)` | :arrow_up: |
| rust/datafusion/src/datasource/memory.rs | `80.00% <ø> (ø)` | |
| rust/datafusion/src/execution/dataframe_impl.rs | `93.83% <ø> (ø)` | |
| rust/datafusion/src/logical_plan/dfschema.rs | `85.52% <ø> (ø)` | |
| rust/datafusion/src/physical_plan/expressions/binary.rs | `85.53% <ø> (-0.32%)` | :arrow_down: |
| ... and [55 more](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree-more) | | |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=footer). Last update [34e7671...cd79e6e](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls
westonpace commented on a change in pull request #9323: URL: https://github.com/apache/arrow/pull/9323#discussion_r575453002 ## File path: python/pyarrow/tests/test_dataset.py ## @@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file(): @pytest.mark.parquet @pytest.mark.parametrize('partitioning', ["directory", "hive"]) +@pytest.mark.parametrize('null_fallback', ['xyz', None]) Review comment: In that case it does not pass in anything when it creates the partitioning and ensures that it defaults to the correct default value.
[GitHub] [arrow] dianaclarke commented on pull request #9272: [WIP] Benchmark placebo
dianaclarke commented on pull request #9272: URL: https://github.com/apache/arrow/pull/9272#issuecomment-778424894 @ursabot please benchmark
[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns
seddonm1 commented on a change in pull request #9233: URL: https://github.com/apache/arrow/pull/9233#discussion_r575535244

## File path: rust/datafusion/src/physical_plan/hash_aggregate.rs

```diff
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
-/// Create a key `Vec` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
```

Review comment: Bikeshedding: `idicies` -> `indices`
[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns
seddonm1 commented on a change in pull request #9233: URL: https://github.com/apache/arrow/pull/9233#discussion_r575535374

## File path: rust/datafusion/src/physical_plan/hash_aggregate.rs

```diff
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,
```

Review comment: `aproach` -> `approach`
[GitHub] [arrow] nealrichardson commented on pull request #9272: [WIP] Benchmark placebo
nealrichardson commented on pull request #9272: URL: https://github.com/apache/arrow/pull/9272#issuecomment-778471641 @github-actions rebase
[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns
seddonm1 commented on a change in pull request #9233: URL: https://github.com/apache/arrow/pull/9233#discussion_r575535635

## File path: rust/datafusion/src/physical_plan/hash_aggregate.rs

```diff
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,
+/// but it also has to to handle the case where the dictionary itself
+/// is not the same across all record batches (and thus indexes in one
+/// record batch may not correspond to the same index in another)
+fn dictionary_create_key_for_col(
```

Review comment: this makes sense and the comment really helps so :+1:
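The problem the quoted doc comment describes — a dictionary index is only meaningful relative to one batch's dictionary, so group keys must be built from the looked-up values — can be illustrated in plain Python. This is a simplified stand-in for Arrow's dictionary arrays, not DataFusion's actual code:

```python
from collections import defaultdict


def group_counts(batches):
    """Count rows per group across dictionary-encoded batches.

    Each batch is a (dictionary, indices) pair. The dictionaries may
    differ between batches, so the hash-map key must be the looked-up
    *value*, never the raw index.
    """
    counts = defaultdict(int)
    for dictionary, indices in batches:
        for idx in indices:
            # Look up the value: index 0 means "red" in one batch
            # and "blue" in another, so idx alone is ambiguous.
            key = dictionary[idx]
            counts[key] += 1
    return dict(counts)


# Two batches encoding the same logical values with different dictionaries:
batch1 = (["red", "blue"], [0, 1, 0])   # red, blue, red
batch2 = (["blue", "red"], [0, 0, 1])   # blue, blue, red
print(group_counts([batch1, batch2]))   # {'red': 3, 'blue': 3}
```

Grouping on the raw indices instead would wrongly merge `batch1`'s "red" rows with `batch2`'s "blue" rows, which is why the PR's fast path of hashing indices only works when the dictionary is shared across batches.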
[GitHub] [arrow] nealrichardson opened a new pull request #9483: ARROW-11610: [C++] Download boost from sourceforge instead of bintray
nealrichardson opened a new pull request #9483: URL: https://github.com/apache/arrow/pull/9483 This also removes some bintray URLs where we had mirrored dependency versions for redundancy (cf. ARROW-11611); they're not active anymore anyway because we've bumped the dependency versions and didn't update our mirroring. For what it's worth, our builds won't touch the updated boost on sourceforge because we still have our trimmed boost bundle in front of it (updating that is ARROW-11612), so we'll have to confirm this by other means.
[GitHub] [arrow] seddonm1 commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns
seddonm1 commented on pull request #9233: URL: https://github.com/apache/arrow/pull/9233#issuecomment-778472259 @alamb looks good to me 👍
[GitHub] [arrow] github-actions[bot] commented on pull request #9483: ARROW-11610: [C++] Download boost from sourceforge instead of bintray
github-actions[bot] commented on pull request #9483: URL: https://github.com/apache/arrow/pull/9483#issuecomment-778472380 https://issues.apache.org/jira/browse/ARROW-11610
[GitHub] [arrow] github-actions[bot] commented on pull request #9484: ARROW-11614: Fix round() logic to return positive zero when argument is zero
github-actions[bot] commented on pull request #9484: URL: https://github.com/apache/arrow/pull/9484#issuecomment-778482095 https://issues.apache.org/jira/browse/ARROW-11614
[GitHub] [arrow] sagnikc-dremio opened a new pull request #9484: ARROW-11614: Fix round() logic to return positive zero when argument is zero
sagnikc-dremio opened a new pull request #9484: URL: https://github.com/apache/arrow/pull/9484 Previously, round(0.0) and round(0.0, out_scale) returned -0.0; with this patch, round() returns +0.0.
[GitHub] [arrow] alamb closed pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec
alamb closed pull request #9481: URL: https://github.com/apache/arrow/pull/9481
[GitHub] [arrow] alamb closed pull request #9479: ARROW-11586: [Rust][Datafusion] Remove force unwrap
alamb closed pull request #9479: URL: https://github.com/apache/arrow/pull/9479
[GitHub] [arrow] alamb commented on pull request #9479: ARROW-11586: [Rust][Datafusion] Remove force unwrap
alamb commented on pull request #9479: URL: https://github.com/apache/arrow/pull/9479#issuecomment-778491382 Thanks @marcprux
[GitHub] [arrow] alamb commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec
alamb commented on pull request #9481: URL: https://github.com/apache/arrow/pull/9481#issuecomment-778492158 FWIW @jorgecarleitao and @andygrove -- I have been using DataFusion quite heavily in IOx (e.g. https://github.com/influxdata/influxdb_iox/pull/795) and it is doing great. DataFusion is a really cool piece of software.
[GitHub] [arrow] andygrove commented on pull request #9481: ARROW-11606: [Rust] [DataFusion] Add input schema to HashAggregateExec
andygrove commented on pull request #9481: URL: https://github.com/apache/arrow/pull/9481#issuecomment-778503743 Thanks for the quick review and merge @jorgecarleitao and @alamb. I think distributed query execution will be working well enough in Ballista to support some TPC-H queries this weekend! The great news is that Ballista is now a tiny codebase. DataFusion is working out really well thanks to all the great work that went into 3.0.0 :heart:
[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow
BryanCutler commented on pull request #9187: URL: https://github.com/apache/arrow/pull/9187#issuecomment-778506839 @WeichenXu123 we would just need to add the following API to `BaseVariableWidthVector` (and possibly `BaseRepeatedValueVector`):

```Java
public int getBufferSizeFor(final int valueCount, double density)
```

The argument `density` is used elsewhere, so we should stick with that to be consistent. Also, the method `public double getDensity()` already exists in `BaseVariableWidthVector`. As for making it static: it could be, but as with the related methods it might be a good idea not to, in case a subclass wants to override it in the future. Is it a requirement in your use case for it to be static?
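To make the density idea concrete, here is a hypothetical sketch of a density-based size estimate, written in Rust for illustration only (the API under discussion is Java, on `BaseVariableWidthVector`; the function name and the exact buffer layout below are assumptions, not the actual Arrow implementation):

```rust
// Hypothetical density-based buffer size estimate for a variable-width
// vector. `density` is the expected average number of data bytes per value.
fn buffer_size_for(value_count: usize, density: f64) -> usize {
    // offset buffer: (value_count + 1) four-byte offsets
    let offsets = (value_count + 1) * 4;
    // validity bitmap: one bit per value, rounded up to whole bytes
    let validity = (value_count + 7) / 8;
    // data buffer: estimated from the density
    let data = (value_count as f64 * density).ceil() as usize;
    offsets + validity + data
}
```

The point of taking `density` as a parameter is that the caller can size buffers ahead of time without knowing the actual string contents.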
[GitHub] [arrow] ursabot commented on pull request #9272: [WIP] Benchmark placebo
ursabot commented on pull request #9272: URL: https://github.com/apache/arrow/pull/9272#issuecomment-778520842 ubuntu-20.04-x86_64: https://conbench.ursa.dev/compare/runs/139e0ea9-58ba-4b90-9eb7-c1e91f4daebd...223ea2f6-3e1f-4bc2-86d9-36113b073403/ dgx-ubuntu-18.04-x86_64: https://conbench.ursa.dev/compare/runs/0c8b2a20-56a1-462d-aeea-ca76785728d2...064f63db-c6f6-480a-b9ae-200342921fb1/
[GitHub] [arrow] nealrichardson commented on a change in pull request #9423: ARROW-9856: [R] Add bindings for string compute functions
nealrichardson commented on a change in pull request #9423: URL: https://github.com/apache/arrow/pull/9423#discussion_r575584575

## File path: r/tests/testthat/test-dplyr.R

```diff
@@ -256,6 +262,29 @@ test_that("filter() with %in%", {
   )
 })

+test_that("filter() with string ops", {
+  expect_dplyr_equal(
+    input %>%
+      filter(dbl > 2, toupper(chr) %in% c("D", "F")) %>%
+      collect(),
+    tbl
+  )
```

Review comment: I looked into this. You can't just define `as.character` in the test because you need to replace the one that `toupper` calls. You can't use `testthat::with_mock` to mock base functions anymore, and my other hack of using `trace` to insert a crash seems to crash too many things (presumably legitimate places where as.character gets called elsewhere in the dplyr code). `mockery` might allow this more targeted but I haven't had great success setting it up before; there are other ways we could instrument our code to make sure the Arrow function is called, but I'm not sure it's worth it. It _definitely_ won't work if you do this on a Dataset, so we get some related test coverage there.
[GitHub] [arrow] houqp commented on pull request #9412: ARROW-11491: [Rust] support JSON schema inference for nested list and struct
houqp commented on pull request #9412: URL: https://github.com/apache/arrow/pull/9412#issuecomment-778534603 Sorry for the delay, I have been busy lately. Will update the PR this weekend to address all the feedback commented so far.
[GitHub] [arrow] seddonm1 opened a new pull request #9485: ARROW-11616: [Rust][DataFusion] Add collect_partitioned on DataFrame
seddonm1 opened a new pull request #9485: URL: https://github.com/apache/arrow/pull/9485 The DataFrame API has a `collect` method which invokes the `collect(plan: Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects records into a single vector of RecordBatches, removing any partitioning via `MergeExec`. This PR adds the DataFrame `collect_partitioned` method so that partitioning can be maintained. This allows easy passing into a new `MemTable`. @andygrove
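The distinction the PR describes can be sketched with a minimal, dependency-free model (this is an illustration of the idea only, not DataFusion's actual code; `Batch` stands in for arrow's `RecordBatch`):

```rust
// `Batch` stands in for a RecordBatch; a query result is a Vec of
// partitions, each holding its own batches.
type Batch = Vec<i32>;

// Like DataFrame::collect: flatten all partitions into one stream of
// batches, discarding the partitioning (what MergeExec effectively does).
fn collect(partitions: Vec<Vec<Batch>>) -> Vec<Batch> {
    partitions.into_iter().flatten().collect()
}

// Like the new DataFrame::collect_partitioned: keep the Vec<Vec<_>>
// shape, so the result can seed a partitioned MemTable directly.
fn collect_partitioned(partitions: Vec<Vec<Batch>>) -> Vec<Vec<Batch>> {
    partitions
}
```

Keeping the outer `Vec` is exactly what lets a downstream `MemTable` preserve the original partition boundaries instead of re-partitioning a merged stream.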
[GitHub] [arrow] github-actions[bot] commented on pull request #9485: ARROW-11616: [Rust][DataFusion] Add collect_partitioned on DataFrame
github-actions[bot] commented on pull request #9485: URL: https://github.com/apache/arrow/pull/9485#issuecomment-778550493 https://issues.apache.org/jira/browse/ARROW-11616
[GitHub] [arrow] codecov-io commented on pull request #9485: ARROW-11616: [Rust][DataFusion] Add collect_partitioned on DataFrame
codecov-io commented on pull request #9485: URL: https://github.com/apache/arrow/pull/9485#issuecomment-778552128

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=h1) Report

> Merging [#9485](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=desc) (5ed4321) into [master](https://codecov.io/gh/apache/arrow/commit/7660a22090fcf1d0230ca1e700a4b98f647b0c48?el=desc) (7660a22) will **decrease** coverage by `0.00%`.
> The diff coverage is `0.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9485/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=tree)

```diff
@@            Coverage Diff             @@
##           master    #9485      +/-   ##
==========================================
- Coverage   82.31%   82.30%   -0.01%
==========================================
  Files         234      234
  Lines       54482    54488       +6
==========================================
  Hits        44849    44849
- Misses       9633     9639       +6
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9485/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `90.13% <0.00%> (-3.71%)` | :arrow_down: |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=footer). Last update [7660a22...5ed4321](https://codecov.io/gh/apache/arrow/pull/9485?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] emkornfield commented on pull request #8949: ARROW-10880: [Java] Support compressing RecordBatch IPC buffers by LZ4
emkornfield commented on pull request #8949: URL: https://github.com/apache/arrow/pull/8949#issuecomment-778562980

> > ...before we merge (once we verify it is working I can take a final look)
>
> Sure. I will do some tests for that.

To run the tests it should be sufficient to unskip the Java implementation in archery.
[GitHub] [arrow] projjal opened a new pull request #9486: ARROW-11617: [C++][Gandiva] Fix nested if-else optimisation in gandiva
projjal opened a new pull request #9486: URL: https://github.com/apache/arrow/pull/9486 In Gandiva, when we have nested if-else statements we reuse the local bitmap and treat them as a single logical if / else-if / ... / else condition. However, when we have another function between them, such as IF THEN ELSE function( IF THEN ELSE ), we currently do the same thing, which can lead to incorrect results.
[GitHub] [arrow] github-actions[bot] commented on pull request #9486: ARROW-11617: [C++][Gandiva] Fix nested if-else optimisation in gandiva
github-actions[bot] commented on pull request #9486: URL: https://github.com/apache/arrow/pull/9486#issuecomment-778565617 https://issues.apache.org/jira/browse/ARROW-11617
[GitHub] [arrow] nevi-me commented on a change in pull request #9425: ARROW-11504: [Rust] Added checks to List DataType.
nevi-me commented on a change in pull request #9425: URL: https://github.com/apache/arrow/pull/9425#discussion_r575625396

## File path: rust/arrow/src/array/array_list.rs

```diff
@@ -31,12 +31,19 @@ use crate::datatypes::*;

 /// trait declaring an offset size, relevant for i32 vs i64 array types.
 pub trait OffsetSizeTrait: ArrowNativeType + Num + Ord + std::ops::AddAssign {
+    fn is_large() -> bool;
+
     fn prefix() -> &'static str;
     fn to_isize(&self) -> isize;
 }

 impl OffsetSizeTrait for i32 {
+    #[inline]
+    fn is_large() -> bool {
+        false
+    }
+
     fn prefix() -> &'static str {
```

Review comment: We might still need it, we also use it for formatting in `Display`
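The role the two trait methods play can be shown with a simplified re-creation of the pattern (the arrow supertraits and the `i64` impl below are stubbed/assumed for illustration; only the `i32` additions appear verbatim in the diff):

```rust
// Simplified version of OffsetSizeTrait: `is_large` distinguishes
// List (i32 offsets) from LargeList (i64 offsets), and `prefix` is
// what the Display formatting uses to build the type name.
trait OffsetSizeTrait {
    fn is_large() -> bool;
    fn prefix() -> &'static str;
}

impl OffsetSizeTrait for i32 {
    fn is_large() -> bool { false }
    fn prefix() -> &'static str { "" }
}

impl OffsetSizeTrait for i64 {
    fn is_large() -> bool { true }
    fn prefix() -> &'static str { "Large" }
}

// Sketch of how Display-style formatting would use `prefix`.
fn type_name<T: OffsetSizeTrait>() -> String {
    format!("{}List", T::prefix())
}
```

This is why `prefix` can't simply be dropped in favor of `is_large`: the formatting path needs the string, while validity checks need the boolean.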
[GitHub] [arrow] codecov-io commented on pull request #9469: ARROW-11599: [Rust] Add function to create array with all nulls
codecov-io commented on pull request #9469: URL: https://github.com/apache/arrow/pull/9469#issuecomment-778569205

# [Codecov](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=h1) Report

> Merging [#9469](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=desc) (8c45e47) into [master](https://codecov.io/gh/apache/arrow/commit/7660a22090fcf1d0230ca1e700a4b98f647b0c48?el=desc) (7660a22) will **decrease** coverage by `0.03%`.
> The diff coverage is `56.86%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9469/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=tree)

```diff
@@            Coverage Diff             @@
##           master    #9469      +/-   ##
==========================================
- Coverage   82.31%   82.28%   -0.04%
==========================================
  Files         234      234
  Lines       54482    54594     +112
==========================================
+ Hits        44849    44920      +71
- Misses       9633     9674      +41
```

| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `56.73% <15.00%> (+1.04%)` | :arrow_up: |
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `75.50% <62.99%> (-13.03%)` | :arrow_down: |
| [rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==) | `87.50% <66.66%> (-5.10%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `88.05% <66.66%> (-0.06%)` | :arrow_down: |
| [rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=) | `76.47% <0.00%> (+0.98%)` | :arrow_up: |
| [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9469/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `84.21% <0.00%> (+5.26%)` | :arrow_up: |

[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=continue).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=footer). Last update [7660a22...8c45e47](https://codecov.io/gh/apache/arrow/pull/9469?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] projjal closed pull request #9486: ARROW-11617: [C++][Gandiva] Fix nested if-else optimisation in gandiva
projjal closed pull request #9486: URL: https://github.com/apache/arrow/pull/9486