[jira] [Created] (ARROW-12290) [Rust][DataFusion] Add input_file_name function
Mike Seddon created ARROW-12290: --- Summary: [Rust][DataFusion] Add input_file_name function Key: ARROW-12290 URL: https://issues.apache.org/jira/browse/ARROW-12290 Project: Apache Arrow Issue Type: Improvement Reporter: Mike Seddon Assignee: Mike Seddon For lineage and diffing purposes (used by protocols like DeltaLake) it can be useful to know the source of input data for a Dataframe. This adds the `input_file_name` function which, like Spark, returns the name of the file being read, or NULL if not available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12186) [Rust][DataFusion] Fix regexp_match test
Mike Seddon created ARROW-12186: --- Summary: [Rust][DataFusion] Fix regexp_match test Key: ARROW-12186 URL: https://issues.apache.org/jira/browse/ARROW-12186 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon The current location of the regexp_match test does not work correctly with the feature flags.
[jira] [Created] (ARROW-11791) [Rust][DataFusion]
Mike Seddon created ARROW-11791: --- Summary: [Rust][DataFusion] Key: ARROW-11791 URL: https://issues.apache.org/jira/browse/ARROW-11791 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon After https://github.com/apache/arrow/pull/9523 RepartitionExec is pulling all data into memory before starting the stream which crashes on large sets.
[jira] [Created] (ARROW-11775) [Rust][DataFusion] Feature Flags for Dependencies
Mike Seddon created ARROW-11775: --- Summary: [Rust][DataFusion] Feature Flags for Dependencies Key: ARROW-11775 URL: https://issues.apache.org/jira/browse/ARROW-11775 Project: Apache Arrow Issue Type: Improvement Reporter: Mike Seddon Assignee: Mike Seddon As more features are added to DataFusion, more dependencies will inevitably be required. To reduce the cost of importing and compiling these dependencies for projects that do not need the functionality, it is proposed to use Rust 'feature flags' (https://doc.rust-lang.org/cargo/reference/features.html) so that each dependency can be switched on or off easily.
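A minimal sketch of how a feature flag can gate code at compile time. The feature name `regex_support` and the function name are hypothetical, chosen only for illustration; in a real crate the feature would be declared in the `[features]` table of `Cargo.toml` and would typically switch on an optional dependency.

```rust
/// Compiled only when the (hypothetical) `regex_support` feature is enabled.
#[cfg(feature = "regex_support")]
fn regex_functions_available() -> bool {
    true
}

/// Fallback compiled when the feature is disabled, so callers still build.
#[cfg(not(feature = "regex_support"))]
fn regex_functions_available() -> bool {
    false
}

fn main() {
    // Without `--features regex_support` (or `--cfg 'feature="regex_support"'`)
    // the fallback definition is the one that gets compiled.
    println!("regex functions available: {}", regex_functions_available());
}
```

Exactly one of the two definitions is compiled, so code behind a disabled feature (and any dependency used only there) adds nothing to the build.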
[jira] [Created] (ARROW-11738) Concat Functions
Mike Seddon created ARROW-11738: --- Summary: Concat Functions Key: ARROW-11738 URL: https://issues.apache.org/jira/browse/ARROW-11738 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Assignee: Mike Seddon Fix and implement the concat functions.
[jira] [Created] (ARROW-11687) [Rust][DataFusion] RepartitionExec Hanging
Mike Seddon created ARROW-11687: --- Summary: [Rust][DataFusion] RepartitionExec Hanging Key: ARROW-11687 URL: https://issues.apache.org/jira/browse/ARROW-11687 Project: Apache Arrow Issue Type: Bug Reporter: Mike Seddon Assignee: Mike Seddon Found an interesting defect where the final partition of the `RepartitionExec::execute` thread spawner was consistently not being spawned via `tokio::spawn`. This meant that `RepartitionStream::poll_next` sat waiting forever for data that never arrived. It looks like a race condition in which the `JoinHandle` was not being `await`ed, possibly combined with some internal tokio behavior such as lazy evaluation.
[jira] [Created] (ARROW-11655) Pad/trim functions
Mike Seddon created ARROW-11655: --- Summary: Pad/trim functions Key: ARROW-11655 URL: https://issues.apache.org/jira/browse/ARROW-11655 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Assignee: Mike Seddon The Pad and Trimming functions
[jira] [Created] (ARROW-11656) Left over functions/fixes
Mike Seddon created ARROW-11656: --- Summary: Left over functions/fixes Key: ARROW-11656 URL: https://issues.apache.org/jira/browse/ARROW-11656 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Assignee: Mike Seddon
[jira] [Created] (ARROW-11654) Regex functions
Mike Seddon created ARROW-11654: --- Summary: Regex functions Key: ARROW-11654 URL: https://issues.apache.org/jira/browse/ARROW-11654 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Assignee: Mike Seddon The regexp Postgres functions
[jira] [Created] (ARROW-11653) Ascii/unicode functions
Mike Seddon created ARROW-11653: --- Summary: Ascii/unicode functions Key: ARROW-11653 URL: https://issues.apache.org/jira/browse/ARROW-11653 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Implement the Postgres Ascii/Unicode functions
[jira] [Created] (ARROW-11652) Signature::OneOf
Mike Seddon created ARROW-11652: --- Summary: Signature::OneOf Key: ARROW-11652 URL: https://issues.apache.org/jira/browse/ARROW-11652 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Assignee: Mike Seddon There needs to be a way of defining a function signature that supports multiple strict options, e.g. `lpad` accepts either [string, int] or [string, int, string].
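One way such a signature could be modelled is sketched below. The `ArgType` and `Signature` types here are simplified, hypothetical stand-ins written for this example and are not DataFusion's actual definitions; only the `lpad` argument lists come from the ticket.

```rust
/// Hypothetical, simplified argument types (not DataFusion's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArgType {
    Utf8,
    Int64,
}

/// Hypothetical signature type illustrating a `OneOf` variant.
#[derive(Debug, Clone)]
enum Signature {
    /// Exactly these argument types, in this order.
    Exact(Vec<ArgType>),
    /// Any one of the listed signatures may match.
    OneOf(Vec<Signature>),
}

impl Signature {
    /// Returns true if `args` satisfies this signature.
    fn matches(&self, args: &[ArgType]) -> bool {
        match self {
            Signature::Exact(expected) => expected.as_slice() == args,
            Signature::OneOf(options) => options.iter().any(|s| s.matches(args)),
        }
    }
}

/// Signature for `lpad`: [string, int] or [string, int, string].
fn lpad_signature() -> Signature {
    Signature::OneOf(vec![
        Signature::Exact(vec![ArgType::Utf8, ArgType::Int64]),
        Signature::Exact(vec![ArgType::Utf8, ArgType::Int64, ArgType::Utf8]),
    ])
}

fn main() {
    let sig = lpad_signature();
    println!("{}", sig.matches(&[ArgType::Utf8, ArgType::Int64]));
    println!("{}", sig.matches(&[ArgType::Utf8]));
}
```

Because `OneOf` holds full signatures rather than bare type lists, the options stay strict: an argument list must match one alternative exactly, with no mixing between them.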
[jira] [Created] (ARROW-11651) Postgres Length Functions
Mike Seddon created ARROW-11651: --- Summary: Postgres Length Functions Key: ARROW-11651 URL: https://issues.apache.org/jira/browse/ARROW-11651 Project: Apache Arrow Issue Type: Sub-task Reporter: Mike Seddon Assignee: Mike Seddon To break up the large PR this is just the Postgres length functions
[jira] [Created] (ARROW-11650) [Rust][DataFusion] Add Postgres License
Mike Seddon created ARROW-11650: --- Summary: [Rust][DataFusion] Add Postgres License Key: ARROW-11650 URL: https://issues.apache.org/jira/browse/ARROW-11650 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon DataFusion aims to be compatible with PostgreSQL. To achieve this, parts of the DataFusion code base may reproduce code and documentation from the PostgreSQL project, so the license needs to reflect this.
[jira] [Created] (ARROW-11616) [Rust][DataFusion] Expose collect_partitioned for DataFrame
Mike Seddon created ARROW-11616: --- Summary: [Rust][DataFusion] Expose collect_partitioned for DataFrame Key: ARROW-11616 URL: https://issues.apache.org/jira/browse/ARROW-11616 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon The DataFrame API has a `collect` method which invokes the `collect(plan: Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which collects records into a single vector of RecordBatches, removing the partitioning via `MergeExec`. The DataFrame should also expose the `collect_partitioned` method so that partitions can be maintained.
```
collect_partitioned(
    plan: Arc<dyn ExecutionPlan>,
) -> Result<Vec<Vec<RecordBatch>>>
```
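The difference between the two result shapes can be sketched with plain vectors. `Batch` and both function names below are hypothetical stand-ins for this illustration; they do not call DataFusion and say nothing about how `MergeExec` works internally.

```rust
/// Stand-in for a record batch; a real batch would hold Arrow arrays.
type Batch = Vec<i32>;

/// Merges all partitions into one flat list of batches (partitioning is lost).
fn collect_merged(partitions: &[Vec<Batch>]) -> Vec<Batch> {
    partitions.iter().flatten().cloned().collect()
}

/// Keeps one inner vector per partition (partitioning is preserved).
fn collect_by_partition(partitions: &[Vec<Batch>]) -> Vec<Vec<Batch>> {
    partitions.to_vec()
}

fn main() {
    // Two partitions: the first holds two batches, the second holds one.
    let partitions = vec![vec![vec![1, 2], vec![3]], vec![vec![4]]];
    println!("{:?}", collect_merged(&partitions));
    println!("{:?}", collect_by_partition(&partitions));
}
```

With the partitioned form a caller can still tell which batches came from which partition, which the merged form cannot recover.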
[jira] [Created] (ARROW-11561) [Rust][DataFusion] Add Send + Sync to MemTable
Mike Seddon created ARROW-11561: --- Summary: [Rust][DataFusion] Add Send + Sync to MemTable Key: ARROW-11561 URL: https://issues.apache.org/jira/browse/ARROW-11561 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon Add Send + Sync to the MemTable::load to allow the Spark `persist` behavior to be implemented for DataFrames
[jira] [Created] (ARROW-11434) Length kernel returns bytes not character length
Mike Seddon created ARROW-11434: --- Summary: Length kernel returns bytes not character length Key: ARROW-11434 URL: https://issues.apache.org/jira/browse/ARROW-11434 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon The Rust `length` kernel currently counts bytes (octets) rather than characters, which matters because Arrow strings are UTF-8 encoded. This means that the `length` kernel returns 5 (bytes) for a string like `josé` rather than 4 (characters).
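The byte/character discrepancy can be reproduced with the Rust standard library alone. The two helper names are hypothetical; `str::len` and `str::chars` are standard. Note that `chars()` counts Unicode scalar values, which equals the character count for this example but is not the same as counting grapheme clusters.

```rust
/// Length in bytes of the UTF-8 encoding.
fn byte_length(s: &str) -> usize {
    s.len()
}

/// Length in Unicode scalar values (what most users think of as characters).
fn char_length(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "josé" written with an explicit escape for é (U+00E9, two bytes in UTF-8).
    let s = "jos\u{e9}";
    println!("bytes = {}, chars = {}", byte_length(s), char_length(s));
}
```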
[jira] [Created] (ARROW-11339) [Rust][DataFusion] length kernel does not correctly calculate character length
Mike Seddon created ARROW-11339: --- Summary: [Rust][DataFusion] length kernel does not correctly calculate character length Key: ARROW-11339 URL: https://issues.apache.org/jira/browse/ARROW-11339 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon The current kernel works for simple characters because it appears to assume that 1 byte = 1 character. This is very fast but is not a safe assumption given that Arrow strings are UTF-8. A simple failing case is the Postgres example `char_length('josé') → 4`, for which the current `length` implementation returns 5. The correct method seems to be via https://docs.rs/unicode-segmentation/1.2.1/unicode_segmentation/struct.Graphemes.html which I can implement in my work here: https://github.com/apache/arrow/pull/9243 and remove from the kernel.
[jira] [Created] (ARROW-11298) [Rust][DataFusion] Implement Postgres String Functions
Mike Seddon created ARROW-11298: --- Summary: [Rust][DataFusion] Implement Postgres String Functions Key: ARROW-11298 URL: https://issues.apache.org/jira/browse/ARROW-11298 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon This is a general task to add the Postgres String Functions to DataFusion. https://www.postgresql.org/docs/13/functions-string.html
[jira] [Created] (ARROW-11102) [Rust][DataFusion] fmt::Debug for ScalarValue(Utf8) is always quoted
Mike Seddon created ARROW-11102: --- Summary: [Rust][DataFusion] fmt::Debug for ScalarValue(Utf8) is always quoted Key: ARROW-11102 URL: https://issues.apache.org/jira/browse/ARROW-11102 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon When viewing the plans it is difficult to differentiate between a true NULL value and a quoted string like "NULL".
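One possible convention that keeps the two cases distinguishable is sketched below: quote real strings and print a true NULL bare. The `Scalar` type is a reduced, hypothetical stand-in, not DataFusion's `ScalarValue`, and the exact output format is an assumption made for illustration.

```rust
use std::fmt;

/// Hypothetical, reduced scalar type (not DataFusion's `ScalarValue`).
enum Scalar {
    Utf8(Option<String>),
}

impl fmt::Debug for Scalar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Non-null strings are wrapped in double quotes.
            Scalar::Utf8(Some(v)) => write!(f, "Utf8(\"{}\")", v),
            // A true NULL is printed without quotes.
            Scalar::Utf8(None) => write!(f, "Utf8(NULL)"),
        }
    }
}

fn main() {
    println!("{:?}", Scalar::Utf8(Some("NULL".to_string())));
    println!("{:?}", Scalar::Utf8(None));
}
```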
[jira] [Created] (ARROW-11036) [Rust][DataFusion] Allow CSVReader to infer only columns not types
Mike Seddon created ARROW-11036: --- Summary: [Rust][DataFusion] Allow CSVReader to infer only columns not types Key: ARROW-11036 URL: https://issues.apache.org/jira/browse/ARROW-11036 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Mike Seddon Currently the CSVReader will only infer the number of columns if it also attempts to infer types. This should be decoupled so that a user can easily extract a CSV in which every column is typed Utf8 and the number of columns matches the input file. The user can then use CAST() or equivalent to control the parsing.
[jira] [Created] (ARROW-11013) [Rust] CSV Reader cannot handle leading/trailing WhiteSpace
Mike Seddon created ARROW-11013: --- Summary: [Rust] CSV Reader cannot handle leading/trailing WhiteSpace Key: ARROW-11013 URL: https://issues.apache.org/jira/browse/ARROW-11013 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Affects Versions: 2.0.0 Reporter: Mike Seddon Currently the CSV Reader assumes very clean input data with no leading or trailing spaces. This means that parsing data such as the TPC-H 'answers' set from the databricks/tpch_dbgen repo does not work (see the sample below). Spark uses the Univocity parser library, which provides the options 'ignoreLeadingWhitespace' and 'ignoreTrailingWhitespace'; equivalent options would help fix this issue.
```
l|l|sum_qty|sum_base_price|sum_disc_price|sum_charge|avg_qty|avg_price|avg_disc|count_order
A|F|37734107.00|56586554400.73|53758257134.87|55909065222.83|25.52|38273.13|0.05| 1478493
```
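A minimal sketch of what such options could do, using only the standard library. `TrimOptions` and `split_fields` are hypothetical names that mirror the Univocity option names mentioned in the ticket; this is not the Arrow CSV reader's API, and it ignores quoting and escaping.

```rust
/// Hypothetical options mirroring 'ignoreLeadingWhitespace' and
/// 'ignoreTrailingWhitespace'.
struct TrimOptions {
    ignore_leading_whitespace: bool,
    ignore_trailing_whitespace: bool,
}

/// Splits one line on `delimiter` and trims each field as configured.
fn split_fields<'a>(line: &'a str, delimiter: char, opts: &TrimOptions) -> Vec<&'a str> {
    line.split(delimiter)
        .map(|field| {
            let field = if opts.ignore_leading_whitespace {
                field.trim_start()
            } else {
                field
            };
            if opts.ignore_trailing_whitespace {
                field.trim_end()
            } else {
                field
            }
        })
        .collect()
}

fn main() {
    let opts = TrimOptions {
        ignore_leading_whitespace: true,
        ignore_trailing_whitespace: true,
    };
    // The last field has a leading space, as in the TPC-H answers sample.
    let fields = split_fields("A|F|0.05| 1478493", '|', &opts);
    println!("{:?}", fields);
    println!("{:?}", fields[3].parse::<u64>());
}
```

The trimming matters because Rust's integer parsing rejects surrounding whitespace, so the untrimmed field `" 1478493"` fails to parse as a number.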
[jira] [Created] (ARROW-10970) [Rust][DataFusion] Implement Value(Null)
Mike Seddon created ARROW-10970: --- Summary: [Rust][DataFusion] Implement Value(Null) Key: ARROW-10970 URL: https://issues.apache.org/jira/browse/ARROW-10970 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Mike Seddon We need to add support for the NULL value. For example:
```sql
SELECT char_length(NULL) AS char_length_null
```
[jira] [Created] (ARROW-10969) [Rust][DataFusion] Implement basic String Functions
Mike Seddon created ARROW-10969: --- Summary: [Rust][DataFusion] Implement basic String Functions Key: ARROW-10969 URL: https://issues.apache.org/jira/browse/ARROW-10969 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon There are not many ANSI SQL functions currently supported. This ticket is an umbrella for increasing the support.
[jira] [Created] (ARROW-10947) [Rust][DataFusion] Refactor UTF8 to Date32 for Performance
Mike Seddon created ARROW-10947: --- Summary: [Rust][DataFusion] Refactor UTF8 to Date32 for Performance Key: ARROW-10947 URL: https://issues.apache.org/jira/browse/ARROW-10947 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon After adding benchmarking capability to the UTF8 to Date32/Date64 CAST functions, there was an opportunity to improve the performance.
[jira] [Created] (ARROW-10907) [Rust][DataFusion] Cast UTF8 to Date64 Incorrect
Mike Seddon created ARROW-10907: --- Summary: [Rust][DataFusion] Cast UTF8 to Date64 Incorrect Key: ARROW-10907 URL: https://issues.apache.org/jira/browse/ARROW-10907 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon Fix For: 3.0.0 The current UTF8 to Date64 Cast behavior is incorrect in that it operates on the `%Y-%m-%d` format rather than `%Y-%m-%dT%H:%M:%S`.
[jira] [Created] (ARROW-10839) [Rust] [DataFusion] Implement BETWEEN Operator
Mike Seddon created ARROW-10839: --- Summary: [Rust] [DataFusion] Implement BETWEEN Operator Key: ARROW-10839 URL: https://issues.apache.org/jira/browse/ARROW-10839 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon Fix For: 3.0.0 3 of the 22 TPC-H queries use the *BETWEEN* operator, which is syntactic sugar for: {value} >= {low} AND {value} <= {high} e.g. `and l_discount between 0.06 - 0.01 and 0.06 + 0.01` is equal to `and l_discount >= 0.06 - 0.01 and l_discount <= 0.06 + 0.01`
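The expansion `{value} >= {low} AND {value} <= {high}` can be written as a small generic helper. The function name `between` is hypothetical and the sketch covers only the comparison semantics, not how DataFusion would plan or rewrite the expression.

```rust
/// Inclusive range test equivalent to `value BETWEEN low AND high`.
fn between<T: PartialOrd>(value: T, low: T, high: T) -> bool {
    value >= low && value <= high
}

fn main() {
    // Both bounds are inclusive, so the endpoints themselves match.
    println!("{}", between(5, 5, 10));
    println!("{}", between(10, 5, 10));
    println!("{}", between(11, 5, 10));
}
```

The inclusive bounds are the important detail: a rewrite that used strict `>` and `<` would silently drop rows sitting exactly on either boundary.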
[jira] [Created] (ARROW-10820) [Rust] [DataFusion] Complete TPC-H Benchmark Queries
Mike Seddon created ARROW-10820: --- Summary: [Rust] [DataFusion] Complete TPC-H Benchmark Queries Key: ARROW-10820 URL: https://issues.apache.org/jira/browse/ARROW-10820 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon Add the rest of the TPC-H queries so they can be easily executed as more SQL functionality is implemented.
[jira] [Created] (ARROW-10819) [Rust] [DataFusion] Implement EXISTS operator
Mike Seddon created ARROW-10819: --- Summary: [Rust] [DataFusion] Implement EXISTS operator Key: ARROW-10819 URL: https://issues.apache.org/jira/browse/ARROW-10819 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Mike Seddon The TPC-H queries use the EXISTS operator, which tests for the existence of any record in a subquery. For example: and *exists* ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate )
[jira] [Created] (ARROW-10818) [Rust] [DataFusion] Implement DECIMAL type
Mike Seddon created ARROW-10818: --- Summary: [Rust] [DataFusion] Implement DECIMAL type Key: ARROW-10818 URL: https://issues.apache.org/jira/browse/ARROW-10818 Project: Apache Arrow Issue Type: New Feature Components: Rust, Rust - DataFusion Reporter: Mike Seddon The TPC-H benchmarks correctly specify that all MONEY columns are DECIMAL type (precision and scale are not specified). We currently use `DataType::Float64` which is much lighter than a true Decimal type. To be a valid benchmark we need to ensure we support the same precision as the reference implementation.
[jira] [Created] (ARROW-10817) [Rust] [DataFusion] Implement inline CAST syntax
Mike Seddon created ARROW-10817: --- Summary: [Rust] [DataFusion] Implement inline CAST syntax Key: ARROW-10817 URL: https://issues.apache.org/jira/browse/ARROW-10817 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Affects Versions: 3.0.0 Reporter: Mike Seddon Of the 22 TPC-H queries, 11 rely on what I am calling 'inline casting' of dates e.g.: l_shipdate <= *date* '1998-12-01' We need to be able to parse this to the correct `CastExpr`.
[jira] [Created] (ARROW-10816) [Rust] [DataFusion] Implement INTERVAL
Mike Seddon created ARROW-10816: --- Summary: [Rust] [DataFusion] Implement INTERVAL Key: ARROW-10816 URL: https://issues.apache.org/jira/browse/ARROW-10816 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Affects Versions: 3.0.0 Reporter: Mike Seddon Of the 22 TPC-H queries, 9 depend on the INTERVAL functionality. e.g. from query 1: l_shipdate <= date '1998-12-01' - *interval* '[DELTA]' day (3)