[jira] [Created] (ARROW-12290) [Rust][DataFusion] Add input_file_name function

2021-04-07 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-12290:
---

 Summary: [Rust][DataFusion] Add input_file_name function
 Key: ARROW-12290
 URL: https://issues.apache.org/jira/browse/ARROW-12290
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Mike Seddon
Assignee: Mike Seddon


For lineage and diffing purposes (used by protocols like DeltaLake) it can be 
useful to know the source of input data for a DataFrame. This adds the 
`input_file_name` function which, like Spark, returns the name of the file 
being read, or NULL if not available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12186) [Rust][DataFusion] Fix regexp_match test

2021-04-01 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-12186:
---

 Summary: [Rust][DataFusion] Fix regexp_match test
 Key: ARROW-12186
 URL: https://issues.apache.org/jira/browse/ARROW-12186
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


The current location of the `regexp_match` test does not work correctly with 
the feature flags.





[jira] [Created] (ARROW-11791) [Rust][DataFusion]

2021-02-25 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11791:
---

 Summary: [Rust][DataFusion]
 Key: ARROW-11791
 URL: https://issues.apache.org/jira/browse/ARROW-11791
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


After https://github.com/apache/arrow/pull/9523, RepartitionExec pulls all 
data into memory before starting the stream, which crashes on large datasets.





[jira] [Created] (ARROW-11775) [Rust][DataFusion] Feature Flags for Dependencies

2021-02-24 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11775:
---

 Summary: [Rust][DataFusion] Feature Flags for Dependencies
 Key: ARROW-11775
 URL: https://issues.apache.org/jira/browse/ARROW-11775
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Mike Seddon
Assignee: Mike Seddon


As more features are added to DataFusion, more dependencies will inevitably be 
required. To reduce the cost of importing and compiling these dependencies for 
projects that do not need the functionality, it is proposed to use Rust 'feature 
flags' (https://doc.rust-lang.org/cargo/reference/features.html) to control 
this easily.
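
As a rough sketch of how this could look in code, a dependency-heavy function can be compiled only when its feature is enabled, with a cheap fallback otherwise. The feature name `regex_expressions` and the function name `regexp_available` are assumptions made for illustration, not DataFusion's actual configuration.

```rust
// Hypothetical feature name; it would be declared under [features] in Cargo.toml.
#[cfg(feature = "regex_expressions")]
fn regexp_available() -> bool {
    // Compiled only when the optional feature (and its dependency) is enabled.
    true
}

#[cfg(not(feature = "regex_expressions"))]
fn regexp_available() -> bool {
    // Fallback compiled when the feature is off, so the dependency is never built.
    false
}

fn main() {
    // `cfg!` evaluates the same condition as a compile-time boolean.
    assert_eq!(regexp_available(), cfg!(feature = "regex_expressions"));
    println!("regex support compiled in: {}", regexp_available());
}
```

Projects that do not enable the feature never compile the gated function or the crate behind it.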





[jira] [Created] (ARROW-11738) Concat Functions

2021-02-22 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11738:
---

 Summary: Concat Functions
 Key: ARROW-11738
 URL: https://issues.apache.org/jira/browse/ARROW-11738
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon
Assignee: Mike Seddon


Fix and Implement the concat functions





[jira] [Created] (ARROW-11687) [Rust][DataFusion] RepartitionExec Hanging

2021-02-17 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11687:
---

 Summary: [Rust][DataFusion] RepartitionExec Hanging
 Key: ARROW-11687
 URL: https://issues.apache.org/jira/browse/ARROW-11687
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Mike Seddon
Assignee: Mike Seddon


Found an interesting defect where the final partition of the 
`RepartitionExec::execute` thread spawner was consistently not being spawned 
via `tokio::spawn`. This meant that `RepartitionStream::poll_next` was sitting 
waiting forever for data that never arrived.

It looks like a race condition where the `JoinHandle` was not being `await`ed, 
possibly combined with something in the internals of tokio such as lazy 
evaluation?





[jira] [Created] (ARROW-11655) Pad/trim functions

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11655:
---

 Summary: Pad/trim functions
 Key: ARROW-11655
 URL: https://issues.apache.org/jira/browse/ARROW-11655
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon
Assignee: Mike Seddon


Implement the pad and trim functions.





[jira] [Created] (ARROW-11656) Left over functions/fixes

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11656:
---

 Summary: Left over functions/fixes
 Key: ARROW-11656
 URL: https://issues.apache.org/jira/browse/ARROW-11656
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon
Assignee: Mike Seddon








[jira] [Created] (ARROW-11654) Regex functions

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11654:
---

 Summary: Regex functions
 Key: ARROW-11654
 URL: https://issues.apache.org/jira/browse/ARROW-11654
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon
Assignee: Mike Seddon


Implement the Postgres regexp functions.





[jira] [Created] (ARROW-11653) Ascii/unicode functions

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11653:
---

 Summary: Ascii/unicode functions
 Key: ARROW-11653
 URL: https://issues.apache.org/jira/browse/ARROW-11653
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon


Implement the Postgres Ascii/Unicode functions





[jira] [Created] (ARROW-11652) Signature::OneOf

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11652:
---

 Summary: Signature::OneOf
 Key: ARROW-11652
 URL: https://issues.apache.org/jira/browse/ARROW-11652
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon
Assignee: Mike Seddon


There needs to be a way of defining a function signature that supports multiple 
strict options:

e.g. `lpad`
[string, int] or [string, int, string]
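
One possible shape for such a signature is sketched below. The `DataType` and `Signature` types are simplified stand-ins written for this example and are not DataFusion's actual definitions; `lpad_signature` is a hypothetical helper.

```rust
// Illustrative types only; they are not DataFusion's actual definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
}

#[derive(Debug)]
enum Signature {
    /// Exactly these argument types, in this order.
    Exact(Vec<DataType>),
    /// Any one of the listed signatures may match.
    OneOf(Vec<Signature>),
}

impl Signature {
    /// Returns true if `args` satisfies this signature.
    fn matches(&self, args: &[DataType]) -> bool {
        match self {
            Signature::Exact(expected) => expected.as_slice() == args,
            Signature::OneOf(options) => options.iter().any(|s| s.matches(args)),
        }
    }
}

/// Hypothetical signature for `lpad`: two or three strictly typed arguments.
fn lpad_signature() -> Signature {
    use DataType::{Int64, Utf8};
    Signature::OneOf(vec![
        Signature::Exact(vec![Utf8, Int64]),
        Signature::Exact(vec![Utf8, Int64, Utf8]),
    ])
}

fn main() {
    use DataType::{Int64, Utf8};
    let sig = lpad_signature();
    assert!(sig.matches(&[Utf8, Int64]));
    assert!(sig.matches(&[Utf8, Int64, Utf8]));
    assert!(!sig.matches(&[Int64, Utf8]));
    println!("{:?}", sig);
}
```

Because `OneOf` holds nested signatures, each option stays strict while the function as a whole accepts several argument lists.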





[jira] [Created] (ARROW-11651) Postgres Length Functions

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11651:
---

 Summary: Postgres Length Functions
 Key: ARROW-11651
 URL: https://issues.apache.org/jira/browse/ARROW-11651
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Mike Seddon
Assignee: Mike Seddon


To break up the large PR, this is just the Postgres length functions.





[jira] [Created] (ARROW-11650) [Rust][DataFusion] Add Postgres License

2021-02-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11650:
---

 Summary: [Rust][DataFusion] Add Postgres License
 Key: ARROW-11650
 URL: https://issues.apache.org/jira/browse/ARROW-11650
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


DataFusion aims to support PostgreSQL compatibility. To achieve compatibility, 
parts of the DataFusion code base may have reproduced code and documentation 
from the PostgreSQL project, so the license needs to reflect this.





[jira] [Created] (ARROW-11616) [Rust][DataFusion] Expose collect_partitioned for DataFrame

2021-02-12 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11616:
---

 Summary: [Rust][DataFusion] Expose collect_partitioned for 
DataFrame
 Key: ARROW-11616
 URL: https://issues.apache.org/jira/browse/ARROW-11616
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


The DataFrame API has a `collect` method which invokes the `collect(plan: 
Arc<dyn ExecutionPlan>) -> Result<Vec<RecordBatch>>` function, which will 
collect records into a single vector of RecordBatches, removing the partitioning 
via `MergeExec`.

The DataFrame should also expose the `collect_partitioned` method so that 
partitions can be maintained.

```
collect_partitioned(
    plan: Arc<dyn ExecutionPlan>,
) -> Result<Vec<Vec<RecordBatch>>>
```
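
The difference in result shape can be illustrated with plain vectors. This is only a sketch: `Batch` is a stand-in for a RecordBatch, and the two functions below are simplified, synchronous models written for this example rather than the DataFusion functions themselves.

```rust
// Stand-in for an Arrow RecordBatch: here just a vector of row values.
type Batch = Vec<i32>;

/// Keeps one inner vector per partition (one outer entry per partition).
fn collect_partitioned(partitions: &[Vec<Batch>]) -> Vec<Vec<Batch>> {
    partitions.to_vec()
}

/// Merges all partitions into one flat vector of batches.
fn collect(partitions: &[Vec<Batch>]) -> Vec<Batch> {
    partitions.iter().flatten().cloned().collect()
}

fn main() {
    // Two partitions: the first holds two batches, the second holds one.
    let partitions = vec![vec![vec![1, 2], vec![3]], vec![vec![4, 5]]];

    let merged = collect(&partitions);
    let partitioned = collect_partitioned(&partitions);

    assert_eq!(merged.len(), 3); // partition boundaries are gone
    assert_eq!(partitioned.len(), 2); // partition boundaries are kept
    println!("merged = {:?}, partitioned = {:?}", merged, partitioned);
}
```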





[jira] [Created] (ARROW-11561) [Rust][DataFusion] Add Send + Sync to MemTable

2021-02-08 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11561:
---

 Summary: [Rust][DataFusion] Add Send + Sync to MemTable
 Key: ARROW-11561
 URL: https://issues.apache.org/jira/browse/ARROW-11561
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


Add Send + Sync to the MemTable::load to allow the Spark `persist` behavior to 
be implemented for DataFrames





[jira] [Created] (ARROW-11434) Length kernel returns bytes not character length

2021-01-29 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11434:
---

 Summary: Length kernel returns bytes not character length
 Key: ARROW-11434
 URL: https://issues.apache.org/jira/browse/ARROW-11434
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


The Rust `length` kernel currently counts the number of bytes/octets rather than 
characters. Arrow strings are UTF-8 encoded, so one character may occupy more 
than one byte.

This means that the result of the `length` kernel on a string like `josé` will 
be 5 (bytes) rather than 4 (characters).
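
The discrepancy can be reproduced with the Rust standard library alone. The helper names `byte_length` and `char_length` are made up for this sketch and are not the kernel's API.

```rust
/// Length in bytes, which is what a byte-counting kernel reports.
fn byte_length(s: &str) -> usize {
    s.len()
}

/// Length in Unicode scalar values (Rust `char`s).
fn char_length(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "josé" written with a precomposed é (U+00E9), which takes two bytes in UTF-8.
    let s = "jos\u{e9}";
    assert_eq!(byte_length(s), 5);
    assert_eq!(char_length(s), 4);
    println!("bytes = {}, chars = {}", byte_length(s), char_length(s));
}
```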





[jira] [Created] (ARROW-11339) [Rust][DataFusion] length kernel does not correctly calculate character length

2021-01-21 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11339:
---

 Summary: [Rust][DataFusion] length kernel does not correctly 
calculate character length
 Key: ARROW-11339
 URL: https://issues.apache.org/jira/browse/ARROW-11339
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


The current kernel works for simple characters as it appears to assume 
that 1 byte = 1 character. This is very fast but is not a safe assumption given 
that Arrow strings are UTF-8.

A simple failing case is the example from the Postgres documentation, for which 
the current `length` implementation returns 5:

`char_length('josé') → 4`

The correct method seems to be via 
https://docs.rs/unicode-segmentation/1.2.1/unicode_segmentation/struct.Graphemes.html
which I can implement in my work here: 
https://github.com/apache/arrow/pull/9243 and remove from the kernel.
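
To illustrate why grapheme segmentation is being considered rather than a plain `char` count, the standard-library sketch below compares two encodings of the same visible text. The helper name `scalar_count` is hypothetical; it counts Unicode scalar values, not graphemes, and it does not use the unicode-segmentation crate.

```rust
/// Counts Unicode scalar values (Rust `char`s), which is not the same as
/// counting user-perceived characters.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let composed = "jos\u{e9}"; // é as a single code point
    let decomposed = "jose\u{301}"; // e followed by a combining acute accent

    // Both render as "josé", yet the counts differ.
    assert_eq!(scalar_count(composed), 4);
    assert_eq!(scalar_count(decomposed), 5);
    // The byte counts differ as well.
    assert_eq!(composed.len(), 5);
    assert_eq!(decomposed.len(), 6);
    println!("{} vs {}", scalar_count(composed), scalar_count(decomposed));
}
```

A grapheme-based count would be expected to treat the base letter and its combining accent as one unit.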





[jira] [Created] (ARROW-11298) [Rust][DataFusion] Implement Postgres String Functions

2021-01-17 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11298:
---

 Summary: [Rust][DataFusion] Implement Postgres String Functions
 Key: ARROW-11298
 URL: https://issues.apache.org/jira/browse/ARROW-11298
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


This is a general task to add the Postgres String Functions to DataFusion.

https://www.postgresql.org/docs/13/functions-string.html





[jira] [Created] (ARROW-11102) [Rust][DataFusion] fmt::Debug for ScalarValue(Utf8) is always quoted

2021-01-01 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11102:
---

 Summary: [Rust][DataFusion] fmt::Debug for ScalarValue(Utf8) is 
always quoted
 Key: ARROW-11102
 URL: https://issues.apache.org/jira/browse/ARROW-11102
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


When viewing the plans it is difficult to differentiate between a true NULL 
value and a quoted string like "NULL". 
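
One way to make the two cases distinguishable is sketched below. The `Scalar` enum is a simplified stand-in written for this example, not DataFusion's actual `ScalarValue`, and the output format is an assumption.

```rust
use std::fmt;

// Simplified stand-in for a scalar value; not DataFusion's actual `ScalarValue`.
enum Scalar {
    Utf8(Option<String>),
}

impl fmt::Debug for Scalar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // A real string keeps its quotes, so the text NULL prints as "NULL".
            Scalar::Utf8(Some(s)) => write!(f, "Utf8(\"{}\")", s),
            // A missing value prints without quotes.
            Scalar::Utf8(None) => write!(f, "Utf8(NULL)"),
        }
    }
}

fn main() {
    let null_value = Scalar::Utf8(None);
    let null_text = Scalar::Utf8(Some("NULL".to_string()));
    assert_eq!(format!("{:?}", null_value), "Utf8(NULL)");
    assert_eq!(format!("{:?}", null_text), "Utf8(\"NULL\")");
    println!("{:?} vs {:?}", null_value, null_text);
}
```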





[jira] [Created] (ARROW-11036) [Rust][DataFusion] Allow CSVReader to infer only columns not types

2020-12-25 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11036:
---

 Summary: [Rust][DataFusion] Allow CSVReader to infer only columns 
not types
 Key: ARROW-11036
 URL: https://issues.apache.org/jira/browse/ARROW-11036
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Mike Seddon


Currently the CSVReader will only infer the number of columns if it also 
attempts to infer types. This should be decoupled so that a user can easily 
extract a fully Utf8-typed CSV with the number of columns matching the input 
file. The user can then use CAST() or equivalent to control the parsing.
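
A minimal sketch of the decoupled behaviour is shown below, assuming the column count is taken from the header line and every column is typed as Utf8. The function `infer_utf8_schema` and its return type are hypothetical and do not reflect the CSV reader's API; the naive split also ignores quoted delimiters.

```rust
/// Builds an all-string schema from the header line of a CSV file.
/// Returns (column name, type name) pairs with "Utf8" for every column.
fn infer_utf8_schema(header: &str, delimiter: char) -> Vec<(String, &'static str)> {
    header
        .split(delimiter)
        .map(|name| (name.trim().to_string(), "Utf8"))
        .collect()
}

fn main() {
    let schema = infer_utf8_schema("id,name,price", ',');
    // The column count matches the input; no type inference took place.
    assert_eq!(schema.len(), 3);
    assert!(schema.iter().all(|(_, data_type)| *data_type == "Utf8"));
    println!("{:?}", schema);
}
```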





[jira] [Created] (ARROW-11013) [Rust] CSV Reader cannot handle leading/trailing WhiteSpace

2020-12-22 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-11013:
---

 Summary: [Rust] CSV Reader cannot handle leading/trailing 
WhiteSpace
 Key: ARROW-11013
 URL: https://issues.apache.org/jira/browse/ARROW-11013
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 2.0.0
Reporter: Mike Seddon


Currently the CSV Reader assumes very clean input data which does not have 
things like leading spaces. This means parsing data like the TPC-H 'answers' 
set from the databricks/tpch_dbgen repo does not work (see the sample below).

Spark uses the Univocity parser library, which provides the options 
'ignoreLeadingWhitespace' and 'ignoreTrailingWhitespace'. Equivalent options 
would help fix this issue.

```
l|l|sum_qty|sum_base_price|sum_disc_price|sum_charge|avg_qty|avg_price|avg_disc|count_order   
A|F|37734107.00|56586554400.73|53758257134.87|55909065222.83|25.52|38273.13|0.05|   1478493
```
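
The effect of such options can be sketched as follows. The function `parse_line` and its two flags are hypothetical; the flags are only modelled on the Univocity option names and are not part of the Rust CSV reader.

```rust
/// Splits one delimited line into fields, optionally trimming whitespace
/// at the start and/or end of each field.
fn parse_line(
    line: &str,
    delimiter: char,
    ignore_leading: bool,
    ignore_trailing: bool,
) -> Vec<&str> {
    line.split(delimiter)
        .map(|field| {
            let field = if ignore_leading { field.trim_start() } else { field };
            if ignore_trailing { field.trim_end() } else { field }
        })
        .collect()
}

fn main() {
    let line = "0.05|   1478493";

    // Without trimming, the padded field fails to parse as an integer.
    let raw = parse_line(line, '|', false, false);
    assert!(raw[1].parse::<i64>().is_err());

    // With leading whitespace ignored, the same field parses cleanly.
    let trimmed = parse_line(line, '|', true, true);
    assert_eq!(trimmed[1].parse::<i64>(), Ok(1478493));
    println!("{:?}", trimmed);
}
```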





[jira] [Created] (ARROW-10970) [Rust][DataFusion] Implement Value(Null)

2020-12-18 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10970:
---

 Summary: [Rust][DataFusion] Implement Value(Null)
 Key: ARROW-10970
 URL: https://issues.apache.org/jira/browse/ARROW-10970
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Mike Seddon


We need to add support for the NULL value. 

For example:

```sql
SELECT char_length(NULL) AS char_length_null
```
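
The expected semantics can be sketched with `Option`, where `None` stands in for SQL NULL. The function `char_length` here is a standalone illustration, not the DataFusion implementation.

```rust
/// Character length with SQL NULL semantics: a NULL input yields a NULL output.
fn char_length(input: Option<&str>) -> Option<i32> {
    input.map(|s| s.chars().count() as i32)
}

fn main() {
    // NULL in, NULL out.
    assert_eq!(char_length(None), None);
    // A non-NULL input is measured as usual.
    assert_eq!(char_length(Some("abc")), Some(3));
    println!("{:?}", char_length(None));
}
```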





[jira] [Created] (ARROW-10969) [Rust][DataFusion] Implement basic String Functions

2020-12-18 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10969:
---

 Summary: [Rust][DataFusion] Implement basic String Functions
 Key: ARROW-10969
 URL: https://issues.apache.org/jira/browse/ARROW-10969
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


There are not many ANSI SQL functions currently supported. This ticket is an 
umbrella for increasing the support.





[jira] [Created] (ARROW-10947) [Rust][DataFusion] Refactor UTF8 to Date32 for Performance

2020-12-16 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10947:
---

 Summary: [Rust][DataFusion] Refactor UTF8 to Date32 for Performance
 Key: ARROW-10947
 URL: https://issues.apache.org/jira/browse/ARROW-10947
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


After adding benchmarking capability to the UTF8 to Date32/Date64 CAST 
functions, there was an opportunity to improve the performance.





[jira] [Created] (ARROW-10907) [Rust][DataFusion] Cast UTF8 to Date64 Incorrect

2020-12-14 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10907:
---

 Summary: [Rust][DataFusion] Cast UTF8 to Date64 Incorrect
 Key: ARROW-10907
 URL: https://issues.apache.org/jira/browse/ARROW-10907
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon
 Fix For: 3.0.0


The current UTF8 to Date64 Cast behavior is incorrect in that it parses the 
`%Y-%m-%d` format rather than `%Y-%m-%dT%H:%M:%S`. 





[jira] [Created] (ARROW-10839) [Rust] [DataFusion] Implement BETWEEN Operator

2020-12-07 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10839:
---

 Summary: [Rust] [DataFusion] Implement BETWEEN Operator
 Key: ARROW-10839
 URL: https://issues.apache.org/jira/browse/ARROW-10839
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon
 Fix For: 3.0.0


Three of the 22 TPC-H queries use the *BETWEEN* operator, which is syntactic 
sugar for:
{value} >= {low} AND {value} <= {high}

e.g.
`and l_discount between 0.06 - 0.01 and 0.06 + 0.01`
is equivalent to
`and l_discount >= 0.06 - 0.01 and l_discount <= 0.06 + 0.01`
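
The rewrite can be sketched as a small generic function. `between` is a hypothetical helper for illustration; the point is that both bounds are included.

```rust
/// `value BETWEEN low AND high`, which includes both bounds.
fn between<T: PartialOrd>(value: T, low: T, high: T) -> bool {
    value >= low && value <= high
}

fn main() {
    // Both endpoints are part of the range.
    assert!(between(5, 5, 7));
    assert!(between(7, 5, 7));
    assert!(!between(8, 5, 7));
    // The same function works for floating-point columns such as l_discount.
    assert!(between(0.06, 0.05, 0.07));
    println!("{}", between(6, 5, 7));
}
```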





[jira] [Created] (ARROW-10820) [Rust] [DataFusion] Complete TPC-H Benchmark Queries

2020-12-05 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10820:
---

 Summary: [Rust] [DataFusion] Complete TPC-H Benchmark Queries
 Key: ARROW-10820
 URL: https://issues.apache.org/jira/browse/ARROW-10820
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


Add the rest of the TPC-H queries so they can be easily executed as more SQL 
functionality is implemented.





[jira] [Created] (ARROW-10819) [Rust] [DataFusion] Implement EXISTS operator

2020-12-05 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10819:
---

 Summary: [Rust] [DataFusion] Implement EXISTS operator
 Key: ARROW-10819
 URL: https://issues.apache.org/jira/browse/ARROW-10819
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Mike Seddon


The TPC-H queries use the EXISTS operator, which tests for the existence of 
any record in a subquery. For example:

and *exists* (
select
*
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
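
The semantics of this correlated subquery can be sketched over in-memory rows. The structs below are hypothetical and carry only the columns used in the predicate, with dates held as plain day numbers; this shows the meaning of EXISTS, not a query-engine implementation.

```rust
struct Order {
    o_orderkey: i64,
}

struct LineItem {
    l_orderkey: i64,
    l_commitdate: i32,  // days since some epoch
    l_receiptdate: i32, // days since some epoch
}

/// True if at least one line item matches the subquery predicate for `order`.
fn exists_late_item(order: &Order, lineitem: &[LineItem]) -> bool {
    lineitem
        .iter()
        .any(|l| l.l_orderkey == order.o_orderkey && l.l_commitdate < l.l_receiptdate)
}

fn main() {
    let lineitem = vec![
        LineItem { l_orderkey: 1, l_commitdate: 10, l_receiptdate: 12 },
        LineItem { l_orderkey: 2, l_commitdate: 10, l_receiptdate: 9 },
    ];
    // Order 1 has a line item received after its commit date.
    assert!(exists_late_item(&Order { o_orderkey: 1 }, &lineitem));
    // Order 2 has a line item, but it was not late.
    assert!(!exists_late_item(&Order { o_orderkey: 2 }, &lineitem));
    println!("ok");
}
```

`any` stops at the first matching row, which mirrors the fact that EXISTS only needs one record to be true.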





[jira] [Created] (ARROW-10818) [Rust] [DataFusion] Implement DECIMAL type

2020-12-05 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10818:
---

 Summary: [Rust] [DataFusion] Implement DECIMAL type
 Key: ARROW-10818
 URL: https://issues.apache.org/jira/browse/ARROW-10818
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Mike Seddon


The TPC-H benchmarks correctly specify that all MONEY columns are DECIMAL type 
(precision and scale are not specified). We currently use `DataType::Float64` 
which is much lighter than a true Decimal type.

To be a valid benchmark we need to ensure we support the same precision as the 
reference implementation.
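
The precision concern can be shown with a small standard-library sketch. `Decimal2` is a hypothetical fixed-point type with an assumed scale of 2; it is not an Arrow or DataFusion type.

```rust
use std::ops::Add;

/// Fixed-point money value stored as an integer number of hundredths.
/// The scale of 2 is an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Decimal2(i64);

impl Add for Decimal2 {
    type Output = Decimal2;

    fn add(self, rhs: Decimal2) -> Decimal2 {
        Decimal2(self.0 + rhs.0)
    }
}

fn main() {
    // Binary floating point cannot represent 0.10 or 0.20 exactly,
    // so the sum is not equal to 0.30.
    assert_ne!(0.1_f64 + 0.2_f64, 0.3_f64);

    // The same sum in fixed-point arithmetic is exact: 0.10 + 0.20 = 0.30.
    assert_eq!(Decimal2(10) + Decimal2(20), Decimal2(30));
    println!("{:?}", Decimal2(10) + Decimal2(20));
}
```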





[jira] [Created] (ARROW-10817) [Rust] [DataFusion] Implement inline CAST syntax

2020-12-05 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10817:
---

 Summary: [Rust] [DataFusion] Implement inline CAST syntax
 Key: ARROW-10817
 URL: https://issues.apache.org/jira/browse/ARROW-10817
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Affects Versions: 3.0.0
Reporter: Mike Seddon


Of the 22 TPC-H queries, 11 rely on what I am calling 'inline casting' of dates 
e.g.:

l_shipdate <= *date* '1998-12-01' 

We need to be able to parse this to the correct `CastExpr`.
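
As an illustration of what the cast of such a literal has to produce, the sketch below converts a `YYYY-MM-DD` string into days since the Unix epoch, which is the Date32 representation. The function names are hypothetical, the day-count formula is Howard Hinnant's `days_from_civil` algorithm, and none of this is the DataFusion parser.

```rust
/// Days from 1970-01-01 to the given proleptic Gregorian date
/// (Howard Hinnant's `days_from_civil` algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year.
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of year, 0..=365
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146097 + doe - 719468
}

/// Parses a `YYYY-MM-DD` literal into days since the Unix epoch.
/// Only field ranges are checked, not the day count of each month.
fn parse_date32(literal: &str) -> Option<i32> {
    let mut parts = literal.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    let in_range =
        (0..=9999).contains(&y) && (1..=12).contains(&m) && (1..=31).contains(&d);
    if parts.next().is_some() || !in_range {
        return None;
    }
    Some(days_from_civil(y, m, d) as i32)
}

fn main() {
    assert_eq!(parse_date32("1970-01-01"), Some(0));
    assert_eq!(parse_date32("1998-12-01"), Some(10561));
    assert_eq!(parse_date32("not-a-date"), None);
    println!("{:?}", parse_date32("1998-12-01"));
}
```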





[jira] [Created] (ARROW-10816) [Rust] [DataFusion] Implement INTERVAL

2020-12-05 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10816:
---

 Summary: [Rust] [DataFusion] Implement INTERVAL
 Key: ARROW-10816
 URL: https://issues.apache.org/jira/browse/ARROW-10816
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Affects Versions: 3.0.0
Reporter: Mike Seddon


Of the 22 TPC-H queries, 9 depend on the INTERVAL functionality. e.g. from 
query 1:

l_shipdate <= date '1998-12-01' - *interval* '[DELTA]' day (3)
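
For a day-based interval and a date stored as days since the Unix epoch, the arithmetic reduces to an integer subtraction. The sketch below assumes that representation; `date32_minus_days` is a hypothetical helper, and 90 is used only as an example substitution for `[DELTA]`.

```rust
/// Subtracts a day interval from a date stored as days since the Unix epoch.
/// Returns `None` if the subtraction would overflow.
fn date32_minus_days(date32: i32, interval_days: i32) -> Option<i32> {
    date32.checked_sub(interval_days)
}

fn main() {
    // 1998-12-01 is day 10561 since 1970-01-01.
    let ship_limit = date32_minus_days(10561, 90);
    // 90 days earlier is day 10471, which is 1998-09-02.
    assert_eq!(ship_limit, Some(10471));
    println!("{:?}", ship_limit);
}
```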



