melgenek commented on issue #4462: URL: https://github.com/apache/arrow-datafusion/issues/4462#issuecomment-1374123119
Hi there! I am new to Datafusion and would like to learn more about it. And I found this issue, which I'd like to tackle. I did some investigation regarding sqllogictest, sqllogictest-rs and datafusion. And I now see some ways to implement a solution. My main question is: what level of compatibility between Postgres and Datafusion would you like to check? Currently, the tests in `integration-test` compare values within an error using [np.testing.assert_allclose](https://github.com/apache/arrow-datafusion/blob/master/integration-tests/test_psql_parity.py#L91). But while values are close, they are not always fully equal in terms of representation. For example, the test [simple_window_partition_aggregation.sql](https://github.com/apache/arrow-datafusion/blob/master/integration-tests/sqls/simple_window_partition_aggregation.sql) result has an `avg` value which Postgres represents as `-13.8571428571428571`, but Datafusion does as `-13.857142857142858`. The values are close, but not in a text form representation. ------------------------- Regarding implementation I imagine some ways: **1.** The very basic way without changes to `sqllogictest-rs` would be to define sqllogic tests with results explicitly. Each test would be created twice: once for Postgres, and once for Datafusion. Tests would differ For example, ``` onlyif postgres query R select 2.0; ---- 2.0 onlyif datafusion query R select 2.0; ---- 2 ``` Pros: - each output could be different. For example, as in the example above the values are the same, though their representation is different. - this doesn't change the idea of sqllogictest, and uses out-of-the-box features Cons: - all outputs have to be created manually. - in many cases there would be just duplicate queries and duplicate sets of results. **2.** Another option that uses existing features is: - create a `test.slt` test without outputs - run Postgress to complete the results. Do it the same way the [--complete](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md?plain=1#L51) parameter works now. Output this file into a `test.tmp.slt` - Run Datafusion over `test.tmp.slt` Pros: - no new `sqllogictest-rs` features required Cons: - there only way to compare results is to compare their text representations **3.** Use [query labels](https://duckdb.org/dev/sqllogictest/result_verification#query-labels) to compare of labeled queries, but run multiple database engines at the same time. The queries would look like ``` onlyif postgres query R nosort q42 select 2.0; onlyif datafusion query R nosort q42 select 2.0; ``` Pros: - results are generated by sqllogictest. There is no need to define them manually - there is no new sqllogictest syntax features - the tool would be pretty generic. It could be reused for different database combinations Cons: - `sqllogictest-rs` currently doesn't support comparison of labeled queries. This has to be implemented. - this is not really an intended use of `onlyif`. The sqllogictest in general and its implementations suppose that there is one database at a time. The same suite of tests can run on many different databases, but not at the same time. In my suggesting, we suppose that there are 2 databases. This fast is not necessarily bad but stretches the current idea of sqllogictest a little - there is no way to define comparison rules, like in option 2 **4.** Create an `sqllogictest::Runner` with a custom [Validator](https://github.com/risinglightdb/sqllogictest-rs/blob/6b2f9810dfcee39c64be961269ae41b11fbc706a/sqllogictest/src/runner.rs#L397). The `Validator` is currently defined as ``` pub type Validator = fn(actual: &[Vec<String>], expected: &[String]) -> bool; ``` If `Validator` is updated to contain the original query [record](https://github.com/risinglightdb/sqllogictest-rs/blob/6b2f9810dfcee39c64be961269ae41b11fbc706a/sqllogictest/src/parser.rs#L80-L108), then implementation can actully run Postgres, and disregard whatever expected value is passed to the validator. ``` pub type Validator = fn(record: Record, actual: &[Vec<String>], expected: &[String]) -> bool; ... fn postgres_validator(record: Record, actual: &[Vec<String>], expected: &[String]) { // let postgres_result = call_postgres(record.sql); // postgres_result == actual } ``` Pros: - relatively small number of changes to `sqllogictest-rs` - provides a more elaborate `Validator` that could be used for other purposes Cons: - comparing float values with error would still require more changes. `sqllogictest-rs` treats results in `actual`/`expected` field as `String` now. So there is no way to do number rounding. Another enum, for example, [DBOutput](https://github.com/risinglightdb/sqllogictest-rs/blob/6b2f9810dfcee39c64be961269ae41b11fbc706a/sqllogictest/src/runner.rs#L74) has to be introduced instead of `String` to carry type information. See [this issue](https://github.com/risinglightdb/sqllogictest-rs/issues/93) for a discussion of non-text comparisons. - value Validator would likely need to be merged with a [type validator](https://github.com/risinglightdb/sqllogictest-rs/issues/36#issuecomment-1341725683). Otherwise, there could be a need to call Postgress twice: in value validation, and in type validation **5.** Introduce a way to run a statement/query on multiple databases at the same time. This can be either achieved by using labels ``` query R nosort postgres datafusion select 2.0; ``` or the way that is similar to what was proposed in the description of this ``` # compare(datafusion, postgres) query R select 2.0; ``` Pros: - this is a new generic feature that would allow having multiple databases at the same time Cons: - `sqllogictest-rs` has to be updated to support having multiple runners - `# compare(datafusion, postgres)` would require updating not only execution, but also parsing logic ------------------------- To summarize my understanding of `sqllogictest`: - tests check text representations of results - if comparing of same, but not exactly equal value is required, then they have to be transformed to the same text representation. For example, comparing floating numbers could require using sql functions like `round` or `truncate`. In order to demonstrate how Postgres compatibility could look like using `sqllogictest` I created an implementation based on the 2nd approach https://github.com/apache/arrow-datafusion/pull/4834. I would be glad to hear some feedback regarding my pr and thoughts about which way to proceed. Thank you in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
