[jira] [Created] (ARROW-11179) [Format] Make comments in fb files friendly to rust doc
meng qingyou created ARROW-11179:

Summary: [Format] Make comments in fb files friendly to rust doc
Key: ARROW-11179
URL: https://issues.apache.org/jira/browse/ARROW-11179
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

Currently, comments in flatbuffer files are copied directly into the generated Rust and C++ source code. That's great, but it causes some problems for `cargo doc`, for example:

* an array element such as abc[1] or a link label such as [smith2017knl] causes a `broken intra doc links` warning
* figures and code blocks are flattened into one line: [example from arrow 2.0.0 doc|https://docs.rs/arrow/2.0.0/arrow/ipc/gen/SparseTensor/struct.SparseTensorIndexCSF.html#method.indptrType]

After flatc generates the code, those ipc files have to be updated manually to fix the above problems. So I suggest updating the flatbuffer comments to address this:

* Escape inline code with ``
* Escape code blocks with ```.
* Remove square brackets from link labels

--
This message was sent by Atlassian Jira (v8.3.4#803005)
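To illustrate the suggestion, here is a minimal sketch of a generated doc comment once inline code is wrapped in backticks and the link label no longer uses square brackets. The function `row_span` and its signature are hypothetical and only there to carry the comment; the comment text is adapted from the SparseTensor comments that trigger the warnings.

```rust
/// The i-th row spans from `indptr[i]` to `indptr[i+1]` in the data.
///
/// See smith2017knl: <http://shaden.io/pub-files/smith2017knl.pdf>
pub fn row_span(indptr: &[usize], i: usize) -> (usize, usize) {
    // Hypothetical body: return the start and end offsets of row `i`.
    (indptr[i], indptr[i + 1])
}

fn main() {
    let indptr = [0, 2, 3];
    let (start, end) = row_span(&indptr, 1);
    println!("row 1 spans {}..{}", start, end);
}
```

Brackets inside a backtick code span are not treated as intra-doc links, so a comment written this way should not need manual fixing after flatc runs.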
[jira] [Created] (ARROW-11178) [Rust] StructArray: handling duplicate field names
meng qingyou created ARROW-11178:

Summary: [Rust] StructArray: handling duplicate field names
Key: ARROW-11178
URL: https://issues.apache.org/jira/browse/ARROW-11178
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

The arrow spec leaves the handling of `duplicate field names` to implementors. The C++ solution: ignore or raise an error. The Java solution: ignore, append, replace or raise an error. Both use ignore as the default. Here are the references:

* [https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc]
* [https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java]

I'm not an expert in databases or data science, but as far as I know, in the traditional RDBMS domain it is unusual to allow duplicate field names. Furthermore, in the data analysis domain, perhaps it is usual to normalize/clean various kinds of bad/dirty data *interactively* with tools like `pandas`?

Back to the problem, I have an example: given duplicate field names A A A B B, a user who knows the actual data MAY choose to replace the first A with the second A, append the third A, and ignore the second B. Or was the duplication just a mistake?

Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll make users aware very quickly".

It is not acceptable to silently append/ignore/replace duplicate fields, producing unexpected results that the user is not aware of at all. If we choose to support `replace`, `ignore` or `append`, we must at least let the user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice.

I suggest we just record this problem here and keep raising an error until it is really necessary to support other solutions.
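The "raise an error" behavior can be sketched as below. This is only an illustration under assumptions: `check_unique_field_names` is a hypothetical helper, not an existing arrow API, and it works on plain field names instead of arrow `Field` values.

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns an error naming the first duplicated field,
/// or Ok(()) if all field names are unique.
fn check_unique_field_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        // `insert` returns false when the name was already present.
        if !seen.insert(*name) {
            return Err(format!("duplicate field name: {}", name));
        }
    }
    Ok(())
}

fn main() {
    // The A A A B B example from above is rejected at the second A.
    match check_unique_field_names(&["A", "A", "A", "B", "B"]) {
        Ok(()) => println!("field names are unique"),
        Err(e) => println!("{}", e),
    }
}
```

Failing on the first duplicate keeps the decision with the user, who can then rename or drop fields explicitly before building the struct.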
[jira] [Created] (ARROW-11170) [Rust] limit unnecessary public
meng qingyou created ARROW-11170:

Summary: [Rust] limit unnecessary public
Key: ARROW-11170
URL: https://issues.apache.org/jira/browse/ARROW-11170
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

Public items make it hard to change signatures and calling conventions. Perhaps we should review them and reduce visibility where it is not needed.
[jira] [Created] (ARROW-11168) [Rust] cargo doc warnings
meng qingyou created ARROW-11168:

Summary: [Rust] cargo doc warnings
Key: ARROW-11168
URL: https://issues.apache.org/jira/browse/ARROW-11168
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

{code:java}
   Compiling parquet_derive v3.0.0-SNAPSHOT (/Users/mqy/tools/data/arrow/rust/parquet_derive)
warning: unresolved link to `array::data::ArrayData`
  --> arrow/src/lib.rs:73:64
   |
73 | //! Data in [`Array`](array::Array) is stored in [`ArrayData`](array::data::ArrayData), that in turn
   | ^^ no item named `data` in module `array`
   |
   = note: `#[warn(broken_intra_doc_links)]` on by default

warning: unresolved link to `array::data::ArrayData`
  --> arrow/src/lib.rs:74:44
   |
74 | //! is a collection of other [`ArrayData`](array::data::ArrayData) and [`Buffers`](buffer::Buffer).
   | ^^ no item named `data` in module `array`

warning: unresolved link to `0`
  --> arrow/src/array/array_primitive.rs:52:67
   |
52 | /// raw_values must have a value equivalent to data.buffers()[0].raw_data()
   | ^ no item named `0` in scope
   |
   = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `Datatype`
  --> arrow/src/ffi.rs:24:115
   |
24 | //! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
   | no item named `Datatype` in scope
   |
   = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `4`
   --> arrow/src/ipc/gen/Message.rs:339:63
    |
339 | /// For example, a List with values [[1, 2, 3], null, [4], [5, 6], null]
    | ^ no item named `4` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `0`
    --> arrow/src/ipc/gen/Schema.rs:1597:15
     |
1597 | /// - child[0] entries: Struct
     | ^ no item named `0` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `0`
    --> arrow/src/ipc/gen/Schema.rs:1598:17
     |
1598 | /// - child[0] key: K
     | ^ no item named `0` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `1`
    --> arrow/src/ipc/gen/Schema.rs:1599:17
     |
1599 | /// - child[1] value: V
     | ^ no item named `1` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `offset`
    --> arrow/src/ipc/gen/Schema.rs:1706:28
     |
1706 | /// for each child typeIds[offset] is the id used in the type vector
     | ^^ no item named `offset` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `i`
   --> arrow/src/ipc/gen/SparseTensor.rs:521:40
    |
521 | /// The i-th row spans from indptr[i] to indptr[i+1] in the data.
    | ^ no item named `i` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `smith2017knl`
   --> arrow/src/ipc/gen/SparseTensor.rs:753:14
    |
753 | /// See [smith2017knl]: http://shaden.io/pub-files/smith2017knl.pdf
    | no item named `smith2017knl` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `i`
   --> arrow/src/ipc/gen/SparseTensor.rs:795:75
    |
795 | /// indptrBuffers. A pair of consecutive values at indptrBuffers[dim][i]
    | ^ no item named `i` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `i`
   --> arrow/src/ipc/gen/SparseTensor.rs:797:73
    |
797 | /// indicesBuffers[dim + 1] who are children of indicesBuffers[dim][i] node.
    | ^ no item named `i` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: 13 warnings emitted

 Documenting futures v0.3.8
 Documenting tower-balance v0.3.0
 Documenting tower v0.3.1
 Documenting parquet v3.0.0-SNAPSHOT (/Users/mqy/tools/data/arrow/ru
[jira] [Created] (ARROW-11160) [Rust] pitfall in xxx_equal functions
meng qingyou created ARROW-11160:

Summary: [Rust] pitfall in xxx_equal functions
Key: ARROW-11160
URL: https://issues.apache.org/jira/browse/ARROW-11160
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

The `pub(super) fn xxx_equal` functions in rust/src/array/equal/ give wrong results if called directly instead of through `equal`. Taking boolean.rs as an example:

```
lhs_is_null || (lhs_is_null == rhs_is_null) && equal_bits(...)
```

The bug does not show up in tests at present because `equal` short-circuits this condition (zero nulls vs. some nulls).

To verify the problem, add a new test in `mod.rs`:

```
#[test]
fn test_boolean_equal_bug() {
    let a = BooleanArray::from(vec![None]).data();
    let b = BooleanArray::from(vec![Some(false)]).data();
    let lhs = a.as_ref();
    let rhs = b.as_ref();
    let lhs_nulls = lhs.null_buffer();
    let rhs_nulls = rhs.null_buffer();
    assert_eq!(
        boolean::boolean_equal(lhs, rhs, lhs_nulls, rhs_nulls, 0, 0, lhs.len()),
        false
    );
}
```
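The pitfall can be reproduced with plain booleans, without any arrow types. In the sketch below, `slot_equal_as_written` mirrors the quoted expression with `bits_equal` standing in for `equal_bits(...)`, and `slot_equal_strict` is one possible stricter variant. Both function names are hypothetical, and the strict variant is an assumption about the intended semantics, not the project's actual fix.

```rust
/// Mirrors the quoted expression. Because `&&` binds tighter than `||`,
/// a null lhs makes the whole expression true, whatever rhs is.
fn slot_equal_as_written(lhs_is_null: bool, rhs_is_null: bool, bits_equal: bool) -> bool {
    lhs_is_null || (lhs_is_null == rhs_is_null) && bits_equal
}

/// Assumed intent: nullness must match first; values are compared only
/// when both slots are non-null.
fn slot_equal_strict(lhs_is_null: bool, rhs_is_null: bool, bits_equal: bool) -> bool {
    lhs_is_null == rhs_is_null && (lhs_is_null || bits_equal)
}

fn main() {
    // lhs = None, rhs = Some(false): the slots should not compare equal.
    println!("as written: {}", slot_equal_as_written(true, false, false));
    println!("strict:     {}", slot_equal_strict(true, false, false));
}
```

The case lhs = null, rhs = non-null is exactly the one exercised by the test above, which is why it only surfaces when the function is called without the `equal` wrapper.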
[jira] [Created] (ARROW-11131) [Rust] Optimize performance of bool_equal
meng qingyou created ARROW-11131:

Summary: [Rust] Optimize performance of bool_equal
Key: ARROW-11131
URL: https://issues.apache.org/jira/browse/ARROW-11131
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou
[jira] [Created] (ARROW-11101) [Rust] update git pre-commit hook
meng qingyou created ARROW-11101:

Summary: [Rust] update git pre-commit hook
Key: ARROW-11101
URL: https://issues.apache.org/jira/browse/ARROW-11101
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

Currently, the client-side git pre-commit hook checks and runs `cargo fmt` with the stable version, but the CI check may occasionally fail with the nightly version even if the code has been formatted with stable.

It seems that this problem can be resolved by running `cargo +nightly fmt` before `cargo +stable fmt`. Thus `pre-commit.sh` should be updated in this way.
[jira] [Created] (ARROW-11034) [Rust] rustfmt cleanup
meng qingyou created ARROW-11034:

Summary: [Rust] rustfmt cleanup
Key: ARROW-11034
URL: https://issues.apache.org/jira/browse/ARROW-11034
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

We have basically switched to stable Rust after #9698 and #8930. There is an annoying problem: cargo +stable fmt prints a warning like this:

*Warning: can't set `ignore ... unstable features are only available in nightly channel.*

Since "arrow/src/ipc/gen" has been formatted with stable, we can remove the ignore list from rustfmt.toml.
[jira] [Created] (ARROW-10967) Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional
meng qingyou created ARROW-10967:

Summary: Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional
Key: ARROW-10967
URL: https://issues.apache.org/jira/browse/ARROW-10967
Project: Apache Arrow
Issue Type: Test
Reporter: meng qingyou

Facts/problems:

# Two vars, *ARROW_TEST_DATA* and *PARQUET_TEST_DATA*, are required to be set for running tests, benchmarks and examples.
# In total, eighteen .rs files use these environment variables.
# The typical usage looks like this: ``` let testdata = std::env::var("PARQUET_TEST_DATA").expect("PARQUET_TEST_DATA not defined");```
# Somebody tried to assemble the test data directories by appending a relative dir to the *current dir* of the running process, but this depends heavily on the actual current dir (for example, rust/, rust/datafusion, etc.).

Here is my solution. Suppose:

# *current_dir* is ALWAYS inside the *git workspace dir*
# We know the path of a dir X *relative to the git workspace dir*

Then getting the absolute path of X reduces to getting the absolute path *TOP* of the *git workspace dir*.

Given the *current dir* (inside the *git workspace dir*), we visit that dir and its parents and check whether ".git" (file or dir) exists. The first dir that contains ".git" SHOULD be the *git workspace dir*.
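The parent-directory walk described above can be sketched with the standard library only. `find_git_workspace` is a hypothetical helper name, and the `testing/data` relative path in `main` is an assumption used for illustration.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: walks from `start` up through its parents and
/// returns the first directory that contains a ".git" entry (file or dir).
fn find_git_workspace(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

fn main() {
    let cwd = std::env::current_dir().expect("cannot read current dir");
    match find_git_workspace(&cwd) {
        // "testing/data" is an assumed location relative to the workspace top.
        Some(top) => println!("test data dir: {}", top.join("testing/data").display()),
        None => println!("not inside a git workspace"),
    }
}
```

With such a helper the environment variables could serve as optional overrides, with the computed path as the fallback; that policy is a design choice left open here.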
[jira] [Created] (ARROW-10952) [Rust] Add pre-commit hook
meng qingyou created ARROW-10952:

Summary: [Rust] Add pre-commit hook
Key: ARROW-10952
URL: https://issues.apache.org/jira/browse/ARROW-10952
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou

If a commit contains only Rust formatting fixes, it is very sad to wait for the slow CI checks to finish, and it may block the checking of other PRs.

So the PR adds a git pre-commit hook file, which runs *cargo fmt --check*.
[jira] [Created] (ARROW-10938) [Arrow] upgrade dependency "flatbuffers" to 0.8.0
meng qingyou created ARROW-10938:

Summary: [Arrow] upgrade dependency "flatbuffers" to 0.8.0
Key: ARROW-10938
URL: https://issues.apache.org/jira/browse/ARROW-10938
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: meng qingyou

[flatbuffers](https://crates.io/crates/flatbuffers) 0.8.0 was released on Dec 10, 2020, with some notable changes:

* verifier
* common rust traits to FlatBufferBuilder
* new VectorIter
* add FlatBufferBuilder::force_defaults API
* Optional Scalars
* up to 2018 edition
* possible performance speedup
* ...

and minor breaking changes to some APIs, for example: remove "get_", return Result.

Let's try this version.
[jira] [Created] (ARROW-10915) Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs
meng qingyou created ARROW-10915:

Summary: Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs
Key: ARROW-10915
URL: https://issues.apache.org/jira/browse/ARROW-10915
Project: Apache Arrow
Issue Type: Test
Components: Rust
Reporter: meng qingyou

In rust/README.md, both *ARROW_TEST_DATA* and *PARQUET_TEST_DATA* are set as relative paths. The problem is that we MAY have to reset them back and forth when moving between the top directory and subdirectories, which is annoying.

So the obvious solution is to set the env vars to absolute dirs.
[jira] [Created] (ARROW-10832) Evaluate latest snapshot flatc
meng qingyou created ARROW-10832:

Summary: Evaluate latest snapshot flatc
Key: ARROW-10832
URL: https://issues.apache.org/jira/browse/ARROW-10832
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: meng qingyou

Flatc 1.12.0 generates incorrect code: [https://github.com/google/flatbuffers/issues/5849|http://example.com/]

Let's try a snapshot flatc built from the flatbuffers source: it looks good with just a few adaptations.