[jira] [Created] (ARROW-11179) [Format] Make comments in fb files friendly to rust doc

2021-01-07 Thread meng qingyou (Jira)
meng qingyou created ARROW-11179:


 Summary: [Format] Make comments in fb files friendly to rust doc
 Key: ARROW-11179
 URL: https://issues.apache.org/jira/browse/ARROW-11179
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


Currently, comments in flatbuffer files are copied verbatim into the generated 
Rust and C++ source code.

That's great, but it causes some problems for `cargo doc`, for example:
 * an array element such as abc[1] or a link label such as [smith2017knl] 
causes a `broken intra doc links` warning
 * figures/code blocks are flattened into one line [example from arrow 
2.0.0 doc|https://docs.rs/arrow/2.0.0/arrow/ipc/gen/SparseTensor/struct.SparseTensorIndexCSF.html#method.indptrType]

After flatc generates the code, those ipc files have to be updated manually to 
fix the above problems.

So I suggest updating the flatbuffer comments to address this problem:
 * Escape inline code with ``
 * Escape code blocks with ```.
 * Remove square brackets from link labels
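
As a rough illustration of the kind of cleanup involved, here is a minimal sketch that backslash-escapes square brackets in a comment line, the alternative rustdoc itself suggests in its help text. It is not part of flatc or Arrow; the function name `escape_brackets` is hypothetical, and it assumes backticks on a line are balanced.

```rust
/// Backslash-escape `[` and `]` in one comment line so rustdoc does not
/// treat them as intra-doc links. Text inside backtick-delimited inline
/// code is left untouched. (Hypothetical helper, for illustration only.)
fn escape_brackets(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    let mut in_code = false; // true while between a pair of backticks
    for ch in line.chars() {
        match ch {
            '`' => {
                in_code = !in_code;
                out.push(ch);
            }
            '[' | ']' if !in_code => {
                out.push('\\');
                out.push(ch);
            }
            _ => out.push(ch),
        }
    }
    out
}
```

Calling `escape_brackets("indptr[i] to indptr[i+1]")` yields `indptr\[i\] to indptr\[i+1\]`. Wrapping such expressions in backticks in the flatbuffer comments, as proposed above, avoids the need for any post-processing.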



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11178) [Rust] StructArray: handling duplicate field names

2021-01-07 Thread meng qingyou (Jira)
meng qingyou created ARROW-11178:


 Summary: [Rust] StructArray: handling duplicate field names
 Key: ARROW-11178
 URL: https://issues.apache.org/jira/browse/ARROW-11178
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


The Arrow spec leaves the handling of `duplicate field names` to implementors.

The C++ solution: ignore or raise an error. The Java solution: ignore, append, 
replace or raise an error. Both use ignore as the default. Here are the references:
 * 
[https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc]
 * 
[https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java]

I'm not an expert in databases or data science, but as far as I know, in the 
traditional RDBMS domain it's unusual to allow duplicate field names. 
Furthermore, in the data analysis domain, perhaps it's usual to normalize/clean 
various kinds of bad/dirty data *interactively* with tools like `pandas`?

Back to the problem, I have an example: given duplicate field names A A A B B, 
a user who knows the actual data MAY choose to replace the first A with the 
second A, append the third A, and ignore the second B. Or was the duplication 
just a mistake?

Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll 
make users aware very quickly". It is not acceptable to silently 
append/ignore/replace duplicate fields, producing unexpected results that the 
user is not aware of at all.

If we choose to support `replace`, `ignore` or `append`, we must at least let 
the user control the exact behavior. For IPC data, perhaps custom metadata (for 
file, message and field) is the only choice. I suggest we just record this 
problem here and keep raising an error until it's really necessary to support 
other solutions.
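
The "raise an error" behavior can be sketched as follows. This is a standalone illustration, not the Arrow `StructArray` API; the function name `check_unique_field_names` and the error message format are assumptions.

```rust
use std::collections::HashSet;

/// Return an error naming the first field name that appears more than once.
/// (Hypothetical helper, for illustration only.)
fn check_unique_field_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        // `insert` returns false when the value was already present.
        if !seen.insert(*name) {
            return Err(format!("duplicate field name: {}", name));
        }
    }
    Ok(())
}
```

With the example above, `check_unique_field_names(&["A", "A", "A", "B", "B"])` fails at the second A, so the user learns about the duplication immediately.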





[jira] [Created] (ARROW-11170) [Rust] limit unnecessary public

2021-01-07 Thread meng qingyou (Jira)
meng qingyou created ARROW-11170:


 Summary: [Rust] limit unnecessary public
 Key: ARROW-11170
 URL: https://issues.apache.org/jira/browse/ARROW-11170
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


Public items make it hard to change signatures and calling conventions. 
Perhaps we should audit them and reduce visibility where it is not needed.





[jira] [Created] (ARROW-11168) [Rust] cargo doc warnings

2021-01-07 Thread meng qingyou (Jira)
meng qingyou created ARROW-11168:


 Summary: [Rust] cargo doc warnings
 Key: ARROW-11168
 URL: https://issues.apache.org/jira/browse/ARROW-11168
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


{code:java}
   Compiling parquet_derive v3.0.0-SNAPSHOT (/Users/mqy/tools/data/arrow/rust/parquet_derive)
warning: unresolved link to `array::data::ArrayData`
  --> arrow/src/lib.rs:73:64
   |
73 | //! Data in [`Array`](array::Array) is stored in [`ArrayData`](array::data::ArrayData), that in turn
   | ^^ no item named `data` in module `array`
   |
   = note: `#[warn(broken_intra_doc_links)]` on by default

warning: unresolved link to `array::data::ArrayData`
  --> arrow/src/lib.rs:74:44
   |
74 | //! is a collection of other [`ArrayData`](array::data::ArrayData) and [`Buffers`](buffer::Buffer).
   | ^^ no item named `data` in module `array`

warning: unresolved link to `0`
  --> arrow/src/array/array_primitive.rs:52:67
   |
52 | /// raw_values must have a value equivalent to data.buffers()[0].raw_data()
   | ^ no item named `0` in scope
   |
   = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `Datatype`
  --> arrow/src/ffi.rs:24:115
   |
24 | //! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
   | no item named `Datatype` in scope
   |
   = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `4`
   --> arrow/src/ipc/gen/Message.rs:339:63
    |
339 | /// For example, a List with values [[1, 2, 3], null, [4], [5, 6], null]
    | ^ no item named `4` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `0`
    --> arrow/src/ipc/gen/Schema.rs:1597:15
     |
1597 | ///   - child[0] entries: Struct
     | ^ no item named `0` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `0`
    --> arrow/src/ipc/gen/Schema.rs:1598:17
     |
1598 | /// - child[0] key: K
     | ^ no item named `0` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `1`
    --> arrow/src/ipc/gen/Schema.rs:1599:17
     |
1599 | /// - child[1] value: V
     | ^ no item named `1` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `offset`
    --> arrow/src/ipc/gen/Schema.rs:1706:28
     |
1706 | /// for each child typeIds[offset] is the id used in the type vector
     | ^^ no item named `offset` in scope
     |
     = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `i`
   --> arrow/src/ipc/gen/SparseTensor.rs:521:40
    |
521 | /// The i-th row spans from indptr[i] to indptr[i+1] in the data.
    | ^ no item named `i` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `smith2017knl`
   --> arrow/src/ipc/gen/SparseTensor.rs:753:14
    |
753 | /// See [smith2017knl]: http://shaden.io/pub-files/smith2017knl.pdf
    | no item named `smith2017knl` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `i`
   --> arrow/src/ipc/gen/SparseTensor.rs:795:75
    |
795 | /// indptrBuffers. A pair of consecutive values at indptrBuffers[dim][i]
    | ^ no item named `i` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: unresolved link to `i`
   --> arrow/src/ipc/gen/SparseTensor.rs:797:73
    |
797 | /// indicesBuffers[dim + 1] who are children of indicesBuffers[dim][i] node.
    | ^ no item named `i` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`

warning: 13 warnings emitted

 Documenting futures v0.3.8
 Documenting tower-balance v0.3.0
 Documenting tower v0.3.1
 Documenting parquet v3.0.0-SNAPSHOT (/Users/mqy/tools/data/arrow/ru

[jira] [Created] (ARROW-11160) [Rust] pitfall in xxx_equal functions

2021-01-07 Thread meng qingyou (Jira)
meng qingyou created ARROW-11160:


 Summary: [Rust] pitfall in xxx_equal functions
 Key: ARROW-11160
 URL: https://issues.apache.org/jira/browse/ARROW-11160
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


Functions `pub(super) fn xxx_equal` in rust/src/array/equal/ will fail if 
called directly instead of through `equal`. Take boolean.rs as an example:

```
lhs_is_null
|| (lhs_is_null == rhs_is_null)
&& equal_bits(...)
```
The bug does not show up in tests at present because `equal` short-circuits the 
condition: zero nulls vs. nulls.

To verify this problem, add a new test in `mod.rs`:
```
#[test]
fn test_boolean_equal_bug() {
    let a = BooleanArray::from(vec![None]).data();
    let b = BooleanArray::from(vec![Some(false)]).data();

    let lhs = a.as_ref();
    let rhs = b.as_ref();
    let lhs_nulls = lhs.null_buffer();
    let rhs_nulls = rhs.null_buffer();

    assert_eq!(
        boolean::boolean_equal(lhs, rhs, lhs_nulls, rhs_nulls, 0, 0, lhs.len()),
        false
    );
}
```
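
The reason the condition misbehaves can be shown in isolation. In Rust, `&&` binds tighter than `||`, so the expression parses as `lhs_is_null || ((lhs_is_null == rhs_is_null) && equal_bits(...))` and returns true whenever the left slot is null. The sketch below is standalone: `bits_equal` is a hypothetical stand-in for the result of `equal_bits(...)`, and the second function is only one possible null-aware comparison, not necessarily the fix adopted in Arrow.

```rust
/// Mirrors the shape of the condition quoted from boolean.rs.
/// `bits_equal` is a hypothetical stand-in for `equal_bits(...)`.
fn slot_equal_as_written(lhs_is_null: bool, rhs_is_null: bool, bits_equal: bool) -> bool {
    // Parsed as: lhs_is_null || ((lhs_is_null == rhs_is_null) && bits_equal)
    lhs_is_null || (lhs_is_null == rhs_is_null) && bits_equal
}

/// One possible null-aware comparison (an assumption for illustration):
/// slots are equal when both are null, or both are valid with equal bits.
fn slot_equal_null_aware(lhs_is_null: bool, rhs_is_null: bool, bits_equal: bool) -> bool {
    (lhs_is_null && rhs_is_null) || (!lhs_is_null && !rhs_is_null && bits_equal)
}
```

For the test case above (left slot null, right slot `Some(false)`), the first function reports equality while the second does not.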





[jira] [Created] (ARROW-11131) [Rust] Optimize performance of bool_equal

2021-01-05 Thread meng qingyou (Jira)
meng qingyou created ARROW-11131:


 Summary: [Rust] Optimize performance of bool_equal
 Key: ARROW-11131
 URL: https://issues.apache.org/jira/browse/ARROW-11131
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou








[jira] [Created] (ARROW-11101) [Rust] udate git pre-commit hook

2021-01-01 Thread meng qingyou (Jira)
meng qingyou created ARROW-11101:


 Summary: [Rust] udate git pre-commit hook
 Key: ARROW-11101
 URL: https://issues.apache.org/jira/browse/ARROW-11101
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


Currently, the client-side git pre-commit hook checks and runs `cargo fmt` with 
the stable toolchain, but the CI check may occasionally fail with nightly even 
if the code has been formatted with stable.

It seems that this problem can be resolved by running `cargo +nightly fmt` 
before `cargo +stable fmt`. Thus `pre-commit.sh` should be updated in this way.





[jira] [Created] (ARROW-11034) [Rust] rustfmt cleanup

2020-12-25 Thread meng qingyou (Jira)
meng qingyou created ARROW-11034:


 Summary: [Rust] rustfmt cleanup
 Key: ARROW-11034
 URL: https://issues.apache.org/jira/browse/ARROW-11034
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


We have basically switched to stable Rust after #9698 and #8930.

There is an annoying problem: cargo +stable fmt prints a warning like this:

*Warning: can't set `ignore ... unstable features are only available in nightly 
channel.*

Since "arrow/src/ipc/gen" has been formatted with stable, we can remove the 
ignore list from rustfmt.toml.





[jira] [Created] (ARROW-10967) Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional

2020-12-18 Thread meng qingyou (Jira)
meng qingyou created ARROW-10967:


 Summary: Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA 
optional
 Key: ARROW-10967
 URL: https://issues.apache.org/jira/browse/ARROW-10967
 Project: Apache Arrow
  Issue Type: Test
Reporter: meng qingyou


Facts/problems:
 # Two vars, *ARROW_TEST_DATA* and *PARQUET_TEST_DATA*, are required to be set 
for running tests, benchmarks and examples.
 # In total, eighteen .rs files use these environment variables.
 # The typical usage looks like this: ```
let testdata =
std::env::var("PARQUET_TEST_DATA").expect("PARQUET_TEST_DATA not defined");```
 # Some code assembles the test data directories by appending a relative dir to 
the *current dir* of the running process, but this depends heavily on the actual 
current dir (for example, rust/, rust/datafusion, etc.).

Here is my solution:

Suppose:
 # *current_dir* is ALWAYS inside the *git workspace dir*
 # We know a dir *X relative to git workspace dir*

Then getting the absolute dir of X reduces to getting the absolute dir *TOP* of 
the *git workspace dir*.

Given the *current dir* (inside the *git workspace dir*), we visit the dir and 
its parents and check whether ".git" (file or dir) exists. The first dir that 
contains ".git" SHOULD be the *git workspace dir*.
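
The lookup described above can be sketched with the standard library alone. The function names `find_git_workspace_dir` and `resolve_from_workspace` are hypothetical, and the sketch assumes the starting path is absolute (for example, the result of `std::env::current_dir()`).

```rust
use std::path::{Path, PathBuf};

/// Walk from `start` up through its parents and return the first directory
/// that contains a `.git` entry (file or directory), if any.
fn find_git_workspace_dir(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Resolve a path given relative to the workspace top, such as a test data dir.
fn resolve_from_workspace(start: &Path, relative: &str) -> Option<PathBuf> {
    find_git_workspace_dir(start).map(|top| top.join(relative))
}
```

A test helper could fall back to this when the environment variable is not set, which would make the variables optional while still letting users override the location.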

 





[jira] [Created] (ARROW-10952) [Rust] Add pre-commit hook

2020-12-17 Thread meng qingyou (Jira)
meng qingyou created ARROW-10952:


 Summary: [Rust] Add pre-commit hook
 Key: ARROW-10952
 URL: https://issues.apache.org/jira/browse/ARROW-10952
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: meng qingyou


If a commit contains only fixes for Rust formatting, it is frustrating to wait 
for the slow CI check to finish, and it may block the checking of other PRs.

So the PR adds a git pre-commit hook file, which runs *cargo fmt --check*





[jira] [Created] (ARROW-10938) [Arrow] upgrade dependency "flatbuffers" to 0.8.0

2020-12-16 Thread meng qingyou (Jira)
meng qingyou created ARROW-10938:


 Summary: [Arrow] upgrade dependency "flatbuffers" to 0.8.0
 Key: ARROW-10938
 URL: https://issues.apache.org/jira/browse/ARROW-10938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: meng qingyou


[flatbuffers](https://crates.io/crates/flatbuffers) 0.8.0 was released on Dec 
10, 2020, with some notable changes:
 * verifier
 * common Rust traits for FlatBufferBuilder
 * new VectorIter
 * add FlatBufferBuilder::force_defaults API
 * Optional Scalars
 * up to 2018 edition
 * possible performance speedup
 * ... and minor breaking changes to some APIs, for example: remove "get_", 
return Result.

Let's try this version.





[jira] [Created] (ARROW-10915) Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs

2020-12-15 Thread meng qingyou (Jira)
meng qingyou created ARROW-10915:


 Summary: Make ARROW_TEST_DATA and PARQUET_TEST_DATA absolute dirs
 Key: ARROW-10915
 URL: https://issues.apache.org/jira/browse/ARROW-10915
 Project: Apache Arrow
  Issue Type: Test
  Components: Rust
Reporter: meng qingyou


In rust/README.md, both *ARROW_TEST_DATA* and *PARQUET_TEST_DATA* are set as 
relative paths. The problem is that we MAY have to reset them back and forth 
when moving between the top directory and subdirectories, which is annoying. 
So the obvious solution is to set the env vars as absolute dirs.





[jira] [Created] (ARROW-10832) Evaluate latest snapshot flatc

2020-12-07 Thread meng qingyou (Jira)
meng qingyou created ARROW-10832:


 Summary: Evaluate latest snapshot flatc
 Key: ARROW-10832
 URL: https://issues.apache.org/jira/browse/ARROW-10832
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: meng qingyou


Flatc 1.12.0 generates incorrect code: 
[https://github.com/google/flatbuffers/issues/5849]

Let's try a snapshot flatc built from the flatbuffers source: it looks good with 
just a few adaptations.


