carols10cents opened a new issue #589:
URL: https://github.com/apache/arrow-rs/issues/589
**Describe the bug**
I check my home directory into git. My home directory contains `.cargo`, my
`CARGO_HOME` directory. When I write a Parquet file, its `FileMetaData`
contains:
```
created_by: Some(
    "parquet-rs version 5.0.0 (build 3ef76a677716df403a13964a58351abe37c1754d)",
),
```
That SHA is of a commit in my home directory, not in Parquet, and not in the
project using Parquet.
I have a test in the project that verifies the size of the Parquet file
data, and the test was failing for me because the content was 49 bytes too
large, the exact size of the extra content above. I verified that in CI, the
test passes, and the `FileMetaData` under test contains:
```
created_by: Some(
"parquet-rs version 5.0.0",
),
```
**To Reproduce**
- Check your home directory into git, or alternatively set `CARGO_HOME` to a
directory in a git repository.
- Generate a Parquet file and check the metadata.
- Observe that `created_by` contains a hash from the git repository that
`CARGO_HOME` is in.
I'm not sure if it's going to be possible to create a failing test for this
given the environmental aspect... [the current
test](https://github.com/apache/arrow-rs/blob/30f1b1fe8681914d0bd8fc5062338aa78f35b1f1/parquet/src/file/properties.rs#L525)
only checks that the `created_by` value is the value of the
`PARQUET_CREATED_BY` environment variable, but the problem is what gets into the
`PARQUET_CREATED_BY` environment variable in the first place.
**Expected behavior**
I expected to get the exact same Parquet file content whether my home
directory is checked into Git or not 🤣
**Additional context**
The `PARQUET_CREATED_BY` environment variable is set [in the build
script](https://github.com/apache/arrow-rs/blob/30f1b1fe8681914d0bd8fc5062338aa78f35b1f1/parquet/build.rs#L24-L27)
if `git rev-parse HEAD` returns a value. Considering this is only getting set
if you have a non-standard setup like I do, I think this should just be removed
entirely. I'm going to prepare a PR for discussion with this solution :)
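
To make the 49-byte difference concrete, here is a minimal sketch of how the string appears to be assembled, inferred only from the two `created_by` values shown above. The function names `created_by_string` and `extra_bytes` are hypothetical and are not taken from `build.rs`; the real build script may be structured differently.

```rust
/// Hypothetical helper: builds the `created_by` string in the shape seen in
/// this issue. `git_hash` stands in for the output of `git rev-parse HEAD`,
/// which is `None` when the crate sources are not inside any git repository.
fn created_by_string(version: &str, git_hash: Option<&str>) -> String {
    let mut created_by = format!("parquet-rs version {}", version);
    if let Some(hash) = git_hash {
        // This suffix is the part that leaks an unrelated repository's commit.
        created_by.push_str(&format!(" (build {})", hash));
    }
    created_by
}

/// Hypothetical helper: number of extra bytes the suffix adds to the metadata
/// string for a given hash.
fn extra_bytes(version: &str, git_hash: &str) -> usize {
    created_by_string(version, Some(git_hash)).len() - created_by_string(version, None).len()
}
```

With the SHA from my machine, `extra_bytes("5.0.0", "3ef76a677716df403a13964a58351abe37c1754d")` comes out to 49: 8 bytes for ` (build `, 40 for the hash, and 1 for the closing parenthesis. That matches the size mismatch in my failing test.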